bdgenomics.adam.ds.FragmentDataset

class bdgenomics.adam.ds.FragmentDataset(jvmDataset, sc)

Wraps a GenomicDataset with Fragment metadata and functions.

__init__(jvmDataset, sc)

Constructs a Python FragmentDataset from a JVM FragmentDataset. Should not be called from user code; instead, obtain instances through bdgenomics.adam.adamContext.ADAMContext.

Parameters:
	jvmDataset – Py4j handle to the underlying JVM FragmentDataset.
	sc (pyspark.context.SparkContext) – Active Spark Context.
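Because the constructor is not meant to be called directly, a typical workflow obtains a FragmentDataset from an ADAMContext and then chains the methods listed on this page. The following is a minimal sketch, not a definitive recipe. It assumes that the context class is importable as bdgenomics.adam.adamContext.ADAMContext, that it is constructed from an active SparkSession, and that it offers a loadFragments(filePath) method returning a FragmentDataset. The function name and both paths are hypothetical.

```python
def mark_duplicates_and_save(spark, in_path, out_path):
    """Load fragments, mark possible duplicates, and save the result as Parquet.

    `spark` is assumed to be an active pyspark.sql.SparkSession; `in_path`
    and `out_path` are hypothetical locations supplied by the caller.
    """
    # Imported lazily so the sketch can be defined without ADAM installed.
    # The import path is an assumption about the installed package layout.
    from bdgenomics.adam.adamContext import ADAMContext

    ac = ADAMContext(spark)
    fragments = ac.loadFragments(in_path)   # assumed to return a FragmentDataset
    deduped = fragments.markDuplicates()    # "Marks reads as possible fragment duplicates."
    deduped.save(out_path)                  # "Saves fragments to Parquet."
    return deduped
```

Calling the function requires a running Spark session with the ADAM JARs on the classpath; defining it does not.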

Methods

`__init__`(jvmDataset, sc)	Constructs a Python FragmentDataset from a JVM FragmentDataset.
`broadcastRegionJoin`(genomicDataset[, flankSize])	Performs a broadcast inner join between this genomic dataset and another genomic dataset.
`broadcastRegionJoinAndGroupByRight`(…[, …])	Performs a broadcast inner join between this genomic dataset and another genomic dataset.
`cache`()	Caches underlying RDD in memory.
`filterByOverlappingRegion`(query)	Runs a filter that selects data in the underlying RDD that overlaps a single genomic region.
`filterByOverlappingRegions`(querys)	Runs a filter that selects data in the underlying RDD that overlaps several genomic regions.
`fullOuterShuffleRegionJoin`(genomicDataset[, …])	Performs a sort-merge full outer join between this genomic dataset and another genomic dataset.
`leftOuterShuffleRegionJoin`(genomicDataset[, …])	Performs a sort-merge left outer join between this genomic dataset and another genomic dataset.
`leftOuterShuffleRegionJoinAndGroupByLeft`(…)	Performs a sort-merge left outer join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value.
`markDuplicates`()	Marks reads as possible fragment duplicates.
`persist`(sl)	Persists underlying RDD in memory or disk.
`pipe`(cmd, tFormatter, xFormatter, convFn[, …])	Pipes genomic data to a subprocess that runs in parallel using Spark.
`rightOuterBroadcastRegionJoin`(genomicDataset)	Performs a broadcast right outer join between this genomic dataset and another genomic dataset.
`rightOuterBroadcastRegionJoinAndGroupByRight`(…)	Performs a broadcast right outer join between this genomic dataset and another genomic dataset.
`rightOuterShuffleRegionJoin`(genomicDataset)	Performs a sort-merge right outer join between this genomic dataset and another genomic dataset.
`rightOuterShuffleRegionJoinAndGroupByLeft`(…)	Performs a sort-merge right outer join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value, if not null.
`save`(filePath)	Saves fragments to Parquet.
`shuffleRegionJoin`(genomicDataset[, flankSize])	Performs a sort-merge inner join between this genomic dataset and another genomic dataset.
`shuffleRegionJoinAndGroupByLeft`(genomicDataset)	Performs a sort-merge inner join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value.
`sort`()	Sorts our genome-aligned data by reference positions, with contigs ordered by index.
`sortLexicographically`()	Sorts our genome-aligned data by reference positions, with contigs ordered lexicographically.
`toAlignments`()	Splits up the reads in a Fragment back into alignments, and creates a new genomic dataset.
`toDF`()	Converts this GenomicDataset into a DataFrame.
`transform`(tFn)	Applies a function that transforms the underlying DataFrame into a new DataFrame using the Spark SQL API.
`transmute`(tFn, destClass[, convFn])	Applies a function that transmutes the underlying DataFrame into a new genomic dataset of a different type.
`union`(datasets)	Unions together multiple genomic datasets.
`unpersist`()	Unpersists underlying RDD from memory or disk.
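
The `sort` and `sortLexicographically` entries above differ only in how contigs are ordered. The plain-Python sketch below illustrates that difference and does not call ADAM. The contig names, the index mapping, and the positions are hypothetical stand-ins chosen for illustration; in ADAM the index order would come from the dataset's own metadata.

```python
# Hypothetical contig index (contig name -> position in the reference).
contig_index = {"chr1": 0, "chr2": 1, "chr10": 2}

# Hypothetical (contig, start) positions; not real data.
positions = [("chr10", 5), ("chr2", 40), ("chr1", 100), ("chr2", 7)]

# Analogue of sort(): order contigs by index, then by start position.
by_index = sorted(positions, key=lambda p: (contig_index[p[0]], p[1]))

# Analogue of sortLexicographically(): compare contig names as strings,
# then by start position. Note that "chr10" sorts before "chr2".
by_name = sorted(positions, key=lambda p: (p[0], p[1]))

print(by_index)  # [('chr1', 100), ('chr2', 7), ('chr2', 40), ('chr10', 5)]
print(by_name)   # [('chr1', 100), ('chr10', 5), ('chr2', 7), ('chr2', 40)]
```

Index ordering keeps contigs in reference order, while lexicographic ordering depends only on the contig names themselves.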