bdgenomics.adam.ds.GenomicDataset

class bdgenomics.adam.ds.GenomicDataset(jvmDataset, sc)

Wraps an RDD, DataFrame, or Dataset of genomic data with helpful metadata.

__init__(jvmDataset, sc)

Constructs a Python GenomicDataset from a JVM GenomicDataset. Do not call this from user code; only implementing classes should call it.

Parameters:
	jvmDataset – Py4j handle to the underlying JVM GenomicDataset.
	sc (pyspark.context.SparkContext) – Active Spark Context.

Methods

`__init__`(jvmDataset, sc)	Constructs a Python GenomicDataset from a JVM GenomicDataset.
`broadcastRegionJoin`(genomicDataset[, flankSize])	Performs a broadcast inner join between this genomic dataset and another genomic dataset.
`broadcastRegionJoinAndGroupByRight`(…[, …])	Performs a broadcast inner join between this genomic dataset and another genomic dataset.
`cache`()	Caches underlying RDD in memory.
`filterByOverlappingRegion`(query)	Runs a filter that selects data in the underlying RDD that overlaps a single genomic region.
`filterByOverlappingRegions`(querys)	Runs a filter that selects data in the underlying RDD that overlaps any of several genomic regions.
`fullOuterShuffleRegionJoin`(genomicDataset[, …])	Performs a sort-merge full outer join between this genomic dataset and another genomic dataset.
`leftOuterShuffleRegionJoin`(genomicDataset[, …])	Performs a sort-merge left outer join between this genomic dataset and another genomic dataset.
`leftOuterShuffleRegionJoinAndGroupByLeft`(…)	Performs a sort-merge left outer join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value.
`persist`(sl)	Persists underlying RDD in memory or disk.
`pipe`(cmd, tFormatter, xFormatter, convFn[, …])	Pipes genomic data to a subprocess that runs in parallel using Spark.
`rightOuterBroadcastRegionJoin`(genomicDataset)	Performs a broadcast right outer join between this genomic dataset and another genomic dataset.
`rightOuterBroadcastRegionJoinAndGroupByRight`(…)	Performs a broadcast right outer join between this genomic dataset and another genomic dataset.
`rightOuterShuffleRegionJoin`(genomicDataset)	Performs a sort-merge right outer join between this genomic dataset and another genomic dataset.
`rightOuterShuffleRegionJoinAndGroupByLeft`(…)	Performs a sort-merge right outer join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value, if not null.
`shuffleRegionJoin`(genomicDataset[, flankSize])	Performs a sort-merge inner join between this genomic dataset and another genomic dataset.
`shuffleRegionJoinAndGroupByLeft`(genomicDataset)	Performs a sort-merge inner join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value.
`sort`()	Sorts our genome-aligned data by reference position, with contigs ordered by index.
`sortLexicographically`()	Sorts our genome-aligned data by reference position, with contigs ordered lexicographically.
`toDF`()	Converts this GenomicDataset into a DataFrame.
`transform`(tFn)	Applies a function that transforms the underlying DataFrame into a new DataFrame using the Spark SQL API.
`transmute`(tFn, destClass[, convFn])	Applies a function that transmutes the underlying DataFrame into a new genomic dataset of a different type.
`union`(datasets)	Unions together multiple genomic datasets.
`unpersist`()	Unpersists underlying RDD from memory or disk.
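
The region joins and overlap filters listed above all rest on the same idea: two records match when their genomic regions overlap, optionally after widening by a flank. The following is a minimal pure-Python sketch of that idea, not the ADAM implementation. It does not use Spark or the bdgenomics package. The `Region` tuple, the helper functions, the half-open interval convention, and the exact handling of the flank boundary are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a genomic record: a half-open interval
# [start, end) on a named contig. Not an ADAM type.
Region = namedtuple("Region", ["contig", "start", "end"])


def overlaps(a, b, flank_size=0):
    """True if a and b are on the same contig and overlap once each
    comparison is widened by flank_size bases (assumed semantics)."""
    if a.contig != b.contig:
        return False
    return a.start < b.end + flank_size and b.start < a.end + flank_size


def inner_region_join(left, right, flank_size=0):
    """Keep only (left, right) pairs whose regions overlap."""
    return [(l, r) for l in left for r in right if overlaps(l, r, flank_size)]


def left_outer_region_join(left, right, flank_size=0):
    """Keep every left record; pair it with None when nothing overlaps."""
    joined = []
    for l in left:
        matches = [r for r in right if overlaps(l, r, flank_size)]
        if matches:
            joined.extend((l, r) for r in matches)
        else:
            joined.append((l, None))
    return joined


def filter_by_overlapping_regions(records, queries):
    """Keep records that overlap at least one query region."""
    return [rec for rec in records if any(overlaps(rec, q) for q in queries)]


# Illustrative data only.
reads = [
    Region("chr1", 100, 150),
    Region("chr1", 400, 450),
    Region("chr2", 100, 150),
]
features = [Region("chr1", 140, 200), Region("chr1", 460, 500)]

inner = inner_region_join(reads, features)
inner_flanked = inner_region_join(reads, features, flank_size=20)
left_outer = left_outer_region_join(reads, features)
```

With this data, the plain inner join pairs only the first read with the first feature. A flank of 20 bases also pairs the second read with the second feature, which starts 10 bases past the read's end. The left outer join keeps all three reads and fills in `None` for the two without a match. The nested loops here compare every pair and are only meant to show the semantics; the broadcast and sort-merge variants in the table are different distributed strategies for computing joins of this kind.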