bdgenomics.adam.ds.GenotypeDataset

class bdgenomics.adam.ds.GenotypeDataset(jvmDataset, sc)

Wraps a GenomicDataset with Genotype metadata and functions.

__init__(jvmDataset, sc)

Constructs a Python GenotypeDataset from a JVM GenotypeDataset. Do not call this from user code; obtain a GenotypeDataset through bdgenomics.adam.adamContext.ADAMContext instead.

Parameters:
jvmDataset – Py4j handle to the underlying JVM GenotypeDataset.
sc (pyspark.context.SparkContext) – Active Spark Context.
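Since the constructor is not meant for user code, a dataset is normally obtained from an ADAMContext. The following is a minimal sketch, not the library's documented recipe. It assumes that ADAMContext can be imported from bdgenomics.adam.adamContext, that it is constructed from a SparkSession, and that it exposes a loadGenotypes(path) method. The helper names load_genotypes and make_adam_context and the file path in the usage note are hypothetical.

```python
def make_adam_context(spark):
    """Build an ADAMContext from a SparkSession (assumed constructor form)."""
    # Imported lazily so this module can be imported without ADAM installed.
    from bdgenomics.adam.adamContext import ADAMContext
    return ADAMContext(spark)


def load_genotypes(adam_context, path):
    """Load genotypes through an ADAMContext-like object.

    adam_context is assumed to expose loadGenotypes(path). The return value
    is whatever that call produces (expected to be a GenotypeDataset).
    """
    return adam_context.loadGenotypes(path)
```

With a running SparkSession named spark, a call such as load_genotypes(make_adam_context(spark), "sample.genotypes.adam") would then be expected to return a GenotypeDataset without touching the constructor directly.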

Methods

__init__(jvmDataset, sc) Constructs a Python GenotypeDataset from a JVM GenotypeDataset.

addAllAlleleArrayFormatHeaderLine(name, …) Adds a VCF header line describing an ‘R’ array format field.

addAllAlleleArrayInfoHeaderLine(name, …) Adds a VCF header line describing an ‘R’ array info field.

addAlternateAlleleArrayFormatHeaderLine(…) Adds a VCF header line describing an ‘A’ array format field.

addAlternateAlleleArrayInfoHeaderLine(name, …) Adds a VCF header line describing an ‘A’ array info field.

addFilterHeaderLine(name, description) Adds a VCF header line describing a variant/genotype filter.

addFixedArrayFormatHeaderLine(name, count, …) Adds a VCF header line describing an array format field, with fixed count.

addFixedArrayInfoHeaderLine(name, count, …) Adds a VCF header line describing an array info field, with fixed count.

addGenotypeArrayFormatHeaderLine(name, …) Adds a VCF header line describing a ‘G’ array format field.

addScalarFormatHeaderLine(name, description, …) Adds a VCF header line describing a scalar format field.

addScalarInfoHeaderLine(name, description, …) Adds a VCF header line describing a scalar info field.

broadcastRegionJoin(genomicDataset[, flankSize]) Performs a broadcast inner join between this genomic dataset and another genomic dataset.

broadcastRegionJoinAndGroupByRight(…[, …]) Performs a broadcast inner join between this genomic dataset and another genomic dataset, followed by a groupBy on the right value.

cache() Caches underlying RDD in memory.

filterByOverlappingRegion(query) Runs a filter that selects data in the underlying RDD that overlaps a single genomic region.

filterByOverlappingRegions(querys) Runs a filter that selects data in the underlying RDD that overlaps several genomic regions.

fullOuterShuffleRegionJoin(genomicDataset[, …]) Performs a sort-merge full outer join between this genomic dataset and another genomic dataset.

leftOuterShuffleRegionJoin(genomicDataset[, …]) Performs a sort-merge left outer join between this genomic dataset and another genomic dataset.

leftOuterShuffleRegionJoinAndGroupByLeft(…) Performs a sort-merge left outer join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value.

persist(sl) Persists underlying RDD in memory or disk.

pipe(cmd, tFormatter, xFormatter, convFn[, …]) Pipes genomic data to a subprocess that runs in parallel using Spark.

rightOuterBroadcastRegionJoin(genomicDataset) Performs a broadcast right outer join between this genomic dataset and another genomic dataset.

rightOuterBroadcastRegionJoinAndGroupByRight(…) Performs a broadcast right outer join between this genomic dataset and another genomic dataset, followed by a groupBy on the right value.

rightOuterShuffleRegionJoin(genomicDataset) Performs a sort-merge right outer join between this genomic dataset and another genomic dataset.

rightOuterShuffleRegionJoinAndGroupByLeft(…) Performs a sort-merge right outer join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value, if not null.

saveAsParquet(filePath) Saves this genomic dataset of genotypes to disk as Parquet.

shuffleRegionJoin(genomicDataset[, flankSize]) Performs a sort-merge inner join between this genomic dataset and another genomic dataset.

shuffleRegionJoinAndGroupByLeft(genomicDataset) Performs a sort-merge inner join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value.

sort() Sorts our genome-aligned data by reference positions, with contigs ordered by index.

sortLexicographically() Sorts our genome-aligned data by reference positions, with contigs ordered lexicographically.

toDF() Converts this GenomicDataset into a DataFrame.

toVariantContexts() Converts these genotypes to variant contexts.

toVariants([dedupe]) Extracts the variants contained in this genomic dataset of genotypes.

transform(tFn) Applies a function that transforms the underlying DataFrame into a new DataFrame using the Spark SQL API.

transmute(tFn, destClass[, convFn]) Applies a function that transmutes the underlying DataFrame into a new genomic dataset of a different type.

union(datasets) Unions together multiple genomic datasets.

unpersist() Unpersists underlying RDD from memory or disk.
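Several of the methods above can be chained. The sketch below restricts genotypes to one region, optionally applies a DataFrame-level function through transform, and writes the result as Parquet. It assumes that filterByOverlappingRegion and transform each return a new dataset rather than modifying the receiver, and that the query object is whatever region type filterByOverlappingRegion accepts. The helper name genotypes_in_region and its parameters are hypothetical.

```python
def genotypes_in_region(genotypes, region, out_path, df_filter=None):
    """Filter a GenotypeDataset-like object to one region and save it.

    genotypes: object exposing filterByOverlappingRegion, transform and
               saveAsParquet, as listed in the method summary.
    region:    query object accepted by filterByOverlappingRegion.
    out_path:  destination passed to saveAsParquet.
    df_filter: optional DataFrame -> DataFrame function for transform().
    """
    # Assumption: each call returns a new dataset instead of mutating in place.
    selected = genotypes.filterByOverlappingRegion(region)
    if df_filter is not None:
        # transform() hands the underlying DataFrame to df_filter.
        selected = selected.transform(df_filter)
    selected.saveAsParquet(out_path)
    return selected
```

Passing the DataFrame function as an argument keeps the helper independent of the genotype schema; a caller might supply something like lambda df: df.limit(1000), which relies only on the standard Spark DataFrame API.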