bdgenomics.adam.ds.AlignmentDataset

class bdgenomics.adam.ds.AlignmentDataset(jvmDataset, sc)[source]

Wraps a GenomicDataset with alignment metadata and functions.

__init__(jvmDataset, sc)[source]

Constructs a Python AlignmentDataset from a JVM AlignmentDataset. Should not be called from user code; instead, go through bdgenomics.adam.adamContext.ADAMContext (see the example below).

Parameters:
  • jvmDataset – Py4j handle to the underlying JVM AlignmentDataset.
  • sc (pyspark.context.SparkContext) – Active Spark Context.

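As a minimal construction sketch (not part of the documented API surface above), assuming a recent ADAM build whose ADAMContext accepts the active SparkSession and a hypothetical input file sample.bam, user code obtains an AlignmentDataset like this:

    from pyspark.sql import SparkSession
    from bdgenomics.adam.adamContext import ADAMContext

    # Reuse or create the Spark session that ADAM runs on.
    spark = SparkSession.builder.appName("adam-alignments").getOrCreate()

    # Go through ADAMContext rather than calling AlignmentDataset(jvmDataset, sc) directly.
    ac = ADAMContext(spark)
    reads = ac.loadAlignments("sample.bam")  # hypothetical input path
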
Methods

__init__(jvmDataset, sc) Constructs a Python AlignmentDataset from a JVM AlignmentDataset.
broadcastRegionJoin(genomicDataset[, flankSize]) Performs a broadcast inner join between this genomic dataset and another genomic dataset.
broadcastRegionJoinAndGroupByRight(…[, …]) Performs a broadcast inner join between this genomic dataset and another genomic dataset.
cache() Caches underlying RDD in memory.
countKmers(kmerLength) Cuts reads into _k_-mers, and then counts the number of occurrences of each _k_-mer.
filterByOverlappingRegion(query) Runs a filter that selects data in the underlying RDD that overlaps a single genomic region.
filterByOverlappingRegions(querys) Runs a filter that selects data in the underlying RDD that overlaps several genomic regions.
flagStat() Runs a quality control pass akin to the Samtools FlagStat tool.
fullOuterShuffleRegionJoin(genomicDataset[, …]) Performs a sort-merge full outer join between this genomic dataset and another genomic dataset.
leftOuterShuffleRegionJoin(genomicDataset[, …]) Performs a sort-merge left outer join between this genomic dataset and another genomic dataset.
leftOuterShuffleRegionJoinAndGroupByLeft(…) Performs a sort-merge left outer join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value.
markDuplicates() Marks reads as possible fragment duplicates (see the worked example after this table).
persist(sl) Persists underlying RDD in memory or disk.
pipe(cmd, tFormatter, xFormatter, convFn[, …]) Pipes genomic data to a subprocess that runs in parallel using Spark.
realignIndels([isSorted, maxIndelSize, …]) Realigns indels using a consensus-based heuristic from reads.
realignIndelsFromKnownIndels(knownIndels[, …]) Realigns indels using a consensus-based heuristic from previously called indels.
reassembleReadPairs(secondPairRdd[, …]) Reassembles read pairs from two sets of unpaired reads.
recalibrateBaseQualities(knownSnps[, …]) Runs base quality score recalibration on a set of reads.
rightOuterBroadcastRegionJoin(genomicDataset) Performs a broadcast right outer join between this genomic dataset and another genomic dataset.
rightOuterBroadcastRegionJoinAndGroupByRight(…) Performs a broadcast right outer join between this genomic dataset and another genomic dataset.
rightOuterShuffleRegionJoin(genomicDataset) Performs a sort-merge right outer join between this genomic dataset and another genomic dataset.
rightOuterShuffleRegionJoinAndGroupByLeft(…) Performs a sort-merge right outer join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value, if not null.
save(filePath[, isSorted]) Saves this genomic dataset to disk, with the type identified by the extension.
saveAsFastq(fileName[, …]) Saves reads in FASTQ format.
saveAsPairedFastq(fileName1, fileName2, …) Saves these Alignments to two FASTQ files.
saveAsSam(filePath[, asType, isSorted, …]) Saves this genomic dataset to disk as a SAM/BAM/CRAM file.
saveAsSamString() Converts a genomic dataset into the SAM spec string it represents.
shuffleRegionJoin(genomicDataset[, flankSize]) Performs a sort-merge inner join between this genomic dataset and another genomic dataset.
shuffleRegionJoinAndGroupByLeft(genomicDataset) Performs a sort-merge inner join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value.
sort() Sorts our genome aligned data by reference positions, with contigs ordered by index.
sortByReadName() Sorts our alignments by read name.
sortByReferencePosition() Sorts our alignments by reference position, with references ordered by name.
sortByReferencePositionAndIndex() Sorts our alignments by reference position, with references ordered by index.
sortLexicographically() Sorts our genome aligned data by reference positions, with contigs ordered lexicographically.
toCoverage([collapse]) Converts this set of reads into a corresponding CoverageDataset.
toDF() Converts this GenomicDataset into a DataFrame.
toFragments() Converts this set of reads into fragments.
transform(tFn) Applies a function that transforms the underlying DataFrame into a new DataFrame using the Spark SQL API (see the DataFrame sketch after this table).
transmute(tFn, destClass[, convFn]) Applies a function that transmutes the underlying DataFrame into a new genomic dataset of a different type.
union(datasets) Unions together multiple genomic datasets.
unpersist() Unpersists underlying RDD from memory or disk.
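As a worked sketch of a common read-processing pass using the methods above: the reads variable comes from the ADAMContext example earlier, the output path is hypothetical, and keyword usage follows the signatures listed in the table.

    # `reads` is an AlignmentDataset, e.g. from ADAMContext.loadAlignments above.
    deduped = reads.markDuplicates()                   # flag likely fragment duplicates
    by_position = deduped.sortByReferencePosition()    # order by reference name, then position
    by_position.save("sample.dedup.bam", isSorted=True)  # hypothetical output path; extension selects BAM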
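The Spark SQL hooks (toDF, transform) allow lightweight per-record filtering without leaving the genomic dataset. A sketch, assuming the Alignment schema exposes a mappingQuality column (the column name is an assumption, not taken from this page):

    # transform() takes a DataFrame-to-DataFrame function and returns a new
    # AlignmentDataset backed by the transformed frame.
    high_quality = reads.transform(lambda df: df.filter(df.mappingQuality >= 30))

    # toDF() drops down to a plain PySpark DataFrame for ad hoc inspection.
    print(high_quality.toDF().count())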