bdgenomics.adam.ds.AlignmentDataset

class bdgenomics.adam.ds.AlignmentDataset(jvmDataset, sc)[source]

Wraps a GenomicDataset with alignment metadata and functions.

__init__(jvmDataset, sc)[source]

Constructs a Python AlignmentDataset from a JVM AlignmentDataset. Should not be called from user code; instead, go through bdgenomics.adam.adamContext.ADAMContext (see the example below).

Parameters:
  • jvmDataset – Py4j handle to the underlying JVM AlignmentDataset.
  • sc (pyspark.context.SparkContext) – Active Spark Context.

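As a minimal construction sketch (not part of the documented API surface above), assuming a recent ADAM build whose ADAMContext accepts the active SparkSession and a hypothetical input file sample.bam, user code obtains an AlignmentDataset like this:

    from pyspark.sql import SparkSession
    from bdgenomics.adam.adamContext import ADAMContext

    # Reuse or create the Spark session that ADAM runs on.
    spark = SparkSession.builder.appName("adam-alignments").getOrCreate()

    # Go through ADAMContext rather than calling AlignmentDataset(jvmDataset, sc) directly.
    ac = ADAMContext(spark)
    reads = ac.loadAlignments("sample.bam")  # hypothetical input path
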
Methods

__init__(jvmDataset, sc) Constructs a Python AlignmentDataset from a JVM AlignmentDataset.
broadcastRegionJoin(genomicDataset[, flankSize]) Performs a broadcast inner join between this genomic dataset and another genomic dataset.
broadcastRegionJoinAndGroupByRight(…[, …]) Performs a broadcast inner join between this genomic dataset and another genomic dataset.
cache() Caches underlying RDD in memory.
countKmers(kmerLength) Cuts reads into _k_-mers, and then counts the number of occurrences of each _k_-mer.
filterByOverlappingRegion(query) Runs a filter that selects data in the underlying RDD that overlaps a single genomic region.
filterByOverlappingRegions(querys) Runs a filter that selects data in the underlying RDD that overlaps several genomic regions.
flagStat() Runs a quality control pass akin to the Samtools FlagStat tool.
fullOuterShuffleRegionJoin(genomicDataset[, …]) Performs a sort-merge full outer join between this genomic dataset and another genomic dataset.
leftOuterShuffleRegionJoin(genomicDataset[, …]) Performs a sort-merge left outer join between this genomic dataset and another genomic dataset.
leftOuterShuffleRegionJoinAndGroupByLeft(…) Performs a sort-merge left outer join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value.
markDuplicates() Marks reads as possible fragment duplicates (see the worked example after this table).
persist(sl) Persists underlying RDD in memory or disk.
pipe(cmd, tFormatter, xFormatter, convFn[, …]) Pipes genomic data to a subprocess that runs in parallel using Spark.
realignIndels([isSorted, maxIndelSize, …]) Realigns indels using a consensus-based heuristic from reads.
realignIndelsFromKnownIndels(knownIndels[, …]) Realigns indels using a consensus-based heuristic from previously called indels.
reassembleReadPairs(secondPairRdd[, …]) Reassembles read pairs from two sets of unpaired reads.
recalibrateBaseQualities(knownSnps[, …]) Runs base quality score recalibration on a set of reads.
rightOuterBroadcastRegionJoin(genomicDataset) Performs a broadcast right outer join between this genomic dataset and another genomic dataset.
rightOuterBroadcastRegionJoinAndGroupByRight(…) Performs a broadcast right outer join between this genomic dataset and another genomic dataset.
rightOuterShuffleRegionJoin(genomicDataset) Performs a sort-merge right outer join between this genomic dataset and another genomic dataset.
rightOuterShuffleRegionJoinAndGroupByLeft(…) Performs a sort-merge right outer join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value, if not null.
save(filePath[, isSorted]) Saves this genomic dataset to disk, with the type identified by the extension.
saveAsFastq(fileName[, …]) Saves reads in FASTQ format.
saveAsPairedFastq(fileName1, fileName2, …) Saves these Alignments to two FASTQ files.
saveAsSam(filePath[, asType, isSorted, …]) Saves this genomic dataset to disk as a SAM/BAM/CRAM file.
saveAsSamString() Converts a genomic dataset into the SAM spec string it represents.
shuffleRegionJoin(genomicDataset[, flankSize]) Performs a sort-merge inner join between this genomic dataset and another genomic dataset.
shuffleRegionJoinAndGroupByLeft(genomicDataset) Performs a sort-merge inner join between this genomic dataset and another genomic dataset, followed by a groupBy on the left value.
sort() Sorts our genome aligned data by reference positions, with contigs ordered by index.
sortByReadName() Sorts our alignments by read name.
sortByReferencePosition() Sorts our alignments by reference position, with references ordered by name.
sortByReferencePositionAndIndex() Sorts our alignments by reference position, with references ordered by index.
sortLexicographically() Sorts our genome aligned data by reference positions, with contigs ordered lexicographically.
toCoverage([collapse]) Converts this set of reads into a corresponding CoverageDataset.
toDF() Converts this GenomicDataset into a DataFrame.
toFragments() Converts this set of reads into fragments.
transform(tFn) Applies a function that transforms the underlying DataFrame into a new DataFrame using the Spark SQL API (see the DataFrame sketch after this table).
transmute(tFn, destClass[, convFn]) Applies a function that transmutes the underlying DataFrame into a new genomic dataset of a different type.
union(datasets) Unions together multiple genomic datasets.
unpersist() Unpersists underlying RDD from memory or disk.
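As a worked sketch of a common read-processing pass using the methods above: the reads variable comes from the ADAMContext example earlier, the output path is hypothetical, and keyword usage follows the signatures listed in the table.

    # `reads` is an AlignmentDataset, e.g. from ADAMContext.loadAlignments above.
    deduped = reads.markDuplicates()                   # flag likely fragment duplicates
    by_position = deduped.sortByReferencePosition()    # order by reference name, then position
    by_position.save("sample.dedup.bam", isSorted=True)  # hypothetical output path; extension selects BAM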
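The Spark SQL hooks (toDF, transform) allow lightweight per-record filtering without leaving the genomic dataset. A sketch, assuming the Alignment schema exposes a mappingQuality column (the column name is an assumption, not taken from this page):

    # transform() takes a DataFrame-to-DataFrame function and returns a new
    # AlignmentDataset backed by the transformed frame.
    high_quality = reads.transform(lambda df: df.filter(df.mappingQuality >= 30))

    # toDF() drops down to a plain PySpark DataFrame for ad hoc inspection.
    print(high_quality.toDF().count())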