Working with genomic data using GenomicDatasets¶
As described in the section on using the
ADAMContext, ADAM loads genomic data into a
GenomicDataset which is specialized for each datatype. This
GenomicDataset wraps Apache Spark’s Resilient Distributed Dataset (RDD,
(Zaharia et al. 2012)) API with genomic metadata. The RDD
abstraction presents an array of data which is distributed across a
cluster. RDDs are backed by a computational lineage, which allows
them to be recomputed if a node fails and the results of a computation
are lost. RDDs are processed by running functional
transformations across the whole dataset.
Around an RDD, ADAM adds metadata which describes the genome,
samples, or read group that a dataset came from. Specifically, ADAM
supports the following metadata:
GenomicDatasetbase: A sequence dictionary, which describes the reference assembly that data are aligned to, if it is aligned. Applies to all types.MultisampleGenomicDataset: Adds metadata about the samples in a dataset. Applies toGenotypeDataset.ReadGroupGenomicDataset: Adds metadata about the read groups attached to a dataset. Applies toAlignmentDatasetandFragmentDataset.
Additionally, GenotypeDataset, VariantDataset, and VariantContextDataset
store the VCF header lines attached to the original file, to enable a
round trip between Parquet and VCF.
GenomicDatasets can be transformed several ways. These include:
- The core preprocessing algorithms in ADAM:
- Reads:
- Reads to coverage
- Recalibrate base qualities
- INDEL realignment
- Mark duplicate reads
- Fragments:
- Genomic dataset transformations
- Spark SQL transformations
- By using ADAM to pipe out to another tool
Transforming GenomicDatasets¶
Although GenomicDatasets do not extend Apache Spark’s RDD class,
RDD operations can be performed on them using the transform
method. Currently, we only support RDD to RDD transformations
that keep the same type as the base type of the GenomicDataset. To apply
an RDD transform, use the transform method, which takes a
function mapping one RDD of the base type into another RDD of
the base type. For example, we could use transform on an
AlignmentDataset to filter out reads that have a low mapping
quality, but we cannot use transform to translate those reads into
Features showing the genomic locations covered by reads.
If we want to transform a GenomicDataset into a new GenomicDataset that
contains a different datatype (e.g., reads to features), we can instead
use the transmute function. The transmute function takes a
function that transforms an RDD of the type of the first
GenomicDataset into a new RDD that contains records of the type of
the second GenomicDataset. Additionally, it takes an implicit function
that maps the metadata in the first GenomicDataset into the metadata
needed by the second GenomicDataset. This is akin to the implicit
function required by the pipe API. As an example, let us
use the transmute function to make features corresponding to reads
containing INDELs:
// pick up implicits from ADAMContext
import org.bdgenomics.adam.ds.ADAMContext._
val reads = sc.loadAlignments("path/to/my/reads.adam")
// the type of the transmuted RDD normally needs to be specified
// import the FeatureDataset, which is the output type
import org.bdgenomics.adam.ds.feature.FeatureDataset
import org.bdgenomics.formats.avro.Feature
val features: FeatureDataset = reads.transmute(rdd => {
rdd.filter(r => {
// does the CIGAR for this read contain an I or a D?
Option(r.getCigar)
.exists(c => c.contains("I") || c.contains("D"))
}).map(r => {
Feature.newBuilder
.setContigName(r.getContigName)
.setStart(r.getStart)
.setEnd(r.getEnd)
.build
})
})
ADAMContext provides the implicit functions needed to run the
transmute function between all GenomicDatasets contained within
the org.bdgenomics.adam.ds package hierarchy. Any custom
GenomicDataset can be supported by providing a user defined conversion
function.
Transforming GenomicDatasets via Spark SQL¶
Spark SQL introduced the strongly-typed Dataset API in Spark 1.6.0. This API supports seamless translation between the RDD API and a strongly typed DataFrame style API. While Spark SQL supports many types of encoders for translating data from an RDD into a Dataset, no encoders support the Avro models used by ADAM to describe our genomic schemas. In spite of this, Spark SQL is highly desirable because it has a more efficient execution engine than the Spark RDD APIs, which can lead to substantial speedups for certain queries.
To resolve this, we added an adam-codegen package that generates
Spark SQL compatible classes representing the ADAM schemas. These
classes are available in the org.bdgenomics.adam.sql package. All
Avro-backed GenomicDatasets now support translation to Datasets via the
dataset field, and transformation via the Spark SQL APIs through the
transformDataset method. As an optimization, we lazily choose either
the RDD or Dataset API depending on the calculation being performed. For
example, if one were to load a Parquet file of reads, we would not
decide to load the Parquet file as an RDD or a Dataset until we saw your
query. If you were to load the reads from Parquet and then were to
immediately run a transformDataset call, it would be more efficient
to load the data directly using the Spark SQL APIs, instead of loading
the data as an RDD, and then transforming that RDD into a SQL Dataset.
The functionality of the adam-codegen package is simple. The goal of
this package is to take ADAM’s Avro schemas and to remap them into
classes that implement Scala’s Product interface, and which have a
specific style of constructor that is expected by Spark SQL.
Additionally, we define functions that translate between these Product
classes and the bdg-formats Avro models. Parquet files written with
either the Product classes and Spark SQL Parquet writer or the Avro
classes and the RDD/ParquetAvroOutputFormat are equivalent and can be
read through either API. However, to support this, we must explicitly
set the requested schema on read when loading data through the RDD read
path. This is because Spark SQL writes a Parquet schema that is
equivalent but not strictly identical to the Parquet schema that the
Avro/RDD write path writes. If the schema is not set, then schema
validation on read fails. If reading data using the
ADAMContext APIs, this is handled properly; this is
an implementation note necessary only for those bypassing the ADAM APIs.
Similar to transform/transformDataset, there exists a
transmuteDataset function that enables transformations between
GenomicDatasets of different types.
Using partitioned Parquet to speed up range based queries¶
GenomicDatasets of types AlignmentDataset, GenotypeDataset,
VariantDataset, and NucleotideFragmentContigDataset can be written as Parquet
using a Hive-style hierarchical directory scheme that is based on contig and
genomic position. This partitioning reduces the latency of genomic range
queries against these datasets, which is particularly important for interactive
applications such as a genomic browser backed by an ADAM dataset.
The GenomicDataset function
GenomicDataset.filterByOverlappingRegions(queryRegionsList) builds a Spark SQL
query that uses this partitioning scheme. This can reduce latencies by more
than 20x when repeatedly querying a datset with genomic range filters.
On a high coverage alignment dataset, this partitioning strategy improved
latency from 1-2 minutes to 1-3 seconds when looking up genomic ranges.
Saving partitioned parquet files to disk
A GenomicDataset can be written to disk as a partitioned Parquet dataset with the
GenomicDataset function saveAsPartitionedParquet. The optional
partitionSize parameter defines the width in base pairs of the partitions
within each contig.
data.saveAsPartitionedParquet("dataset1.adam", partitionSize = 2000000)
A partitioned dataset can also be created from an input Parquet or SAM/BAM/CRAM
file using the ADAM transformAlignments CLI, or Parquet/VCF files using the
ADAM transformGenotypes CLI.
Loading partitioned parquet files
A GenomicDataset can be loaded from a partitioned Parquet dataset using the
ADAMContext function loadPartitionedParquet[*] specific to each data type
such as loadPartitionedParquetAlignments.
Layout of Partitioned Parquet directory
An ADAM partitioned Parquet dataset is written as a three level directory hierarchy.
The outermost directory is the name of the dataset and contains metadata files as is
found in unpartitioned ADAM Parquet datasets. Within the outer dataset directory,
subdirectories are created with names based on the genomic contigs found
in the dataset, for example a subdirectory will be named contigName=22 for chromosome 22.
Within each contigName directory, there are subdirectories named using a computed value
positionBin, for example a subdirectory named positionBin=5. Records from the
dataset are written into Parquet files within the appropriate positionBin directory, computed
based on the start position of the record using the calculation floor(start / partitionSize).
For example, when using the default partitionSize of 1,000,000 base pairs, an
alignment with start position 20,100,000 on chromosome 22 would be found in a
Parquet file at the path mydataset.adam/contigName=22/positionBin=20. The splitting
of data into one or more Parquet fields in these leaf directories is automatic based on
Parquet block size settings.
mySavedAdamDataset.adam
|
|-- _partitionedByStartPos
L-- contigName=1
L-- positionBin=0
|-- part-r-00001.parquet
+-- part-r-00002.parquet
L-- positionBin=1
|-- part-r-00003.parquet
|-- part-r-00004.parquet
L-- positionBin= ( N bins ...)
L-- contigName= ( N contigs ... )
|-- (N bins ... )
The existence of the file _partitionedByStartPos can be tested with the public
function ADAMContext.isPartitioned(path) and can be used to determine explicitly
if an ADAM Parquet dataset is partitioned using this scheme. The partition size which was used
when the dataset was written to disk is stored in _partitionedByStartPos and is
read in as a property of the dataset by the loadPartitionedParquet functions.
The Spark dataset API recognizes that the field positionBin is defined implicitly
by the Parquet files’ partitioning scheme, and makes positionBin available as a field
that can be queried through the Spark SQL API. positionBin is used internally by
the public function GenomicRDD.filterByOverlappingRegions. User code in ADAM-shell or user applications
could similarly utilize the positionBin field when creating Spark
SQL queries on a genomicDataset.dataset backed by partitioned Parquet.
Re-using a previously loaded partitioned dataset:
When a partitioned dataset is first created within an ADAM session, a partition
discovery/initialization step is performed that can take several minutes for large datasets.
The original GenomicDataset object can then be re-used multiple times as the parent
of different filtration and processing transformations and actions, without incurring
this initializiation cost again. Thus, re-use of a parent partitioned GenomicDataset
is key to realizing the latency advantages of partitioned datasets described above.
val mydata = loadPartitionedParquetAlignments("alignmets.adam")
val filteredCount1 = mydata.filterByOverlappingRegions(regions1).dataset.count
val filteredCount2 = mydata.filterByOverlappingRegions(regions2).dataset.count