Loading data with the ADAMContext
The `ADAMContext` is the main entrypoint to using ADAM. The `ADAMContext` wraps an existing `SparkContext` to provide methods for loading genomic data. In Scala, we provide an implicit conversion from a `SparkContext` to an `ADAMContext`. To use this, import the implicit, and call an `ADAMContext` method:
```scala
import org.apache.spark.SparkContext
// the ._ at the end imports the implicit from the ADAMContext companion object
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.read.AlignmentRecordRDD

def loadReads(filePath: String, sc: SparkContext): AlignmentRecordRDD = {
  sc.loadAlignments(filePath)
}
```
In Java, instantiate a JavaADAMContext, which wraps an ADAMContext:
```java
import org.apache.spark.api.java.JavaSparkContext;
import org.bdgenomics.adam.apis.java.JavaADAMContext;
import org.bdgenomics.adam.rdd.ADAMContext;
import org.bdgenomics.adam.rdd.read.AlignmentRecordRDD;

class LoadReads {

  public static AlignmentRecordRDD loadReads(String filePath,
                                             JavaSparkContext jsc) {
    // create an ADAMContext first
    ADAMContext ac = new ADAMContext(jsc.sc());

    // then wrap that in a JavaADAMContext
    JavaADAMContext jac = new JavaADAMContext(ac);

    return jac.loadAlignments(filePath);
  }
}
```
From Python, instantiate an ADAMContext, which wraps a SparkContext:
```python
from bdgenomics.adam.adamContext import ADAMContext

ac = ADAMContext(sc)

reads = ac.loadAlignments("my/read/file.adam")
```
With an `ADAMContext`, you can load:

- Single reads as an `AlignmentRecordRDD`:
  - From SAM/BAM/CRAM using `loadBam` (Scala only)
  - Selected regions from an indexed BAM/CRAM using `loadIndexedBam` (Scala only)
  - From FASTQ using `loadFastq`, `loadPairedFastq`, and `loadUnpairedFastq` (Scala only)
  - From Parquet using `loadParquetAlignments` (Scala only)
  - The `loadAlignments` method will load from any of the above formats, and will autodetect the underlying format (Scala, Java, Python, and R; also supports loading reads from FASTA)
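The format autodetection that `loadAlignments` performs can be pictured as dispatch on the file suffix. The sketch below is a hypothetical illustration in plain Python, not ADAM's actual implementation; it only mirrors the format-to-method mapping in the list above:

```python
# Hypothetical sketch of extension-based format dispatch, mirroring the
# formats that loadAlignments autodetects. Not ADAM's real code.

def detect_alignment_format(path):
    """Map a file path to the loader method that would handle it."""
    lowered = path.lower()
    if lowered.endswith((".sam", ".bam", ".cram")):
        return "loadBam"
    if lowered.endswith((".fq", ".fastq", ".fq.gz", ".fastq.gz")):
        return "loadFastq"
    if lowered.endswith((".fa", ".fasta")):
        return "loadFasta"
    # otherwise, assume a Parquet directory of alignment records
    return "loadParquetAlignments"

print(detect_alignment_format("reads.bam"))         # loadBam
print(detect_alignment_format("reads_1.fastq.gz"))  # loadFastq
print(detect_alignment_format("reads.adam"))        # loadParquetAlignments
```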
- Paired reads as a `FragmentRDD`:
  - From interleaved FASTQ using `loadInterleavedFastqAsFragments` (Scala only)
  - From Parquet using `loadParquetFragments` (Scala only)
  - The `loadFragments` method will load from either of the above formats, as well as SAM/BAM/CRAM, and will autodetect the underlying file format. If the file is a SAM/BAM/CRAM file and the file is queryname sorted, the data will be converted to fragments without performing a shuffle. (Scala, Java, Python, and R)
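The note about queryname-sorted files deserves a word: when reads sharing a name are adjacent, they can be grouped into fragments in a single streaming pass, so no shuffle is needed. A minimal plain-Python sketch of that grouping (the read tuples here are hypothetical stand-ins, not ADAM's `AlignmentRecord` type):

```python
from itertools import groupby

# Hypothetical reads: (read_name, sequence), already queryname sorted.
reads = [
    ("frag1", "ACGT"),
    ("frag1", "TTAC"),
    ("frag2", "GGCA"),
    ("frag3", "CATG"),
    ("frag3", "ATTA"),
]

# Because reads with the same name are adjacent, one linear pass groups
# them into fragments; no re-sorting or repartitioning is required.
fragments = [
    (name, [seq for _, seq in group])
    for name, group in groupby(reads, key=lambda r: r[0])
]

print(fragments)
# [('frag1', ['ACGT', 'TTAC']), ('frag2', ['GGCA']), ('frag3', ['CATG', 'ATTA'])]
```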
- VCF lines as a `VariantContextRDD` from VCF/BCF1 using `loadVcf` (Scala only)
- Selected lines from a tabix indexed VCF using `loadIndexedVcf` (Scala only)
- Genotypes as a `GenotypeRDD`:
  - From Parquet using `loadParquetGenotypes` (Scala only)
  - From either Parquet or VCF/BCF1 using `loadGenotypes` (Scala, Java, Python, and R)
- Variants as a `VariantRDD`:
  - From Parquet using `loadParquetVariants` (Scala only)
  - From either Parquet or VCF/BCF1 using `loadVariants` (Scala, Java, Python, and R)
- Genomic features as a `FeatureRDD`:
  - From BED using `loadBed` (Scala only)
  - From GFF3 using `loadGff3` (Scala only)
  - From GFF2/GTF using `loadGtf` (Scala only)
  - From NarrowPeak using `loadNarrowPeak` (Scala only)
  - From IntervalList using `loadIntervalList` (Scala only)
  - From Parquet using `loadParquetFeatures` (Scala only)
  - Autodetected from any of the above using `loadFeatures` (Scala, Java, Python, and R)
- Fragmented contig sequence as a `NucleotideContigFragmentRDD`:
  - From FASTA with `loadFasta` (Scala only)
  - From Parquet with `loadParquetContigFragments` (Scala only)
  - Autodetected from either of the above using `loadSequences` (Scala, Java, Python, and R)
- Coverage data as a `CoverageRDD`:
  - From Parquet using `loadParquetCoverage` (Scala only)
  - From Parquet or any of the feature file formats using `loadCoverage` (Scala only)
- Contig sequence as a broadcastable `ReferenceFile` using `loadReferenceFile`, which supports 2bit files, FASTA, and Parquet (Scala only)
The methods labeled “Scala only” may be usable from Java, but may not be convenient to use.
The `JavaADAMContext` class provides Java-friendly methods that are equivalent to the `ADAMContext` methods. Specifically, these methods use Java types, and do not make use of default parameters. In addition to the load/save methods described above, the `ADAMContext` adds the implicit methods needed for using ADAM's pipe API.