Deploying ADAM on AWS

Input and Output data on HDFS and S3

Apache Spark requires a file system, such as HDFS or a network file mount, that all machines can access.

The typical flow of data to and from your ADAM application on EC2 will be (a sketch of the corresponding commands follows the list):

  • Upload data to AWS S3
  • Transfer from S3 to the HDFS on your cluster
  • Compute with ADAM, write output to HDFS
  • Copy data you wish to persist for later use to S3
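
A sketch of this round trip follows. It assumes the AWS CLI is installed and that hadoop distcp over the s3a:// scheme has the S3a configuration described later in this section; bucket and path names are placeholders, and either of the bulk-transfer tools discussed below can stand in for the distcp steps:

# 1. Upload input data to AWS S3
aws s3 cp sample1.bam s3://my_bucket/reads/sample1.bam

# 2. Transfer from S3 to the HDFS on your cluster
hadoop distcp s3a://my_bucket/reads/sample1.bam hdfs://spark-master/work_dir/

# 3. Compute with ADAM, write output to HDFS (see the adam-submit examples below)
adam-submit transformAlignments hdfs://spark-master/work_dir/sample1.bam hdfs://spark-master/work_dir/sample1.adam

# 4. Copy data you wish to persist for later use back to S3
hadoop distcp hdfs://spark-master/work_dir/sample1.adam s3a://my_bucket/results/sample1.adam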

For small test files, you may wish to skip S3 by uploading directly to spark-master using scp and then copying to HDFS using:

hadoop fs -put sample1.bam /datadir/
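
The initial scp upload, with a placeholder user name and the spark-master host from the surrounding examples, might look like:

# copy the test file from your workstation to the Spark master node
scp sample1.bam ec2-user@spark-master:~/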

From the ADAM shell, or as a parameter to adam-submit, you would refer to HDFS URLs like this:

adam-submit \
  transformAlignments \
  hdfs://spark-master/work_dir/sample1.bam \
  hdfs://spark-master/work_dir/sample1.adam

Bulk Transfer between HDFS and S3

To transfer large amounts of data between S3 and HDFS, we suggest using Conductor. It is also possible to use AWS S3 directly as a distributed file system, but with some loss of performance.

Conductor currently does not support uploading to S3 the Parquet directories of Apache Avro records that ADAM writes out. For uploads from HDFS to S3, we suggest using s3-dist-cp.
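
As a sketch, with placeholder paths and assuming s3-dist-cp is available on the cluster (it ships with Amazon EMR), uploading an ADAM Parquet output directory from HDFS to S3 might look like:

s3-dist-cp \
  --src hdfs://spark-master/work_dir/sample1.adam \
  --dest s3://my_bucket/sample1.adam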

Directly accessing data stored in S3

To directly access data stored in S3, we can leverage one of the Hadoop FileSystem API implementations that access S3. Specifically, we recommend using the S3a file system. To do this, you will need to configure your Spark job to use S3a. If you are using a vendor-supported Spark distribution like Amazon EMR or Databricks, your Spark distribution may already have the S3a file system installed. If not, you will need to add JARs that contain the classes needed to support the S3a file system. For most Spark distributions built for Apache Hadoop 2.6 or higher, you will need to add the following dependencies:

  • org.apache.hadoop:hadoop-aws, which provides the S3a file system implementation
  • com.amazonaws:aws-java-sdk (referenced below through the aws-java-sdk-pom artifact), which hadoop-aws depends on

Instead of downloading these JARs, you can ask Spark to install them at runtime using the --packages flag. Additionally, if you are using the S3a file system, your file paths will need to begin with the s3a:// scheme:

adam-submit \
  --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.4 \
  -- \
  transformAlignments \
  s3a://my_bucket/my_reads.adam \
  s3a://my_bucket/my_new_reads.adam

If you are loading a BAM, CRAM, or VCF file, you will need to add an additional JAR. This is because the code that loads data stored in these file formats uses Java’s nio package to read index files. Java’s nio system allows users to specify a “file system provider,” which implements nio’s file system operations on non-POSIX file systems like HDFS or S3. To use these file formats with the s3a:// scheme, you will need a dependency that provides an nio file system provider backed by S3a.

You will need to do this even if you are not using the index for said format.
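
As a sketch, assuming the needed provider is the jsr203-s3a file system provider published under the net.fnothaft:jsr203-s3a coordinates (an assumption; use the release that matches your Hadoop and AWS SDK versions), it can be appended to the same --packages list:

# note: the net.fnothaft:jsr203-s3a coordinates and version are an assumption;
# substitute the release matching your Hadoop and AWS SDK versions
adam-submit \
  --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.4,net.fnothaft:jsr203-s3a:0.0.2 \
  -- \
  transformAlignments \
  s3a://my_bucket/my_reads.bam \
  s3a://my_bucket/my_new_reads.adam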