Deploying ADAM¶
Running ADAM on AWS EC2 using CGCloud¶
CGCloud provides an automated means to create a cluster on EC2 for use with ADAM.
Prior to following these instructions, make sure you have set up your AWS account and know your AWS access keys. See https://aws.amazon.com/ for details.
Configure CGCloud¶
Begin by reading the CGCloud readme.
Next, configure CGCloud core and then install the CGCloud Spark plugin. When installing, replace the two steps
pip install cgcloud-core
pip install cgcloud-spark
with the single command:
pip install cgcloud-spark==1.6.0
which will install the correct version of both cgcloud-core and cgcloud-spark.
Note that the steps to register your SSH key and create the template boxes only need to be done once.
cgcloud register-key ~/.ssh/id_rsa.pub
cgcloud create generic-ubuntu-trusty-box
cgcloud create -IT spark-box
Launch a cluster¶
Spin up a Spark cluster named cluster1 with one leader and two worker nodes of instance type m3.large with the command:
cgcloud create-cluster spark -c cluster1 -s 2 -t m3.large
Once the cluster is running, you can SSH to spark-master with the command:
cgcloud ssh -c cluster1 spark-master
Spark is already installed on the spark-master machine and on the workers. Test it by starting a spark-shell and then exiting:
spark-shell
exit()
Install ADAM¶
To use the ADAM application on top of Spark, we need to download and install ADAM on spark-master. From the command line on spark-master, download an ADAM release. As of this writing, CGCloud supports Spark 1.6.2, not Spark 2.x, so download the Spark 1.x, Scala 2.10 release:
wget https://repo1.maven.org/maven2/org/bdgenomics/adam/adam-distribution_2.10/0.20.0/adam-distribution_2.10-0.20.0-bin.tar.gz
tar -xzvf adam-distribution_2.10-0.20.0-bin.tar.gz
You can now run ./bin/adam-submit and ./bin/adam-shell using your EC2 cluster.
Input and Output data on HDFS and S3¶
Spark requires a file system, such as HDFS or a network file mount, that all machines can access. The CGCloud EC2 Spark cluster you just created is already running HDFS.
The typical flow of data to and from your ADAM application on EC2 will be:
- Upload data to AWS S3
- Transfer from S3 to the HDFS on your cluster
- Compute with ADAM, write output to HDFS
- Copy data you wish to persist for later use to S3
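The four steps above can be sketched as a short script. This is a minimal sketch under assumptions, not a prescribed workflow: the bucket name my_bucket, the file names, and the /work_dir HDFS directory are placeholders; the aws s3 cp calls assume the AWS CLI is installed and configured wherever they run, including on spark-master; and the transfers are staged through the local disk of spark-master, which is only practical for modest data sizes. By default the script prints each command; set DRY_RUN=0 to execute them.

```shell
# Sketch of the S3 -> HDFS -> ADAM -> S3 flow. All names are placeholders.
BUCKET="my_bucket"                       # hypothetical S3 bucket
HDFS_DIR="hdfs://spark-master/work_dir"  # HDFS directory on the cluster

# Print each command by default; set DRY_RUN=0 to run it instead.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# 1. Upload data to S3 (run on the machine that holds the input file).
run aws s3 cp sample1.bam "s3://$BUCKET/sample1.bam"

# 2. Transfer from S3 to HDFS (run on spark-master, via its local disk).
run aws s3 cp "s3://$BUCKET/sample1.bam" sample1.bam
run hadoop fs -put sample1.bam "$HDFS_DIR/"

# 3. Compute with ADAM, writing output to HDFS.
run ./bin/adam-submit transformAlignments \
  "$HDFS_DIR/sample1.bam" "$HDFS_DIR/sample1.adam"

# 4. Copy results worth keeping back to S3. The output is assumed to be a
#    directory of files, hence the recursive copy.
run hadoop fs -get "$HDFS_DIR/sample1.adam" sample1.adam
run aws s3 cp sample1.adam "s3://$BUCKET/sample1.adam" --recursive
```

Keeping the default in print-only mode makes it easy to review the exact commands before anything is uploaded or billed.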
For small test files you may wish to skip S3 by uploading directly to spark-master using scp and then copying to HDFS using:
hadoop fs -put sample1.bam /work_dir/
From the ADAM shell, or as a parameter to ADAM submit, you would refer to HDFS URLs like this:
adam-submit \
transformAlignments \
hdfs://spark-master/work_dir/sample1.bam \
hdfs://spark-master/work_dir/sample1.adam
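The relationship between a path given to hadoop fs and the URL given to ADAM can be made explicit with a small helper. This is an illustrative sketch: hdfs_url is a hypothetical function, not part of ADAM or Hadoop, and it assumes the HDFS NameNode is reachable under the host name spark-master, as in the example above.

```shell
# Hypothetical helper: turn an absolute HDFS path into a full hdfs:// URL.
# Uses spark-master as the host unless HDFS_HOST is set.
hdfs_url() {
  # Strip one leading slash so the URL has exactly one after the host.
  printf 'hdfs://%s/%s\n' "${HDFS_HOST:-spark-master}" "${1#/}"
}

hdfs_url /work_dir/sample1.bam
# prints hdfs://spark-master/work_dir/sample1.bam
```

The helper can then be used inline, for example adam-submit transformAlignments "$(hdfs_url /work_dir/sample1.bam)" "$(hdfs_url /work_dir/sample1.adam)".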
Bulk Transfer between HDFS and S3¶
To transfer large amounts of data back and forth from S3 to HDFS, we suggest using Conductor. It is also possible to directly use AWS S3 as a distributed file system, but with some loss of performance.
Directly accessing data stored in S3¶
To directly access data stored in S3, we can leverage one of the Hadoop FileSystem API implementations that access S3. Specifically, we recommend using the S3a file system. To do this, you will need to configure your Spark job to use S3a. If you are using a vendor-supported Spark distribution like Amazon EMR or Databricks, your Spark distribution may already have the S3a file system installed. If not, you will need to add JARs that contain the classes needed to support the S3a file system. For most Spark distributions built for Apache Hadoop 2.6 or higher, you will need to add the following dependencies:
- com.amazonaws:aws-java-sdk-pom:1.10.34
- org.apache.hadoop:hadoop-aws:2.7.4
Instead of downloading these JARs, you can ask Spark to install them at runtime using the --packages flag. Additionally, if you are using the S3a file system, your file paths will need to begin with the s3a:// scheme:
adam-submit \
--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.4 \
-- \
transformAlignments \
s3a://my_bucket/my_reads.adam \
s3a://my_bucket/my_new_reads.adam
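S3a also needs AWS credentials. One common way to supply them, sketched below, is Spark's spark.hadoop.* configuration prefix, which forwards properties such as fs.s3a.access.key to the Hadoop configuration. This reflects an assumption about your setup rather than something the text above prescribes: it presumes the credentials are exported as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, and clusters that rely on IAM instance roles may not need it at all. The command is printed rather than executed unless DRY_RUN=0 is set.

```shell
# Sketch: pass S3a credentials to adam-submit as Spark configuration.
# Falls back to obvious placeholders if the variables are not exported.
ACCESS_KEY="${AWS_ACCESS_KEY_ID:-YOUR_ACCESS_KEY}"
SECRET_KEY="${AWS_SECRET_ACCESS_KEY:-YOUR_SECRET_KEY}"

# Print the command by default; set DRY_RUN=0 to run it instead.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Spark arguments go before the "--" separator, ADAM arguments after it.
run adam-submit \
  --packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.4 \
  --conf "spark.hadoop.fs.s3a.access.key=$ACCESS_KEY" \
  --conf "spark.hadoop.fs.s3a.secret.key=$SECRET_KEY" \
  -- \
  transformAlignments \
  s3a://my_bucket/my_reads.adam \
  s3a://my_bucket/my_new_reads.adam
```

Note that values passed on the command line are visible in the process list of the machine running the job, so a shared host may call for a different mechanism.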
If you are loading a BAM, CRAM, or VCF file, you will need to add an additional JAR. This is because the code that loads data stored in these file formats uses Java’s nio package to read index files. Java’s nio system allows users to specify a “file system provider,” which implements nio’s file system operations on non-POSIX file systems like HDFS or S3. To use these file formats with the s3a:// scheme, you should include the following dependency:
You will need to do this even if you are not using the index for said format.
CGCloud options and Spot Instances¶
View help docs for all options of the cgcloud create-cluster command:
cgcloud create-cluster -h
In particular, note the --spot-bid and related spot options, which let you use AWS spot instances to save on costs. To avoid unintended costs, it is a good idea to use the AWS console to double check that your instances have terminated.
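As an illustration, here is a spot-priced variant of the earlier launch command, followed by a scripted check for instances that are still running. The bid of 0.05 is an arbitrary placeholder (assumed to be in USD per hour; confirm the unit in the help output), and the aws ec2 describe-instances call is an alternative to the console check that assumes the AWS CLI is installed and configured. It lists instances in the CLI's default region only. Commands are printed rather than executed unless DRY_RUN=0 is set.

```shell
SPOT_BID="0.05"   # placeholder bid; check current spot prices before choosing

# Print each command by default; set DRY_RUN=0 to run it instead.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Launch the same cluster as before, but on spot instances.
run cgcloud create-cluster spark -c cluster1 -s 2 -t m3.large --spot-bid "$SPOT_BID"

# After tearing the cluster down, list anything still pending or running.
run aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=pending,running" \
  --query "Reservations[].Instances[].[InstanceId,InstanceType,State.Name]" \
  --output table
```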
Accessing the Spark GUI¶
To view the Spark web UI pages on spark-master (the application UI on port 4040 and the standalone master UI on port 8080), go to Security Groups in the AWS console and open inbound TCP for those ports from your local IP address. Find the IP address of spark-master, which is shown in its Linux command prompt. On your local machine, you can then open
http://ip_of_spark_master:4040/
in a web browser, where ip_of_spark_master is replaced with the IP address you found.
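The same ports can be opened from the command line instead of the console. The following is a sketch, assuming the AWS CLI is installed and configured; the security group ID and the IP address are placeholders that you must replace, and ingress_cmd is a hypothetical helper that only prints the commands so they can be reviewed first.

```shell
# Hypothetical helper: print the AWS CLI command that opens one TCP port
# on a security group to a single IPv4 address (/32 means that host only).
# Arguments: security group ID, port, IPv4 address.
ingress_cmd() {
  printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port %s --cidr %s/32\n' \
    "$1" "$2" "$3"
}

# Placeholders: substitute your cluster's security group and your local IP.
for ui_port in 4040 8080; do
  ingress_cmd sg-0123456789abcdef0 "$ui_port" 203.0.113.7
done
```

Restricting the rule to a /32 address keeps the Spark pages reachable from your machine without exposing them to the wider internet.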