Running ADAM on CDH 5, HDP, and other YARN-based Distros

Apache Hadoop YARN is a widely used scheduler in the Hadoop ecosystem. YARN stands for “Yet Another Resource Negotiator”, and the YARN architecture is described in (Vavilapalli et al. 2013). YARN is used in several common Hadoop distributions, including the Cloudera Hadoop Distribution (CDH) and the Hortonworks Data Platform (HDP), and is natively supported by Spark.

The ADAM CLI and shell can both be run on YARN. The ADAM CLI can be run in both Spark’s YARN cluster and client deploy modes, while the ADAM shell can only be run in client mode. In cluster mode, the Spark driver runs inside the YARN ApplicationMaster container; in client mode, the driver runs in the submitting process. Since the Spark driver for the Spark/ADAM shell takes input on stdin, it cannot run in cluster mode.

To run the ADAM CLI in YARN cluster mode, run the following command:

adam-submit \
  --master yarn \
  --deploy-mode cluster \
  -- \
  <adam_command_name> [options]

In the adam-submit command, all options before the -- are passed to the spark-submit script, which launches the Spark job, while the ADAM command name and its options follow the --. To run in client mode, we simply change the deploy-mode to client:

adam-submit \
  --master yarn \
  --deploy-mode client \
  -- \
  <adam_command_name> [options]

In the adam-shell command, all of the arguments are passed through to the spark-shell command. Thus, to run adam-shell on YARN, we run:

adam-shell \
  --master yarn \
  --deploy-mode client
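
Since adam-shell forwards all of its arguments to spark-shell, any spark-shell option can be supplied alongside the YARN flags. As a quick sketch, the executor count and memory below are illustrative values, not recommendations:

adam-shell \
  --master yarn \
  --deploy-mode client \
  --num-executors 8 \
  --executor-memory 16g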

All of these commands assume that the Spark assembly you are using is properly configured for your YARN deployment. Typically, if your Spark assembly is properly configured to use YARN, there will be a symbolic link at ${SPARK_HOME}/conf/yarn-conf/ that points to the core Hadoop/YARN configuration. However, this may vary depending on the distribution you are running.
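
If no such link exists, Spark can also locate the cluster configuration through the HADOOP_CONF_DIR (or YARN_CONF_DIR) environment variable. A minimal sketch follows; the /etc/hadoop/conf path is an assumption, so substitute the directory that holds your cluster’s core-site.xml and yarn-site.xml:

# Point Spark at the Hadoop/YARN client configuration (path is illustrative).
export HADOOP_CONF_DIR=/etc/hadoop/conf

adam-submit \
  --master yarn \
  --deploy-mode cluster \
  -- \
  <adam_command_name> [options]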

The full list of configuration options for running Spark-on-YARN can be found online. Most of the standard configurations are consistent between Spark Standalone and Spark-on-YARN. One important configuration option to be aware of is the YARN memory overhead parameter. From Spark 1.5.0 onwards, Spark makes aggressive use of off-heap memory allocation in the JVM. These allocations may cause the amount of memory taken up by a single executor (or, theoretically, the driver) to exceed the --driver-memory/--executor-memory parameters, which are what Spark provides as a memory resource request to YARN. By default, if one of your Spark containers (an executor or the driver) exceeds its memory request, YARN will kill the container by sending a SIGTERM. This can cause jobs to fail.

To eliminate this issue, you can set the spark.yarn.<role>.memoryOverhead parameter, where <role> is either driver or executor. This parameter is used by Spark to increase its resource request to YARN over the JVM heap size indicated by --driver-memory or --executor-memory.
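
For example, to pad both the driver’s and the executors’ resource requests by 2 GB (the memoryOverhead values are given in megabytes; 2048 is illustrative), we would add the two --conf flags below to the adam-submit invocation:

adam-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.driver.memoryOverhead=2048 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  -- \
  <adam_command_name> [options]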

As a final example, to run the ADAM transformAlignments CLI in YARN cluster mode on a 64-node cluster with one executor per node and a 2 GB per-executor memory overhead, we would run:

adam-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 200g \
  --executor-memory 200g \
  --conf spark.driver.cores=16 \
  --conf spark.executor.cores=16 \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --conf spark.executor.instances=64 \
  -- \
  transformAlignments in.sam out.adam
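
Once the job has been submitted in cluster mode, it can be tracked with the standard YARN CLI. A brief sketch, assuming the yarn command from your Hadoop distribution is on your path and log aggregation is enabled; <application_id> is a placeholder:

# List running YARN applications to find the application ID.
yarn application -list

# Fetch the aggregated logs once the application has finished.
yarn logs -applicationId <application_id>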

In this example, we are allocating 200 GB of JVM heap space for each executor and for the driver, and we are telling Spark to request 16 cores for each executor and for the driver. Together with the 2048 MB memory overhead, each executor’s total memory request to YARN comes to roughly 202 GB.