Headless Spark on YARN
Run your custom version of Spark on HDP
Bundled Hadoop distributions ship very mature software. Enterprises regularly rely on vendors like Cloudera to bundle the open source Hadoop ecosystem. This means the bundled tool versions are quite mature: it takes a while until the distributor includes them in an upgrade, and then more time passes until an enterprise updates its off-the-shelf Hadoop distribution. For fast-moving projects like Spark, however, this easily results in running a quite outdated version which lacks important features. In my experience, enterprises with an on-premise installation are oftentimes still running Spark 2.2.x.
In the OSS world, support for the 2.2 line of Spark was discontinued with 2.2.3.
Several really interesting features are missing if you are still running Spark 2.2.x:
- Higher-order functions
- Bucketed Hive tables
- Improved Structured Streaming, including properly supported stream-stream joins
- Scala 2.12 support
- Many bug fixes
- Kubernetes (K8s) integration
and many more.
But there is another possibility: Spark implements the YARN API, so you can simply download any version of Spark with YARN support and run it on YARN.
- Download your version of Spark with YARN support. As I am running on an HDP cluster with existing Hadoop binaries, I will choose the headless edition (“pre-built with scala 2.11”) here. At the time of writing, 2.4.2 is the most up-to-date released version.
- Move the archive to your cluster and extract the tar file.
- Change into the extracted directory.
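The steps above can be sketched as follows. The version and download mirror are assumptions; adjust them to the release and package variant you actually need:

```shell
# Sketch of the download-and-extract steps. SPARK_VERSION and the mirror URL
# are assumptions -- pick the release and package matching your cluster.
SPARK_VERSION="2.4.2"
PACKAGE="spark-${SPARK_VERSION}-bin-without-hadoop"   # the headless tarball
URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${PACKAGE}.tgz"

wget --timeout=30 --tries=2 "${URL}"   # download (or copy the file to a cluster node)
tar -xzf "${PACKAGE}.tgz"              # extract the tar file
cd "${PACKAGE}"                        # change into the extracted directory
```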
Then figure out your version of HDP:

```shell
ls /usr/hdp # 2.6.xxx
```
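The version printed there is needed again for the `-Dhdp.version` flags further down. A small helper can capture it programmatically (a sketch; it assumes the standard `/usr/hdp` layout with one versioned directory next to the `current` symlink):

```shell
# Capture the HDP version from the directory layout (assumption: /usr/hdp
# contains exactly one versioned directory plus the "current" symlink).
hdp_version() {
  ls "${1:-/usr/hdp}" | grep -E '^[0-9]' | head -n 1
}
```

This allows writing e.g. `-Dhdp.version=$(hdp_version)` instead of hard-coding the version.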
Due to a bad substitution bug in Ambari, the current HDP version needs to be supplied manually so that the classpath resolves correctly:
```shell
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HADOOP_CONF_DIR=/usr/hdp/current/spark2-client/conf
export SPARK_HOME=/path/to/my/extracted/spark/folder/

./bin/spark-shell --master yarn --deploy-mode client \
  --queue <<my_yarn_queue>> \
  --conf spark.driver.extraJavaOptions='-Dhdp.version=2.6.xxx' \
  --conf spark.yarn.am.extraJavaOptions='-Dhdp.version=2.6.xxx'
```
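To avoid passing the two `hdp.version` flags on every invocation, the same settings can also live in `conf/spark-defaults.conf` of the extracted Spark folder (a sketch; `2.6.xxx` stands for your actual HDP version):

```
# conf/spark-defaults.conf
spark.driver.extraJavaOptions   -Dhdp.version=2.6.xxx
spark.yarn.am.extraJavaOptions  -Dhdp.version=2.6.xxx
```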
You should then be greeted with:
```
19/05/01 13:08:31 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.2
      /_/

Using Scala version 2.11.12 (OpenJDK 64-Bit Server VM, Java 1.8.0_201)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```
Congratulations! You can now run any version of Spark.