Production-grade PySpark jobs

Using conda-pack

When working on a PySpark job it can be really useful to include additional Python packages. Conda usually does a great job of managing these packages, and conda-pack can zip up the resulting environment so that it can be distributed to a Spark cluster.
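
As a minimal sketch, such an environment.yml could look like the following; the listed packages are purely illustrative, only the name has to match the env used in the Makefile below:

name: my-custom-foo-env
channels:
  - conda-forge
dependencies:
  - python=2.7        # matches the py2.7 egg built below
  - pyspark
  - pandas
  - conda-pack        # so the packed env can be built from within the env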

To automate this, consider a Makefile with the following contents:

# Note that the extra activate is needed to ensure that the activated env is floated to the front of PATH
CONDA_ACTIVATE=source $$(conda info --base)/etc/profile.d/conda.sh ; conda activate ; conda activate
# https://stackoverflow.com/questions/53382383/makefile-cant-use-conda-activate
setup:
	# initial setup. Needs to be re-executed every time a dependency is changed (transitive dependency)
	# typically only once when the workflow is deployed
	conda env create -f environment.yml && \
	    rm -rf  my-custom-foo-env.tar.gz && \
		($(CONDA_ACTIVATE) my-custom-foo-env ; conda pack -n my-custom-foo-env )
prepare:
	# needs to be re-executed any time the contents of the lib module are changed (basically every time)
	python setup.py bdist_egg
training: prepare
	spark-submit --verbose \
	    --master yarn \
	    --deploy-mode cluster \
	    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=my-custom-foo-env/bin/python \
	    --archives my-custom-foo-env.tar.gz#my-custom-foo-env \
	    --py-files dist/my_library-0.0.1-py2.7.egg \
	    main.py

Firstly, conda pack -n my-custom-foo-env takes the conda environment defined in environment.yml and packages it into an archive using conda-pack. Secondly, the job must be defined as a Python library itself so that any submodules are included cleanly. For this, python setup.py bdist_egg is used to generate an Egg file (mostly just another ZIP file).
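
A minimal setup.py for this could look roughly as follows; the name my_library and the flat package layout are assumptions chosen to match the egg referenced in the Makefile:

from setuptools import setup, find_packages

# `python setup.py bdist_egg` then produces dist/my_library-0.0.1-py2.7.egg
setup(
    name='my_library',
    version='0.0.1',
    packages=find_packages(),
)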

Finally, we need to tell PySpark to (a) use the dependencies and modules and (b) use our own specific version of Python distributed in the conda-pack archive. For (a):

--archives my-custom-foo-env.tar.gz#my-custom-foo-env \
--py-files dist/my_library-0.0.1-py2.7.egg \

are required. The configuration property --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=my-custom-foo-env/bin/python ensures (b).
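
One way to check that this actually works is to have the (hypothetical) main.py print which interpreter the driver and the executors use; the paths should point into the unpacked my-custom-foo-env:

import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('conda-pack-smoke-test').getOrCreate()

# interpreter of the driver / application master
print('driver python: %s' % sys.executable)

# interpreters of the executors
executor_pythons = (spark.sparkContext
                    .parallelize(range(2), 2)
                    .map(lambda _: sys.executable)
                    .collect())
print('executor pythons: %s' % set(executor_pythons))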

Edit

Please keep in mind that the example above assumes the job is submitted to Spark in YARN cluster mode! There, spark.yarn.appMasterEnv.PYSPARK_PYTHON and spark.yarn.executorEnv.PYSPARK_PYTHON would be set to my-custom-foo-env/bin/python (though the executor one seems to be optional). When you instead want to execute this in YARN client mode (for example in a Jupyter notebook started from an edge node of the cluster), you must set:

import os
os.environ['PYSPARK_PYTHON'] = 'my-custom-foo-env/bin/python'

as Spark will otherwise not find the right version of Python and the right environment.
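
As a rough sketch of such a client-mode session (assuming the notebook kernel runs on an edge node that has access to my-custom-foo-env.tar.gz; spark.yarn.dist.archives is the configuration equivalent of the --archives flag used above):

import os

from pyspark.sql import SparkSession

# must be set before the SparkSession (and its JVM) is created; the relative
# path resolves inside the unpacked archive in each YARN container
os.environ['PYSPARK_PYTHON'] = 'my-custom-foo-env/bin/python'

spark = (SparkSession.builder
         .master('yarn')
         # configuration equivalent of --archives on spark-submit
         .config('spark.yarn.dist.archives', 'my-custom-foo-env.tar.gz#my-custom-foo-env')
         .getOrCreate())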
