Production-grade PySpark jobs

Using conda-pack

A PySpark job often needs additional Python packages. Conda usually does a great job of managing them, and conda-pack can bundle the whole environment into a single archive that is then distributed to the Spark cluster.

To automate this, consider a Makefile with the following contents:

# "source" is a bash builtin, so do not rely on make's default /bin/sh
SHELL := /bin/bash

# Note that the extra activate is needed to ensure that the activate floats env to the front of PATH
CONDA_ACTIVATE=source $$(conda info --base)/etc/profile.d/conda.sh ; conda activate ; conda activate

# "setup" is an assumed target name
setup:
	# initial setup. Needs to be re-executed every time a dependency (including a transitive one) changes
	# typically only once when the workflow is deployed
	conda env create -f environment.yml && \
	    rm -rf my-custom-foo-env.tar.gz && \
	    ($(CONDA_ACTIVATE) my-custom-foo-env ; conda pack -n my-custom-foo-env)

prepare:
	# needs to be re-executed any time the contents of the lib module change (basically every time)
	python setup.py bdist_egg

training: prepare
	# main.py is a placeholder; substitute the job's entry-point script
	spark-submit --verbose \
	    --master yarn \
	    --deploy-mode cluster \
	    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=my-custom-foo-env/bin/python \
	    --archives my-custom-foo-env.tar.gz#my-custom-foo-env \
	    --py-files dist/my_library-0.0.1-py2.7.egg \
	    main.py

First, conda pack -n my-custom-foo-env uses conda-pack to bundle the conda environment defined in environment.yml into an archive (my-custom-foo-env.tar.gz). Second, the job itself must be defined as a Python library so that its submodules are included cleanly. For this, python setup.py bdist_egg generates an egg file (essentially just another ZIP file).
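To make the "egg is just a ZIP file" point concrete, here is a minimal sketch that builds a stand-in archive by hand and imports a module from it. The package contents (a transforms module with a double function) are invented for illustration, and the archive is written with zipfile rather than by setuptools, so treat it as a demonstration of the mechanism only.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a tiny stand-in for dist/my_library-0.0.1-py2.7.egg.
# The module and function below are hypothetical, not part of the real library.
workdir = tempfile.mkdtemp()
egg_path = os.path.join(workdir, "my_library-0.0.1.egg")
with zipfile.ZipFile(egg_path, "w") as egg:
    egg.writestr("my_library/__init__.py", "")
    egg.writestr("my_library/transforms.py", "def double(x):\n    return 2 * x\n")

# Python can import directly from a ZIP archive that is on sys.path.
sys.path.insert(0, egg_path)
transforms = importlib.import_module("my_library.transforms")

result = transforms.double(21)
print(result)
```

Archives passed via --py-files are put on the Python path of the job in a comparable way, which is why the job's submodules can be imported without installing the library on the cluster nodes.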

Finally, we need to tell PySpark to a) use the dependencies and modules and b) use the specific Python interpreter shipped in the conda-pack archive. For a:

--archives my-custom-foo-env.tar.gz#my-custom-foo-env \
--py-files dist/my_library-0.0.1-py2.7.egg \

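are required. The #my-custom-foo-env suffix makes YARN unpack the archive into a directory of that name in each container's working directory, and --py-files puts the egg on the Python path. For b, the configuration property --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=my-custom-foo-env/bin/python points the application master (which runs the driver in cluster mode) at the interpreter inside the unpacked archive.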
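One way to check that the packed interpreter is really being used is to look at sys.executable on the driver and on the executors. The sketch below is only an illustration under assumptions: the helper names runs_in_packed_env and interpreter_report are hypothetical, spark is assumed to be a SparkSession the job has already created, and the exact paths reported depend on the cluster.

```python
import os
import sys


def runs_in_packed_env(executable, env_alias="my-custom-foo-env"):
    """Return True if the interpreter path contains the unpacked archive directory."""
    parts = os.path.normpath(executable).split(os.sep)
    return env_alias in parts


def interpreter_report(spark, env_alias="my-custom-foo-env"):
    """Map each interpreter path seen on the driver and executors to a True/False check.

    `spark` is an existing SparkSession; nothing here is called unless the job calls it.
    """
    driver = sys.executable
    executors = (
        spark.sparkContext.parallelize(range(4), 4)
        .map(lambda _: sys.executable)  # evaluated inside the executor's Python worker
        .distinct()
        .collect()
    )
    return {path: runs_in_packed_env(path, env_alias) for path in [driver] + executors}
```

Calling interpreter_report(spark) early in the job and logging the result gives a quick sanity check: any entry mapped to False suggests that a system Python was picked up instead of the one from the archive.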

Georg Heiler
PhD candidate & data scientist

My research interests include large geo-spatial time and network data analytics.
