Production-grade PySpark jobs
Using conda-pack
When working on a PySpark job it can be really useful to include additional Python packages. conda usually does a great job of managing Python packages, and conda-pack can zip up such an environment for distribution to a Spark cluster.
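For reference, such an environment might be described by an environment.yml similar to the following; the environment name matches the one used throughout this post, but the listed packages are purely illustrative:

```yaml
name: my-custom-foo-env
channels:
  - conda-forge
dependencies:
  - python=2.7        # should match the Python version expected on the cluster
  - pandas
  - conda-pack
```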
To automate this, consider a Makefile with the following contents:
```makefile
# Note that the extra activate is needed to ensure that the activate floats the env to the front of PATH
# https://stackoverflow.com/questions/53382383/makefile-cant-use-conda-activate
CONDA_ACTIVATE=source $$(conda info --base)/etc/profile.d/conda.sh ; conda activate ; conda activate

setup:
	# initial setup; needs to be re-executed every time a dependency is changed (transitive dependencies included),
	# typically only once when the workflow is deployed
	conda env create -f environment.yml && \
	rm -rf my-custom-foo-env.tar.gz && \
	($(CONDA_ACTIVATE) my-custom-foo-env ; conda pack -n my-custom-foo-env)

prepare:
	# needs to be re-executed any time the contents of the lib module are changed (basically every time)
	python setup.py bdist_egg

training: prepare
	spark-submit --verbose \
		--master yarn \
		--deploy-mode cluster \
		--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=my-custom-foo-env/bin/python \
		--archives my-custom-foo-env.tar.gz#my-custom-foo-env \
		--py-files dist/my_library-0.0.1-py2.7.egg \
		main.py
```
First, conda pack -n my-custom-foo-env packs the conda environment created from the environment.yml file into a compressed archive (my-custom-foo-env.tar.gz) using conda-pack.
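The result is an ordinary gzipped tarball with the environment's files at the top level (bin/python included), which is exactly why the relative path my-custom-foo-env/bin/python works once the archive is unpacked. A small sketch of inspecting such an archive with the standard library; the archive built here is a tiny stand-in for a real conda-pack output:

```python
import io
import os
import tarfile
import tempfile

def list_archive(path):
    """Return the member names of a conda-pack style tar.gz archive."""
    with tarfile.open(path, mode="r:gz") as tar:
        return tar.getnames()

# Build a tiny stand-in archive with the layout conda-pack produces:
# files sit at the top level, so 'bin/python' appears without an env prefix.
tmpdir = tempfile.mkdtemp()
archive = os.path.join(tmpdir, "my-custom-foo-env.tar.gz")
with tarfile.open(archive, mode="w:gz") as tar:
    data = b"#!/usr/bin/env fake-python\n"
    info = tarfile.TarInfo(name="bin/python")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

print(list_archive(archive))  # ['bin/python']
```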
Secondly, the job must be packaged as a Python library itself to nicely include any submodules. For this, python setup.py bdist_egg is used to generate an egg file (mostly just another ZIP file).
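That an egg really is just a ZIP file with some metadata can be verified with the standard zipfile module. The sketch below builds a minimal egg-like archive by hand (the module name and contents are made up, mirroring the egg name from the Makefile):

```python
import os
import tempfile
import zipfile

# Build a minimal egg-like archive by hand to show there is no magic:
# a ZIP containing the package's modules plus EGG-INFO metadata.
tmpdir = tempfile.mkdtemp()
egg_path = os.path.join(tmpdir, "my_library-0.0.1-py2.7.egg")
with zipfile.ZipFile(egg_path, "w") as egg:
    egg.writestr("my_library/__init__.py", "VERSION = '0.0.1'\n")
    egg.writestr("EGG-INFO/PKG-INFO", "Name: my-library\nVersion: 0.0.1\n")

with zipfile.ZipFile(egg_path) as egg:
    print(sorted(egg.namelist()))
# ['EGG-INFO/PKG-INFO', 'my_library/__init__.py']
```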
Finally, we need to tell PySpark to a) use the dependencies and modules and b) use our own specific version of Python distributed in the conda-pack archive. For a):
```
--archives my-custom-foo-env.tar.gz#my-custom-foo-env \
--py-files dist/my_library-0.0.1-py2.7.egg \
```
The configuration property --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=my-custom-foo-env/bin/python ensures b).
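The #my-custom-foo-env suffix on --archives is an alias: YARN localizes the archive into the container's working directory under that name, which is what makes the relative PYSPARK_PYTHON path resolve. That localization step can be imitated locally; this is a rough sketch using stand-in files, not what YARN literally runs:

```shell
# Imitate YARN localization of '--archives my-custom-foo-env.tar.gz#my-custom-foo-env':
# the part after '#' becomes the directory name inside the container workdir.
workdir=$(mktemp -d)

# Stand-in archive with the conda-pack layout (bin/python at the top level).
srcdir=$(mktemp -d)
mkdir -p "$srcdir/bin"
echo '#!/bin/sh' > "$srcdir/bin/python"
tar -czf "$workdir/my-custom-foo-env.tar.gz" -C "$srcdir" bin

# YARN-style unpack under the alias name:
mkdir "$workdir/my-custom-foo-env"
tar -xzf "$workdir/my-custom-foo-env.tar.gz" -C "$workdir/my-custom-foo-env"

# The relative path used for PYSPARK_PYTHON now exists inside the workdir:
ls "$workdir/my-custom-foo-env/bin/python"
```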
Please keep in mind that the example above assumes submitting to Spark in YARN cluster mode! In that mode, spark.yarn.appMasterEnv.PYSPARK_PYTHON is set to my-custom-foo-env/bin/python (though the executor-side setting seems to be optional). When you instead want to execute this in YARN client mode (for example in a Jupyter notebook started from an edge node of the cluster), you must set:
```python
import os
os.environ['PYSPARK_PYTHON'] = 'my-custom-foo-env/bin/python'
```
as Spark otherwise will not find the right version of Python and the right environment.