I Can't Seem to Get --py-files on Spark to Work

To get this dependency distribution approach to work with compiled extensions, we need to do two things:

  1. Run the pip install on the same OS as your target cluster (preferably on the master node of the cluster). This ensures compatible binaries are included in your zip.
  2. Unzip your archive on the destination node, since Python will not import compiled extensions from zip files (see https://docs.python.org/3.8/library/zipimport.html).

The following script creates your dependencies zip while keeping you isolated from any packages already installed on your system. It assumes virtualenv is installed and requirements.txt is present in your current directory, and it outputs a dependencies.zip with all your dependencies at the root level.

env_name=temp_env

# create the virtual env
virtualenv --python=$(which python3) --clear /tmp/${env_name}

# activate the virtual env
source /tmp/${env_name}/bin/activate

# download and install dependencies
pip install -r requirements.txt

# package the dependencies into dependencies.zip; the cd works around zip's lack of an option to set a base directory
(cd /tmp/${env_name}/lib/python*/site-packages/ && zip -r - *) > dependencies.zip
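
If you want a quick sanity check before shipping the archive (optional, and just a sketch), you can confirm that compiled extensions built for the cluster's OS actually made it into the zip:

import zipfile

# list native extension modules (.so files) packaged into dependencies.zip
with zipfile.ZipFile("dependencies.zip") as zf:
    compiled = [name for name in zf.namelist() if name.endswith(".so")]
print(f"{len(compiled)} compiled extension files found")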

The dependencies can now be deployed, unzipped, and included in the PYTHONPATH as follows:

spark-submit \
--master yarn \
--deploy-mode cluster \
--conf 'spark.yarn.dist.archives=dependencies.zip#deps' \
--conf 'spark.yarn.appMasterEnv.PYTHONPATH=deps' \
--conf 'spark.executorEnv.PYTHONPATH=deps' \
  ...

spark.yarn.dist.archives=dependencies.zip#deps

distributes your zip file and unzips it to a directory called deps

spark.yarn.appMasterEnv.PYTHONPATH=deps

spark.executorEnv.PYTHONPATH=deps

includes the deps directory in the PYTHONPATH for the master and all workers

--deploy-mode cluster

runs the driver on the cluster (inside the YARN application master) so that it also picks up the dependencies
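
To verify the setup end to end, a minimal job like the sketch below can be submitted the same way; it assumes pandas (used here purely as an illustrative package) was listed in requirements.txt, and checks that executors can import it from the unzipped deps directory.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def check_import(_):
    # resolved from the unzipped deps/ directory via PYTHONPATH on each executor
    import pandas
    return pandas.__version__

# run the import in one executor task and report the version back to the driver
print(spark.sparkContext.parallelize([0], numSlices=1).map(check_import).collect())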

pyspark addPyFile to add zip of .py files, but module still not found

Fixed the problem. Admittedly, the solution is not entirely Spark-related, but I'm leaving the question posted for the sake of others who may have a similar problem, since the error message did not make my mistake obvious from the start.

TL;DR: Make sure the contents of the zip file being loaded are structured and named the way your code expects, and that each package directory contains an __init__.py.


The package I was trying to load into the Spark context via zip was of the form

mypkg
  file1.py
  file2.py
  subpkg1
    file11.py
  subpkg2
    file21.py

My zip, when inspected with less mypkg.zip, showed

file1.py file2.py subpkg1 subpkg2

So two things were wrong here:

  1. I was not zipping the top-level directory, which is the main package the code expected to work with.
  2. I was not zipping the lower-level directories (the subpackages were not included recursively).

Solved with
zip -r mypkg.zip mypkg

More specifically, I had to make two zip files:

  1. For the dist-keras package:

    cd dist-keras; zip -r distkeras.zip distkeras

see https://github.com/cerndb/dist-keras/tree/master/distkeras

  2. For the keras package used by distkeras (which is not installed across the cluster):

    cd keras; zip -r keras.zip keras

see https://github.com/keras-team/keras/tree/master/keras

So declaring the Spark session looked like this:

conf = SparkConf()
conf.set("spark.app.name", application_name)
conf.set("spark.master", master)  # master = 'yarn-client'
conf.set("spark.executor.cores", str(num_cores))
conf.set("spark.executor.instances", str(num_executors))
conf.set("spark.locality.wait", "0")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

# Check if the user is running Spark 2.0+
if using_spark_2:
    from pyspark.sql import SparkSession

    sc = SparkSession.builder.config(conf=conf) \
        .appName(application_name) \
        .getOrCreate()
    sc.sparkContext.addPyFile("/home/me/projects/keras-projects/exploring-keras/keras-dist_test/dist-keras/distkeras.zip")
    sc.sparkContext.addPyFile("/home/me/projects/keras-projects/exploring-keras/keras-dist_test/keras/keras.zip")
    print(sc.version)
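
As a quick sanity check (a sketch, assuming the two zips were built as shown above), the top-level packages should now be importable on the driver as well, since addPyFile also places the archives on the driver's sys.path:

import distkeras
import keras
print(distkeras.__name__, keras.__name__)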

How to import a python module that I added to a cluster via --py-files?

A (better) solution is to use the SparkFiles object in pyspark to locate your imports.

from pyspark import SparkFiles

spark.sparkContext.addPyFile(SparkFiles.get("pyspark_jdbc.zip"))
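
For context, SparkFiles.get resolves the local path of a file that has already been shipped with the job (for example via --py-files or --files), and that local path can then be registered for imports. A minimal sketch, with an illustrative archive name:

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "pyspark_jdbc.zip" stands in for whatever archive was shipped with the job
local_path = SparkFiles.get("pyspark_jdbc.zip")  # absolute path on this node
spark.sparkContext.addPyFile(local_path)         # make its contents importable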

pyspark import user defined module or .py files

It turned out that since I'm submitting my application in client mode, the machine I run the spark-submit command from runs the driver program and needs access to the module files.


I added my module to the PYTHONPATH environment variable on the node I submit my job from by adding the following line to my .bashrc file (or by executing it before submitting the job).

export PYTHONPATH=$PYTHONPATH:/home/welshamy/modules

And that solved the problem. Since the path is on the driver node, I don't have to zip and ship the module with --py-files or use sc.addPyFile().

The key to solving any PySpark module import error is understanding whether the driver node, the worker nodes, or both need the module files.

Important
If the worker nodes need your module files, then you need to pass them as a zip archive with --py-files, and this argument must precede your .py file argument. Note the order of arguments in these examples:

This is correct:

./bin/spark-submit --py-files wesam.zip mycode.py

This is not correct:

./bin/spark-submit mycode.py --py-files wesam.zip

Cannot send Python dependencies to Spark on EMR via Livy

The files, zips, and eggs listed under pyFiles in the curl call set the Spark config spark.submit.pyFiles. Spark takes care of downloading the files and adding the files/zips to the PYTHONPATH.

You don't need to add the files again with sc.addPyFile() (the code above is looking for the dependency file on the default filesystem, which in your case is file://, so Spark is not able to find it).

We can remove the addPyFile call and try to import a class from the specified zip file to confirm it is on the PYTHONPATH.
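
For reference, here is a hedged sketch of what the Livy batch submission can look like, written with the requests library rather than curl; the host, bucket, and file names are placeholders:

import json
import requests

payload = {
    "file": "s3://my-bucket/jobs/main.py",                # entry-point script
    "pyFiles": ["s3://my-bucket/jobs/dependencies.zip"],  # ends up in spark.submit.pyFiles
}
response = requests.post(
    "http://livy-host:8998/batches",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
print(response.json())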

Azure Synapse: Upload directory of py files in Spark job reference files

The way to achieve this on Synapse is to package your Python files into a wheel and upload the wheel to a specific location in the Azure Data Lake Storage account from which your Spark pool loads packages every time it starts. This makes the custom Python packages available to all jobs and notebooks using that Spark pool.

You can find more details on the official documentation: https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-manage-python-packages#install-wheel-files
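
If it helps, a minimal setup.py along these lines (package name and version are placeholders) is enough to turn a directory of .py files into a wheel with python setup.py bdist_wheel or python -m build; the resulting .whl file is what you upload to the workspace packages location:

from setuptools import setup, find_packages

setup(
    name="mypkg",              # placeholder package name
    version="0.1.0",           # placeholder version
    packages=find_packages(),  # picks up mypkg/ and its subpackages (each needs __init__.py)
)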


