Importing PySpark in the Python Shell

Importing PySpark in the Python shell

It turns out that the pyspark launcher script loads Python and sets up the correct library paths automatically. Check out $SPARK_HOME/bin/pyspark:

export SPARK_HOME=/some/path/to/apache-spark
# Add the PySpark classes to the Python path:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

I added these lines to my .bashrc file and the modules are now found correctly!
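
Depending on the Spark release, the launcher script also puts the bundled Py4J sources on PYTHONPATH, and import pyspark can otherwise fail with "No module named py4j". The zip's version suffix below is an assumption; check $SPARK_HOME/python/lib for the actual file name:

# Also expose the Py4J sources bundled with Spark (version suffix varies by release)
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH

# After re-sourcing .bashrc, a quick sanity check from a plain Python interpreter:
python -c "import pyspark; print(pyspark.__version__)"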

spark-submit Python packages with venv: "Cannot run program"

I managed to make it work by creating the virtualenv inside the EMR cluster, packaging it with venv-pack, and uploading the resulting .tar.gz to an S3 bucket. This article helped: gist.github.

Inside the EMR shell:

# Create and activate our virtual environment
virtualenv -p python3 venv-datapeeps
source ./venv-datapeeps/bin/activate

# Upgrade pip and install a couple libraries
pip3 install --upgrade pip
pip3 install fuzzy-c-means boto3 venv-pack

# Package the environment and upload
venv-pack -o pyspark_venv.tar.gz
aws s3 cp pyspark_venv.tar.gz s3://<BUCKET>/artifacts/pyspark/
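
To actually run a job inside the packed environment, point spark-submit at the uploaded archive. This is a minimal sketch of the standard Spark-on-YARN pattern; the bucket path, the "environment" alias, and the job.py script name are placeholders:

# Submit a job that runs inside the packed virtualenv (paths are placeholders)
spark-submit \
    --deploy-mode cluster \
    --archives s3://<BUCKET>/artifacts/pyspark/pyspark_venv.tar.gz#environment \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
    s3://<BUCKET>/artifacts/pyspark/job.py

The #environment suffix is the directory name the archive is unpacked under on the cluster, which is why PYSPARK_PYTHON points at ./environment/bin/python.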

How to correctly import pyspark.sql.functions?

You can use from pyspark.sql.functions import *, but this can shadow names in the current namespace, for example PySpark's sum function covering Python's built-in sum.

A safer approach is import pyspark.sql.functions as F and then calling the functions through the alias, e.g. F.sum.
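
For illustration, a minimal sketch (the app name and column are made up) showing that the aliased import leaves the built-in sum untouched:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("functions-import-demo").getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["value"])

# Spark's aggregate is reached through the alias, so Python's built-in sum is not shadowed
df.select(F.sum("value").alias("total")).show()
print(sum([1, 2, 3]))  # still the built-in sum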


