Configuring Spark to Work with Jupyter Notebook and Anaconda

Conda can help correctly manage a lot of dependencies...

Install Spark. Assuming Spark is installed in /opt/spark, include this in your ~/.bashrc:

export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH

Create a conda environment with all the needed dependencies apart from Spark:

conda create -n findspark-jupyter-openjdk8-py3 -c conda-forge python=3.5 jupyter=1.0 notebook=5.0 openjdk=8.0.144 findspark=1.1.0

Activate the environment:

$ source activate findspark-jupyter-openjdk8-py3

Launch a Jupyter Notebook server:

$ jupyter notebook

In your browser, create a new Python 3 notebook.

Try calculating Pi with the following script (borrowed from this example):

import findspark
findspark.init()

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    # Sample a random point in the unit square and test whether it falls inside the unit circle
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()

Initialize PySpark in Jupyter Notebook using the spark-defaults.conf file

There is a very long list of misconceptions here, the majority of them connected to this simple fact:

Furthermore, I am working on a Jupyter notebook on my local computer.

  • Local mode is a development and testing tool; it is not designed or optimized for performance.
  • spark.executor properties are meaningless in local mode, because there is only one JVM running, the Spark driver, and only its configuration is used (see the sketch after this list).
  • Squeezing every bit of RAM and CPU you have into the Spark session is not the same as having an optimal configuration. It looks like the same container also hosts at least a database, which in that case would be starved of resources.
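
As a rough illustration of the second point, here is a minimal sketch (with an assumed app name and example values) of configuring a local-mode context from a notebook; the executor setting below is effectively ignored because no separate executor JVMs exist:

import findspark
findspark.init()

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[4]")                # 4 worker threads inside the single driver JVM
        .setAppName("local-mode-example")     # assumed app name
        .set("spark.executor.memory", "8g"))  # no effect in local mode - there are no executors

sc = SparkContext(conf=conf)
print(sc.defaultParallelism)                  # 4, driven by local[4], not by any executor setting
sc.stop()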

Additionally:

  • The Kryo serializer can have minimal or no impact with PySpark and the SQL API.
  • "It is not possible to use the command line" - it is perfectly possible by using PYSPARK_SUBMIT_ARGS (see the sketch after this list).
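
A minimal sketch (with assumed option values) of using PYSPARK_SUBMIT_ARGS to pass spark-submit style options to a PySpark session started from inside a Jupyter notebook; the variable must be set before the SparkContext is created and has to end with the pyspark-shell token:

import os

# Assumed example options; the trailing "pyspark-shell" token is required
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-memory 2g "
    "--conf spark.sql.shuffle.partitions=8 "
    "pyspark-shell"
)

import findspark
findspark.init()

import pyspark

sc = pyspark.SparkContext(appName="submit-args-example")
print(sc.getConf().get("spark.driver.memory"))   # should report 2g
sc.stop()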

Finally, there is no such thing as an optimal configuration that fits all scenarios. For example, if you use any Python code, "maximizing JVM memory allocation" will leave the Python code without the required resources. At the same time, cores and memory are only a subset of the resources you have to tune; for many jobs, IO (local disk IO, storage IO) is far more important.
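
Since the question mentions spark-defaults.conf: properties for a session like this can also be placed in $SPARK_HOME/conf/spark-defaults.conf, which is read by spark-submit when the context is launched. A minimal sketch with purely illustrative values:

spark.master                   local[4]
spark.driver.memory            2g
spark.serializer               org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions   8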


