Configuring Spark to work with Jupyter Notebook and Anaconda
Conda can help correctly manage a lot of dependencies...
Install Spark. Assuming Spark is installed in /opt/spark, include this in your ~/.bashrc:
export SPARK_HOME=/opt/spark
export PATH=$SPARK_HOME/bin:$PATH
Create a conda environment with all the needed dependencies apart from Spark:
conda create -n findspark-jupyter-openjdk8-py3 -c conda-forge python=3.5 jupyter=1.0 notebook=5.0 openjdk=8.0.144 findspark=1.1.0
Activate the environment:
$ source activate findspark-jupyter-openjdk8-py3
Launch a Jupyter Notebook server:
$ jupyter notebook
In your browser, create a new Python 3 notebook.
Try calculating π with the following script:
import findspark
findspark.init()

import pyspark
import random

sc = pyspark.SparkContext(appName="Pi")
num_samples = 100000000

def inside(p):
    # p (the element from range) is ignored; each call just draws a fresh
    # random point, which lands inside the quarter circle with probability pi/4.
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
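Note that findspark.init() is what makes this work from a plain Python kernel: it locates the Spark installation through SPARK_HOME and adds its Python libraries to sys.path, so import pyspark succeeds without any kernel-specific setup.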
Initialize PySpark in Jupyter Notebook using the spark-defaults.conf file
There is a very long list of misconceptions here, most of them connected to this simple statement from the question: "Furthermore, I am working on a jupyter notebook in my local computer."
- local mode is a development and testing tool; it is not designed or optimized for performance.
- spark.executor properties are meaningless in local mode, because there is only one JVM running (the Spark driver) and only its configuration is used (see the sketch below).
- Squeezing out and making available every bit of RAM and CPU for the Spark session is not the same as having the optimal configuration. It also looks like the same container hosts at least a database, which in that case would be starved of resources.
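A minimal sketch of the local-mode point, with assumed memory values: the driver and the "executors" share a single JVM, so only driver-side settings have any effect.

import findspark
findspark.init()

import pyspark

conf = (
    pyspark.SparkConf()
    .setMaster("local[4]")               # 4 worker threads inside one JVM
    .set("spark.driver.memory", "4g")    # sizes the only JVM there is
    .set("spark.executor.memory", "8g")  # meaningless here: no separate executor JVMs
)
sc = pyspark.SparkContext(appName="LocalModeDemo", conf=conf)
print(sc.master)  # local[4]
sc.stop()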
Additionally:
- The Kryo serializer can have minimal or no impact with PySpark and the SQL API.
- The claim that it is not possible to use the command line is wrong - it is perfectly possible by using PYSPARK_SUBMIT_ARGS (see the sketch after this list).
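A minimal sketch of the PYSPARK_SUBMIT_ARGS approach, with assumed option values: the variable carries spark-submit style arguments, must be set before the JVM starts, and has to end with "pyspark-shell".

import os

os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--driver-memory 4g "
    "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer "
    "pyspark-shell"  # required terminator when launching from a notebook
)

import findspark
findspark.init()

import pyspark

sc = pyspark.SparkContext(appName="SubmitArgsDemo")
print(sc.getConf().get("spark.serializer"))  # confirms the option was applied
sc.stop()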
Finally, there is no such thing as an optimal configuration that fits all scenarios. For example, if you use any Python code, "maximizing JVM memory allocation" will leave that Python code without the resources it requires (sketched below). At the same time, cores and memory are only a subset of the resources you have to tune; for many jobs IO (local disk IO, storage IO) is far more important.
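To illustrate the memory point, a sketch with assumed values: instead of handing the whole machine to the JVM, leave headroom for the Python worker processes that PySpark forks per task.

import pyspark

conf = (
    pyspark.SparkConf()
    .setMaster("local[4]")
    .set("spark.driver.memory", "6g")           # deliberately below total RAM
    .set("spark.python.worker.memory", "512m")  # per-worker cap before spilling to disk
)
sc = pyspark.SparkContext(appName="HeadroomDemo", conf=conf)
sc.stop()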