How to Link PyCharm with PySpark

How to link PyCharm with PySpark?

With PySpark package (Spark 2.2.0 and later)

With SPARK-1267 merged, you should be able to simplify the process by pip-installing PySpark in the environment you use for PyCharm development.

  1. Go to File -> Settings -> Project Interpreter
  2. Click the install button and search for PySpark

  3. Click the Install Package button.
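
To confirm the package is usable from the configured interpreter, you can run a small script like the following (a minimal sketch; the app name and sample data are arbitrary, and it assumes a working local Java installation):

    from pyspark.sql import SparkSession

    # Start a local Spark session; this works once the pyspark package is
    # installed in the interpreter PyCharm is using.
    spark = SparkSession.builder.master("local[*]").appName("pycharm-test").getOrCreate()

    # Build a tiny DataFrame and show it to verify the setup end to end.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
    df.show()

    spark.stop()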

Manually with a user-provided Spark installation

Create Run configuration:

  1. Go to Run -> Edit configurations
  2. Add new Python configuration
  3. Set Script path so it points to the script you want to execute
  4. Edit Environment variables field so it contains at least:

    • SPARK_HOME - it should point to the directory with the Spark installation. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf, spark-env.sh, etc.)
    • PYTHONPATH - it should contain $SPARK_HOME/python and optionally $SPARK_HOME/python/lib/py4j-some-version-src.zip if not available otherwise. some-version should match the Py4J version used by the given Spark installation (0.8.2.1 - 1.5, 0.9 - 1.6, 0.10.3 - 2.0, 0.10.4 - 2.1, 0.10.4 - 2.2, 0.10.6 - 2.3, 0.10.7 - 2.4). For a programmatic alternative, see the sketch after this list.

  5. Apply the settings
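
If you prefer not to rely on the run configuration, the same variables can be set at the top of the script itself before anything from pyspark is imported. The sketch below assumes a hypothetical installation path (/opt/spark) and Py4J version; adjust both to match your setup:

    import os
    import sys

    # Placeholder path - point this at your own Spark installation.
    SPARK_HOME = "/opt/spark"
    os.environ["SPARK_HOME"] = SPARK_HOME

    # Make the PySpark sources and the bundled Py4J importable
    # (the Py4J version here is an example; use the one your Spark ships).
    sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
    sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.10.7-src.zip"))

    # Import only after the paths are in place.
    from pyspark import SparkContext

    sc = SparkContext(master="local[*]", appName="pycharm-spark-test")
    print(sc.parallelize(range(100)).sum())
    sc.stop()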

Add PySpark library to the interpreter path (required for code completion):

  1. Go to File -> Settings -> Project Interpreter
  2. Open settings for an interpreter you want to use with Spark
  3. Edit the interpreter paths so they contain the path to $SPARK_HOME/python (and Py4J if required)
  4. Save the settings

Optionally

  1. Install, or add to the path, type annotations matching the installed Spark version to get better completion and static error detection (disclaimer: I am the author of the project).

Finally

Use the newly created configuration to run your script.

configuring pycharm IDE for pyspark - first script exception

You need to configure PyCharm to use the Python from Spark as the SDK rather than the Python installation on your machine. It seems your code is picking up the installed Python 2.7.
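
A quick way to confirm which interpreter the run configuration actually resolved to is to print it from the script itself (plain Python, nothing Spark-specific):

    import sys

    # Shows the interpreter PyCharm is running; a 2.7 path here means the
    # wrong SDK is still selected in the project settings.
    print(sys.executable)
    print(sys.version)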

The steps are the same as in the manual setup above: create a run configuration whose environment variables set SPARK_HOME to the Spark installation directory and PYTHONPATH to $SPARK_HOME/python (plus the matching py4j-some-version-src.zip if needed), add $SPARK_HOME/python to the interpreter paths for code completion, then use the newly created configuration to run your script.

For Spark 2.2.0 and later, with SPARK-1267 merged you can simplify the process by pip-installing PySpark in the environment you use for PyCharm development, as described at the top of this page.

working on pyspark from pycharm

This is due to a limitation of PyCharm's static analysis, since PySpark generates some of its functions on the fly. I have opened an issue with PyCharm (https://youtrack.jetbrains.com/issue/PY-20200), which provides some workarounds; they basically amount to writing some interface code manually.

Update:

If you look at this thread you can see some progress on the topic. It has a working interface for most things, and there is some more information on it there.
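
To give a rough idea of what such interface code looks like, here is a hypothetical, simplified excerpt of a hand-written .pyi stub covering a couple of the dynamically generated functions in pyspark.sql.functions (an illustration only, not the actual interface referenced above):

    # pyspark/sql/functions.pyi - hypothetical, simplified excerpt
    from pyspark.sql.column import Column

    # Declaring the dynamically generated helpers explicitly lets PyCharm
    # offer completion and static checking for them.
    def col(name: str) -> Column: ...
    def lit(value) -> Column: ...
    def count(col) -> Column: ...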


