Pyspark - How to inspect variables within RDD operations
"...code does not stop at break point in Spark operations, like below..." - Could you please clarify your PyCharm version and OS?
"And another question is the prompt hint does not give right hint for instance from "map" function. Seems IDE does not know the variable from "map" function is still rdd..." - I believe it is related to this feature request https://youtrack.jetbrains.com/issue/PY-29811
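Beyond the IDE issue, breakpoints set in the driver process generally never fire inside rdd.map callbacks, because those run on the executors. A common workaround (a sketch, not from the answer above; the function name and data are hypothetical) is to factor the logic out of the lambda into a named function and debug it locally on plain Python data first:

```python
# Sketch: debug the per-record logic outside Spark, then reuse it in rdd.map.
def parse_record(line: str) -> tuple:
    # Any breakpoint or print() placed here works when the function is
    # called locally; inside rdd.map it would run on the executors instead.
    key, value = line.split(",")
    return key, int(value)

# Debug the function on plain Python data first...
sample = ["a,1", "b,2"]
print(list(map(parse_record, sample)))  # → [('a', 1), ('b', 2)]

# ...then hand the same function to Spark once it behaves correctly:
# rdd = sc.parallelize(sample)
# rdd.map(parse_record).collect()
```

The same function object works with both the built-in map and rdd.map, so nothing has to be rewritten once it is debugged.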
Debugging pyspark in ipdb-fashion
PYSPARK_DRIVER_PYTHON=ipython pyspark
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/
Using Python version 3.7.1 (default, Jun 16 2019 23:56:28)
SparkSession available as 'spark'.
In [1]: sc.stop()
In [2]: run -d main.py
Breakpoint 1 at /Users/andrii/work/demo/main.py:1
NOTE: Enter 'c' at the ipdb> prompt to continue execution.
> /Users/andrii/work/demo/main.py(1)<module>()
1---> 1 print(123)
2 import ipdb;ipdb.set_trace()
3 a = 2
4 b = 3
or
In [3]: run main.py
123
> /Users/andrii/work/demo/main.py(3)<module>()
2 import ipdb;ipdb.set_trace()
----> 3 a = 2
4 b = 3
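On Python 3.7+ the same effect can be had with the built-in breakpoint() instead of the explicit "import ipdb;ipdb.set_trace()" line; the PYTHONBREAKPOINT environment variable selects the debugger (e.g. PYTHONBREAKPOINT=ipdb.set_trace) or disables it (PYTHONBREAKPOINT=0). A sketch of main.py using it (the DEBUG guard is an assumption so the script also runs unattended):

```python
# Sketch of main.py using the built-in breakpoint() (Python 3.7+).
import os

print(123)
if os.environ.get("DEBUG"):  # hypothetical guard: only break when DEBUG is set
    breakpoint()  # honors PYTHONBREAKPOINT, e.g. PYTHONBREAKPOINT=ipdb.set_trace
a = 2
b = 3
print(a + b)  # → 5
```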
How can I set the default Spark logging level?
http://spark.apache.org/docs/latest/configuration.html#configuring-logging
Configuring Logging
Spark uses log4j for logging. You can configure it by adding a log4j.properties file in the conf directory. One way to start is to copy the existing log4j.properties.template located there.
The blog post "How to log in Apache Spark" (https://www.mapr.com/blog/how-log-apache-spark) suggests a way to configure log4j, including directing INFO-level logs into a file.
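A minimal sketch of such a conf/log4j.properties, assuming log4j 1.x (the version Spark 2.x ships with); the logger name and file path are placeholders:

```properties
# Quieten Spark's own chatter on the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Route INFO-level logs from your own application classes to a file,
# as the blog suggests (logger name and path are hypothetical)
log4j.logger.com.example.myapp=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/tmp/spark-app.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

At runtime you can also lower the level per session with sc.setLogLevel("WARN"), without touching the file.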
How can I determine where a Spark program is failing?
For cluster mode
you can go to the YARN Resource Manager UI and select the Tracking UI for your running application (which points to the Spark driver running on the Application Master within the YARN NodeManager). This opens the Spark UI, the core developer interface for debugging Spark apps.
For client mode
you can also go to the YARN RM UI as above, or hit the Spark UI directly at http://[driverHostname]:4040, where driverHostname is the master node in EMR and 4040 is the default port (this can be changed).
Additionally, you can access submitted and completed Spark apps via the Spark History Server at its default address: http://master-public-dns-name:18080/
These are the essential resources, with the Spark UI being the main toolkit for debugging.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-webui.html
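The same UI also exposes a REST API under /api/v1, which can be handier than clicking through pages when checking many jobs. A small sketch (the host is an assumption; a live driver is required for the commented request to work):

```python
# Sketch: query the Spark UI's REST API to list applications programmatically.
import json
from urllib.request import urlopen


def ui_url(driver_host: str, port: int = 4040) -> str:
    # Build the base REST endpoint exposed by a running Spark driver's UI
    return f"http://{driver_host}:{port}/api/v1"


# Example (needs a live driver, so it is not executed here):
# apps = json.load(urlopen(ui_url("localhost") + "/applications"))
# print([app["id"] for app in apps])
```

The History Server serves the same API on its own port (18080 by default), so the helper works for completed apps too if you point it there.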