Pyspark - How to inspect variables within RDD operations
"...code does not stop at break point in Spark operations, like below..." - Could you please clarify your PyCharm version and OS?
"And another question is the prompt hint does not give right hint for instance from "map" function. Seems IDE does not know the variable from "map" function is still rdd..." - I believe it is related to this feature request https://youtrack.jetbrains.com/issue/PY-29811
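Beyond the IDE issue, breakpoints set in the driver process generally never fire inside rdd.map callbacks, because those run on the executors. A common workaround (a sketch, not from the answer above; the function name and data are hypothetical) is to factor the logic out of the lambda into a named function and debug it locally on plain Python data first:

```python
# Sketch: debug the per-record logic outside Spark, then reuse it in rdd.map.
def parse_record(line: str) -> tuple:
    # Any breakpoint or print() placed here works when the function is
    # called locally; inside rdd.map it would run on the executors instead.
    key, value = line.split(",")
    return key, int(value)

# Debug the function on plain Python data first...
sample = ["a,1", "b,2"]
print(list(map(parse_record, sample)))  # → [('a', 1), ('b', 2)]

# ...then hand the same function to Spark once it behaves correctly:
# rdd = sc.parallelize(sample)
# rdd.map(parse_record).collect()
```

The same function object works with both the built-in map and rdd.map, so nothing has to be rewritten once it is debugged.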
Debugging pyspark in ipdb-fashion
PYSPARK_DRIVER_PYTHON=ipython pyspark
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/
Using Python version 3.7.1 (default, Jun 16 2019 23:56:28)
SparkSession available as 'spark'.
In [1]: sc.stop()
In [2]: run -d main.py
Breakpoint 1 at /Users/andrii/work/demo/main.py:1
NOTE: Enter 'c' at the ipdb> prompt to continue execution.
> /Users/andrii/work/demo/main.py(1)<module>()
1---> 1 print(123)
2 import ipdb;ipdb.set_trace()
3 a = 2
4 b = 3
or
In [3]: run main.py
123
> /Users/andrii/work/demo/main.py(3)<module>()
2 import ipdb;ipdb.set_trace()
----> 3 a = 2
4 b = 3
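On Python 3.7+ the same effect can be had with the built-in breakpoint() instead of the explicit "import ipdb;ipdb.set_trace()" line; the PYTHONBREAKPOINT environment variable selects the debugger (e.g. PYTHONBREAKPOINT=ipdb.set_trace) or disables it (PYTHONBREAKPOINT=0). A sketch of main.py using it (the DEBUG guard is an assumption so the script also runs unattended):

```python
# Sketch of main.py using the built-in breakpoint() (Python 3.7+).
import os

print(123)
if os.environ.get("DEBUG"):  # hypothetical guard: only break when DEBUG is set
    breakpoint()  # honors PYTHONBREAKPOINT, e.g. PYTHONBREAKPOINT=ipdb.set_trace
a = 2
b = 3
print(a + b)  # → 5
```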
How can I set the default Spark logging level?
http://spark.apache.org/docs/latest/configuration.html#configuring-logging
Configuring Logging
Spark uses log4j for logging. You can configure it by adding a log4j.properties file in the conf directory. One way to start is to copy the existing log4j.properties.template located there.
The blog post "How to log in Apache Spark" (https://www.mapr.com/blog/how-log-apache-spark) suggests a way to configure log4j, including directing INFO-level logs into a file.
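A minimal sketch of such a conf/log4j.properties, assuming log4j 1.x (the version Spark 2.x ships with); the logger name and file path are placeholders:

```properties
# Quieten Spark's own chatter on the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Route INFO-level logs from your own application classes to a file,
# as the blog suggests (logger name and path are hypothetical)
log4j.logger.com.example.myapp=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/tmp/spark-app.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

At runtime you can also lower the level per session with sc.setLogLevel("WARN"), without touching the file.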
How can I determine where a Spark program is failing?
For cluster mode
you can go to the YARN Resource Manager UI and select the Tracking UI for your running application (which points to the Spark driver running on the Application Master within the YARN NodeManager). This opens the Spark UI, the core developer interface for debugging Spark apps.
For client mode
you can also go to the YARN RM UI as above, or hit the Spark UI directly at http://[driverHostname]:4040, where driverHostname is the master node in EMR and 4040 is the default port (this can be changed).
Additionally, you can access submitted and completed Spark apps via the Spark History Server at its default address: http://master-public-dns-name:18080/
These are the essential resources, with the Spark UI being the main toolkit for debugging.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-webui.html
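The same UI also exposes a REST API under /api/v1, which can be handier than clicking through pages when checking many jobs. A small sketch (the host is an assumption; a live driver is required for the commented request to work):

```python
# Sketch: query the Spark UI's REST API to list applications programmatically.
import json
from urllib.request import urlopen


def ui_url(driver_host: str, port: int = 4040) -> str:
    # Build the base REST endpoint exposed by a running Spark driver's UI
    return f"http://{driver_host}:{port}/api/v1"


# Example (needs a live driver, so it is not executed here):
# apps = json.load(urlopen(ui_url("localhost") + "/applications"))
# print([app["id"] for app in apps])
```

The History Server serves the same API on its own port (18080 by default), so the helper works for completed apps too if you point it there.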