How can I read from S3 in pyspark running in local mode?
So Glennie's answer was close, but it wouldn't work in your case. The key thing is to select the right version of the dependencies: if you look at the JARs inside the virtual environment, everything points to one Hadoop version, 2.7.3, which is the version you also need to use:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'
You can verify the version your installation uses by checking venv/Lib/site-packages/pyspark/jars inside your project's virtual environment.
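Instead of eyeballing that directory, a small helper can pull the version out of the bundled JAR names. This is just a sketch: the filenames below are illustrative, and `hadoop_version_from_jars` is a hypothetical helper, not part of PySpark.

```python
import re

def hadoop_version_from_jars(jar_names):
    """Extract the Hadoop version from a jar name like 'hadoop-common-2.7.3.jar'."""
    for name in jar_names:
        m = re.match(r"hadoop-common-(\d+\.\d+\.\d+)\.jar", name)
        if m:
            return m.group(1)
    return None

# In practice you would list the real directory, e.g.:
#   import os
#   jars = os.listdir("venv/Lib/site-packages/pyspark/jars")
jars = ["hadoop-common-2.7.3.jar", "hadoop-client-2.7.3.jar", "py4j-0.10.7.jar"]

version = hadoop_version_from_jars(jars)

# Use the detected version to request the matching hadoop-aws package
submit_args = f'--packages "org.apache.hadoop:hadoop-aws:{version}" pyspark-shell'
print(submit_args)
```

Matching hadoop-aws to the Hadoop JARs already shipped with your PySpark install is what avoids the usual `ClassNotFoundException` / `NoSuchMethodError` mismatches.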
After that you can use the s3a:// scheme out of the box, or s3:// by defining the handler class for it:
# Only needed if you use s3://
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
s3File = sc.textFile("s3a://myrepo/test.csv")
print(s3File.count())
print(s3File.id())
Running this prints the file's line count and the RDD id.
spark-submit doesn't read a file from S3, it just gets stuck
The issue was caused by the Spark resource allocation manager. I solved it by reducing the requested resources. Why it worked when run with python3 test.py remains a mystery.
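For reference, the requested resources can be capped directly on the spark-submit command line. The values below are illustrative, not a recommendation; tune them to what your cluster can actually grant:

```shell
# Ask for modest resources so the allocation manager can actually satisfy
# the request instead of waiting forever for capacity
spark-submit \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --num-executors 2 \
  test.py
```

If the job previously hung with no error, a request that exceeds the cluster's available memory or cores is a common cause: the application stays in a pending state waiting for executors that can never be allocated.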
Locally reading S3 files through Spark (or better: pyspark)
The problem was actually a bug in Amazon's boto Python module. It was related to the fact that the MacPorts version is quite old: installing boto through pip solved the problem, and ~/.aws/credentials was then read correctly.
Now that I have more experience, I would say that in general (as of the end of 2015) Amazon Web Services tools and Spark/PySpark have patchy documentation and some serious bugs that are very easy to run into. For the first problem, I recommend updating the AWS command-line interface, boto, and Spark whenever something strange happens: this has "magically" solved a few issues for me already.
How to configure Spark running in local-mode on Amazon EC2 to use the IAM rules for S3
Switch to the s3a:// scheme (with the Hadoop 2.7.x JARs on your classpath) and this happens automatically. The s3:// scheme with non-EMR versions of Spark/Hadoop is not the connector you want: it is old, non-interoperable, and has been removed from recent releases.
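A minimal sketch of what that looks like in practice, assuming hadoop-aws 2.7.x is available and the EC2 instance has an IAM role attached; the bucket and key names are placeholders:

```python
import os

# Pull in the S3A connector; the version must match your bundled Hadoop JARs
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'
)

from pyspark import SparkContext

sc = SparkContext("local[*]", "s3a-iam-demo")

# Note: no access/secret keys are set here. With s3a, when no keys are
# configured, the connector falls back to the EC2 instance-profile (IAM role)
# credentials automatically.
rdd = sc.textFile("s3a://my-bucket/some/key.csv")
print(rdd.count())
```

The point is what is absent: no fs.s3a.access.key or fs.s3a.secret.key settings. On an EC2 instance with an IAM role, s3a obtains temporary credentials from the instance metadata service on its own.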