How can I read from S3 in pyspark running in local mode?
So Glennie's answer was close, but it wouldn't work in your case. The key thing is to select the right version of the dependencies: if you look at the JARs inside the virtual environment, everything points to one Hadoop version, 2.7.3, which is the version you also need to use:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'
You can verify the version your installation uses by checking venv/Lib/site-packages/pyspark/jars inside your project's virtual environment.
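Instead of eyeballing that directory, a small helper can pull the version out of the bundled JAR names. This is just a sketch: the filenames below are illustrative, and `hadoop_version_from_jars` is a hypothetical helper, not part of PySpark.

```python
import re

def hadoop_version_from_jars(jar_names):
    """Extract the Hadoop version from a jar name like 'hadoop-common-2.7.3.jar'."""
    for name in jar_names:
        m = re.match(r"hadoop-common-(\d+\.\d+\.\d+)\.jar", name)
        if m:
            return m.group(1)
    return None

# In practice you would list the real directory, e.g.:
#   import os
#   jars = os.listdir("venv/Lib/site-packages/pyspark/jars")
jars = ["hadoop-common-2.7.3.jar", "hadoop-client-2.7.3.jar", "py4j-0.10.7.jar"]

version = hadoop_version_from_jars(jars)

# Use the detected version to request the matching hadoop-aws package
submit_args = f'--packages "org.apache.hadoop:hadoop-aws:{version}" pyspark-shell'
print(submit_args)
```

Matching hadoop-aws to the Hadoop JARs already shipped with your PySpark install is what avoids the usual `ClassNotFoundException` / `NoSuchMethodError` mismatches.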
After that you can use the s3a:// scheme out of the box, or s3:// by defining the handler class for it:
# Only needed if you use s3://
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
s3File = sc.textFile("s3a://myrepo/test.csv")
print(s3File.count())
print(s3File.id())
Running this prints the file's line count and the RDD id.
spark-submit doesn't read a file from S3, it just gets stuck
The issue was caused by the Spark resource allocation manager. I solved it by reducing the requested resources. Why it worked when run with python3 test.py remains a mystery.
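For reference, the requested resources can be capped directly on the spark-submit command line. The values below are illustrative, not a recommendation; tune them to what your cluster can actually grant:

```shell
# Ask for modest resources so the allocation manager can actually satisfy
# the request instead of waiting forever for capacity
spark-submit \
  --driver-memory 1g \
  --executor-memory 1g \
  --executor-cores 1 \
  --num-executors 2 \
  test.py
```

If the job previously hung with no error, a request that exceeds the cluster's available memory or cores is a common cause: the application stays in a pending state waiting for executors that can never be allocated.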
Locally reading S3 files through Spark (or better: pyspark)
The problem was actually a bug in Amazon's boto Python module. It was related to the fact that the MacPorts version is quite old: installing boto through pip solved the problem, and ~/.aws/credentials was then read correctly.
Now that I have more experience, I would say that in general (as of the end of 2015) Amazon Web Services tools and Spark/PySpark have patchy documentation and some serious bugs that are very easy to run into. For the first problem, I recommend updating the AWS command-line interface, boto, and Spark whenever something strange happens: this has "magically" solved a few issues for me already.
How to configure Spark running in local-mode on Amazon EC2 to use the IAM rules for S3
Switch to the s3a:// scheme (with the Hadoop 2.7.x JARs on your classpath) and this happens automatically. The s3:// scheme with non-EMR versions of Spark/Hadoop is not the connector you want: it is old, non-interoperable, and has been removed from recent releases.
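A minimal sketch of what that looks like in practice, assuming hadoop-aws 2.7.x is available and the EC2 instance has an IAM role attached; the bucket and key names are placeholders:

```python
import os

# Pull in the S3A connector; the version must match your bundled Hadoop JARs
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'
)

from pyspark import SparkContext

sc = SparkContext("local[*]", "s3a-iam-demo")

# Note: no access/secret keys are set here. With s3a, when no keys are
# configured, the connector falls back to the EC2 instance-profile (IAM role)
# credentials automatically.
rdd = sc.textFile("s3a://my-bucket/some/key.csv")
print(rdd.count())
```

The point is what is absent: no fs.s3a.access.key or fs.s3a.secret.key settings. On an EC2 instance with an IAM role, s3a obtains temporary credentials from the instance metadata service on its own.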