How to Include a Python Package with Hadoop Streaming Job

How can I include a python package with Hadoop streaming job?

I would zip up the package into a .tar.gz or a .zip and pass the entire tarball or archive in a -file option to your hadoop command. I've done this in the past with Perl but not Python.

That said, I would think this would still work for you if you use Python's zipimport (http://docs.python.org/library/zipimport.html), which lets you import modules directly from a zip.
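For example, here is a minimal sketch of the zipimport approach; the archive name mylib.zip and package name mylib are placeholders, not from the original answer:

import zipimport

# The zip shipped with -file lands in the task's local working directory,
# so it can be opened by its basename.
importer = zipimport.zipimporter('mylib.zip')
mylib = importer.load_module('mylib')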

How to use a file in a hadoop streaming job using python?

hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
-mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
-input test/input.txt -output test/output -file '../user_ids'

Does ../user_ids exist on your local filesystem when you execute the job? If it does, then you need to amend your mapper code to account for the fact that this file will be available in the local working directory of the mapper at runtime:

f = open('user_ids','r')
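If it helps, here is a hedged sketch of a mapper that uses the shipped file; the tab-separated input layout and the contents of user_ids are assumptions, not part of the original question:

import sys

# -file '../user_ids' places the file in the working directory under its basename
with open('user_ids') as f:
    user_ids = set(line.strip() for line in f)

for line in sys.stdin:
    user_id = line.split('\t', 1)[0]   # assume the user id is the first tab-separated field
    if user_id in user_ids:
        sys.stdout.write(line)         # emit only records for known users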

Managing dependencies with Hadoop Streaming?

If they're not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zipfile, which will be unpacked for you. Here's a Hadoop 0.17 invocation:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.17.0-streaming.jar \
-mapper mapper.py -reducer reducer.py \
-input input/foo -output output \
-file /tmp/foo.py -file /tmp/lib.zip
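As a hedged sketch (not from the original answer), the mapper can then import from the shipped zip by putting it on sys.path; Python can import modules straight out of a zip file even if it is never unpacked:

import sys

# lib.zip lands in the task's working directory alongside mapper.py
sys.path.insert(0, 'lib.zip')

import mymodule   # 'mymodule' is a placeholder for a module packaged inside lib.zip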

However, see this issue for a caveat:

https://issues.apache.org/jira/browse/MAPREDUCE-596

Hadoop: How to include third party library in Python MapReduce

The problem was solved with zipimport.

I zipped chardet into a file named module.mod and used it like this:

import zipimport

importer = zipimport.zipimporter('module.mod')  # module.mod ships with the job via -file
chardet = importer.load_module('chardet')

Then add -file module.mod to the hadoop streaming command.
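For reference, the full streaming invocation might look something like this sketch (the jar path and input/output paths are illustrative, reused from the earlier question, not from this answer):

hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar \
-file ./mapper.py -mapper ./mapper.py \
-file ./reducer.py -reducer ./reducer.py \
-file module.mod \
-input test/input.txt -output test/output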

Now chardet can be used in the script.

More details are in: How can I include a python package with Hadoop streaming job?

hadoop streaming with python modules

After reviewing sent_tokenize's source code, it looks like both nltk.sent_tokenize and nltk.tokenize.sent_tokenize rely on a pickle file (the one used for punkt tokenization) to operate.

Since this is Hadoop streaming, you'd have to figure out where and how to place that pickle file in the zipped code module that gets added to the hadoop job's jar.
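If you do want to go that route, one possibility (a sketch under assumptions, not from the original answer) is to ship an nltk_data directory with the job and point NLTK at it before tokenizing; the directory name and layout here are hypothetical:

import os
import nltk

# assume an nltk_data directory (containing tokenizers/punkt) was shipped
# with the job and sits in the task's local working directory
nltk.data.path.insert(0, os.path.join(os.getcwd(), 'nltk_data'))

from nltk.tokenize import sent_tokenize
sentences = sent_tokenize("First sentence. Second sentence.")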

Bottom line? I recommend using the RegexpTokenizer class to do sentence and word level tokenization.
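Here is a minimal sketch of that recommendation; the regular expressions are illustrative choices, not taken from the original answer:

from nltk.tokenize import RegexpTokenizer

sentence_tokenizer = RegexpTokenizer(r'[.!?]+\s+', gaps=True)   # split on sentence-ending punctuation
word_tokenizer = RegexpTokenizer(r'\w+')                        # keep runs of word characters

text = "Hadoop streaming ships scripts to the cluster. No pickle file is needed here."
for sentence in sentence_tokenizer.tokenize(text):
    print(word_tokenizer.tokenize(sentence))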
