How can I include a python package with Hadoop streaming job?
I would zip up the package into a .tar.gz or a .zip and pass the entire tarball or archive in a -file option to your hadoop command. I've done this in the past with Perl but not Python. That said, I would think this would still work for you if you use Python's zipimport (http://docs.python.org/library/zipimport.html), which allows you to import modules directly from a zip.
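For example, a minimal sketch of the mapper side, assuming the shipped archive is named mylib.zip and contains a package mypkg (both names are hypothetical; note that zipimport reads .zip archives, not tarballs):

import sys

# the archive shipped via -file lands in the task's working directory;
# putting it on sys.path lets Python's zip import machinery find it
sys.path.insert(0, 'mylib.zip')

import mypkg  # hypothetical package, imported straight from the zip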
How to use a file in a hadoop streaming job using python?
hadoop jar contrib/streaming/hadoop-streaming-1.1.1.jar -file ./mapper.py \
-mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py \
-input test/input.txt -output test/output -file '../user_ids'
Does ../user_ids exist on your local file path when you execute the job? If it does, then you need to amend your mapper code to account for the fact that this file will be available in the local working directory of the mapper at runtime:
f = open('user_ids','r')
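For example, a mapper that filters its input against the shipped ID list might look like this (a sketch; the tab-separated input layout and the one-id-per-line format of user_ids are assumptions):

#!/usr/bin/env python
import sys

# user_ids was shipped with -file, so it sits in the task's working directory
with open('user_ids') as f:
    user_ids = set(line.strip() for line in f)

for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    # emit only records whose first field is a known user id (assumed layout)
    if fields and fields[0] in user_ids:
        sys.stdout.write(line)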
Managing dependencies with Hadoop Streaming?
If they're not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zipfile, which will be unpacked for you. Here's a Hadoop 0.17 invocation:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.17.0-streaming.jar \
-mapper mapper.py -reducer reducer.py \
-input input/foo -output output \
-file /tmp/foo.py -file /tmp/lib.zip
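Inside mapper.py the shipped code can then be imported; here is a sketch, assuming lib.zip contains a module named lib (a hypothetical name), with a fallback for the case where the zip is not unpacked (see the JIRA caveat below):

try:
    # if lib.zip was unpacked into the working directory,
    # its contents are importable directly
    import lib
except ImportError:
    # otherwise import straight from the zip file itself
    import sys
    sys.path.insert(0, 'lib.zip')
    import lib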
However, see this issue for a caveat:
https://issues.apache.org/jira/browse/MAPREDUCE-596
Hadoop: How to include third party library in Python MapReduce
The problem was solved with zipimport. I zipped chardet into a file module.mod, and used it like this:

import zipimport

importer = zipimport.zipimporter('module.mod')
chardet = importer.load_module('chardet')

Then add -file module.mod to the hadoop streaming command. Now chardet can be used in the script.
More details are shown in: How can I include a python package with Hadoop streaming job?
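For reference, the archive needs the chardet package at its root for zipimporter to find it; here is a sketch that builds module.mod with the standard zipfile module, assuming a chardet/ directory under the current directory:

import os
import zipfile

# pack chardet so that 'chardet/__init__.py' sits at the archive root
with zipfile.ZipFile('module.mod', 'w', zipfile.ZIP_DEFLATED) as zf:
    for root, dirs, files in os.walk('chardet'):
        for name in files:
            path = os.path.join(root, name)
            zf.write(path, path)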
hadoop streaming with python modules
After reviewing sent_tokenize's source code, it looks like both nltk.sent_tokenize and nltk.tokenize.sent_tokenize rely on a pickle file (the one used to do punkt tokenization) to operate.
Since this is Hadoop Streaming, you'd have to figure out where/how to place that pickle file into the zipped code module that is added to the hadoop job's jar.
Bottom line? I recommend using the RegexpTokenizer class to do sentence- and word-level tokenization.
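A minimal sketch of that approach, assuming nltk itself is available on the task nodes (for instance shipped via the zip trick above); the patterns are illustrative choices, not nltk defaults:

from nltk.tokenize import RegexpTokenizer

# word-level tokens: runs of word characters
word_tokenizer = RegexpTokenizer(r'\w+')
# crude sentence splitter: text up to and including .!? punctuation
sent_tokenizer = RegexpTokenizer(r'[^.!?]+[.!?]?')

text = "Streaming sends raw lines. Tokenize them here!"
sentences = sent_tokenizer.tokenize(text)
words = word_tokenizer.tokenize(text)

Neither tokenizer touches the punkt pickle, so nothing extra has to be shipped with the job.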