Apache Pig: Load a file that shows fine using hadoop fs -text

According to HDFS Documentation, hadoop fs -text <file> can be used on "zip and TextRecordInputStream" data, so your data may be in one of these formats.

If the file was compressed, Hadoop would normally add the extension when writing to HDFS. If the extension is missing, you could test this by unzipping/ungzipping/unbzip2ing the file locally. It appears Pig should do this decompression automatically, but it may require the file extension to be present (e.g. part-r-00000.zip).

I'm not too sure about TextRecordInputStream. It sounds like it would just be Pig's default method, but I could be wrong. I didn't see any mention of LOAD'ing this data via Pig when I did a quick Google search.
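One quick way to narrow this down is to look at the first bytes of the file: a Hadoop SequenceFile starts with the three bytes SEQ. Here is a minimal sketch of that check. The helper name is_sequence_file is made up, and the printf line only simulates a file header so the snippet can be tried without a cluster.

```shell
# Hypothetical helper: succeeds if the data on stdin starts with the bytes "SEQ",
# which is how a Hadoop SequenceFile header begins.
is_sequence_file() {
  [ "$(head -c 3)" = "SEQ" ]
}

# Intended use against HDFS (the path is a placeholder):
#   hadoop fs -cat mydir/part-r-00000 | is_sequence_file && echo "SequenceFile"

# Local simulation with a fake header:
printf 'SEQ\006fake-header' | is_sequence_file && echo "looks like a SequenceFile"
```

If the check fails, the file is more likely plain or compressed text, and the extension advice above applies.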

Since you've discovered it is a sequence file, here's how you can load it using PiggyBank:

-- using Cloudera directory structure:
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
--REGISTER /home/hadoop/lib/pig/piggybank.jar
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();

-- Sample job: grab counts of tweets by day
A = LOAD 'mydir/part-r-*' -- Hadoop globs do not support the shell's {00..99} range, so a wildcard is used here
USING SequenceFileLoader AS (key:long, val:long, etc.);

Load multiple files with PigLatin (Hadoop)

As suggested in other comments, you can do this by pre-processing the file. Suppose your HDFS file is called file_list.txt, then you can do the following:

pig -param flist="$(hdfs dfs -cat file_list.txt | awk 'BEGIN{ORS="";}{if (NR == 1) print; else print ","$0;}')" script.pig

The awk code gets rid of the newline characters and uses commas to separate the file names.

In your script (called script.pig in my example), you should use parameter substitution to load the data:

data = LOAD '$flist';
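To see what the awk step produces without touching HDFS, here is a small sketch. The helper name join_paths and the sample paths are made up for illustration, and printf stands in for the hdfs dfs -cat call.

```shell
# Hypothetical helper: join newline-separated paths from stdin with commas.
# Same awk program as in the pig command line above.
join_paths() {
  awk 'BEGIN{ORS="";}{if (NR == 1) print; else print ","$0;}'
}

# Stand-in for `hdfs dfs -cat file_list.txt`; these paths are invented.
printf '%s\n' /data/a.txt /data/b.txt /data/c.txt | join_paths
echo   # the awk program prints no trailing newline, so add one for display
# prints: /data/a.txt,/data/b.txt,/data/c.txt
```

The resulting comma-separated string is what ends up in $flist, and LOAD accepts a comma-separated list of paths.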

Pig Latin: Load multiple files from a date range (part of the directory structure)

Pig is processing your file name pattern using the Hadoop file glob utilities, not the shell's glob utilities. Hadoop's are documented in the FileSystem.globStatus API documentation, and Hadoop does not support the '..' operator for a range. It seems to me you have two options: either write out the {date1,date2,date3,...,dateN} list by hand, which is probably the way to go if this is a rare use case, or write a wrapper script that generates the list for you. Building such a list from a date range should be a trivial task in the scripting language of your choice. For my application, I've gone with the generated list route, and it's working fine (CDH3 distribution).
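For what it's worth, a wrapper along those lines could look like the sketch below. It is not the script I actually use: the function name date_glob is made up, it assumes GNU date (for the -d option) and dates written as YYYY-MM-DD, and your directory naming may well differ.

```shell
# Hypothetical helper: print {d1,d2,...,dN} for an inclusive date range.
# Assumes GNU date and YYYY-MM-DD input. The body runs in a subshell
# (parentheses) so its variables do not leak into the caller.
date_glob() (
  current="$1"
  end_key=$(date -u -d "$2" +%Y%m%d) || exit 1
  list=""
  while :; do
    key=$(date -u -d "$current" +%Y%m%d) || exit 1
    [ "$key" -le "$end_key" ] || break
    list="${list:+$list,}$current"          # append with a comma separator
    current=$(date -u -d "$current + 1 day" +%Y-%m-%d) || exit 1
  done
  printf '{%s}\n' "$list"
)

date_glob 2011-12-30 2012-01-02
# prints: {2011-12-30,2011-12-31,2012-01-01,2012-01-02}
```

The output can then be handed to the script as a parameter, for example with pig -param dates="$(date_glob 2011-12-30 2012-01-02)" script.pig and a LOAD path that contains $dates (both the parameter name and the path layout are placeholders).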

Not able to filter data using Apache Pig

Since you haven't defined any load function, Pig will use PigStorage, in which the default delimiter is '\t'.

If part-m-00000 is a text file, then try setting the delimiter to ',':

Users1 = load '/user/training/user/part-m-00000' using PigStorage(',') 
as (user_id, name, age:int, country, gender);
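To illustrate why the delimiter matters, here is a rough shell analogy (awk instead of Pig, with an invented record and a made-up helper name count_fields). Splitting a comma-separated line on tabs yields a single field, so in Pig the whole line would land in the first column and the remaining columns would be null, which is why a filter on them returns nothing.

```shell
# Hypothetical helper: count the fields in a line for a given delimiter.
count_fields() {
  printf '%s\n' "$1" | awk -F"$2" '{ print NF }'
}

line='1,alice,30,US,F'      # made-up record in the assumed CSV layout

count_fields "$line" '\t'   # tab, the PigStorage default: prints 1
count_fields "$line" ','    # comma: prints 5
```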

If it's a SequenceFile then have a look at Dolan's or my answer on this question.

How to transfer files between machines in Hadoop and search for a string using Pig

For copying to Hadoop:
1. You can install the Hadoop client on the other machine and then run
hadoop fs -copyFromLocal from the command line.
2. You could simply write Java code that uses the FileSystem API to copy to Hadoop.

For Pig, assuming you know the second field may contain XYZTechnologies:

A = load '<input-hadoop-dir>' using PigStorage() as (X:chararray,Y:chararray);
-- There should not be "(" and ")" after 'matches'
B = FILTER A BY Y matches '.*XYZTechnologies.*';
STORE B INTO '<output-hadoop-dir>' USING PigStorage();
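The leading and trailing .* are needed because matches tests the whole string against the pattern, as Java's String.matches does. A rough shell analogy, using grep -x for whole-line matching (the helper name and the sample value are invented, and grep's regex dialect is not identical to Java's):

```shell
# Hypothetical helper: whole-string regex match, similar in spirit to Pig's matches.
matches() {
  printf '%s\n' "$1" | grep -Eqx "$2"
}

value='Acme XYZTechnologies Ltd'   # made-up field value

matches "$value" '.*XYZTechnologies.*' && echo "match with .* on both sides"
matches "$value" 'XYZTechnologies'     || echo "no match without the .*"
```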

How to Get Pig to Work with lzo Files?

I recently got this to work and wrote up a wiki page on it for my coworkers. Here's an excerpt detailing how to get Pig to work with LZO files. Hope this helps someone!

NOTE: This is written with a Mac in mind. The steps will be almost identical for other operating systems, and this should give you what you need to know to configure things on Windows or Linux, but you will need to extrapolate a bit (for example, change the Mac-centric folders to whatever your OS uses).

Hooking PIG up to be able to work with LZOs

This was by far the most annoying and time-consuming part for me, not because it's difficult, but because there are 50 different tutorials online, none of which are all that helpful. Anyway, what I did to get this working is:

  1. Clone hadoop-lzo from github at https://github.com/kevinweil/hadoop-lzo.

  2. Compile it to get a hadoop-lzo*.jar and the native *.o libraries. You'll need to compile
    this on a 64-bit machine.

  3. Copy the native libs to $HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/.

  4. Copy the java jar to $HADOOP_HOME/lib and $PIG_HOME/lib

  5. Then configure hadoop and pig to have the property java.library.path
    point to the lzo native libraries. You can do this in $HADOOP_HOME/conf/mapred-site.xml with:

  6. Now try out the Grunt shell by running pig again, and make sure everything still works. If it doesn't, you probably messed something up in mapred-site.xml and should double-check it.

  7. Great! We're almost there. All you need to do now is install elephant-bird. You can get that from https://github.com/kevinweil/elephant-bird (clone it).

  8. Now, in order to get elephant-bird to work, you'll need quite a few pre-reqs. These are listed on the page mentioned above, and might change, so I won't specify them here. What I will mention is that the versions on these are very important. If you get an incorrect version and try running ant, you will get errors. So, don't try grabbing the pre-reqs from brew or macports as you'll likely get a newer version. Instead, just download tarballs and build for each.

  9. Run ant in the elephant-bird folder to create a jar.

  10. For simplicity's sake, move all relevant jars (hadoop-lzo-x.x.x.jar and elephant-bird-x.x.x.jar) that you'll need to register frequently somewhere you can easily find them. /usr/local/lib/hadoop/... works nicely.

  11. Try things out! Play around with loading normal files and lzos in grunt shell. Register the relevant jars mentioned above, try loading a file, limiting output to a manageable number, and dumping it. This should all work fine whether you're using a normal text file or an lzo.

How to force STORE (overwrite) to HDFS in Pig?

At last I found a solution on grokbase. Since finding the solution took too long I will reproduce it here and add to it.

Suppose you want to store your output using the statement

STORE Relation INTO 'foo/bar';

Then, in order to delete the directory, you can call at the start of the script

rmf foo/bar

No ";" or quotation marks are required, since rmf is a Grunt shell command rather than a Pig Latin statement.

I cannot reproduce it now, but at some point I got an error message (something about missing files), and I can only assume that rmf interfered with map/reduce. So I recommend putting the call before any relation declaration. Placing it after the SET, REGISTER and %default statements should be fine.


SET mapred.fairscheduler.pool 'inhouse';
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
%default name 'foobar'
rmf foo/bar
Rel = LOAD 'something.tsv';
STORE Rel INTO 'foo/bar';

