Calling a MapReduce Job from a Simple Java Program


Oh please don't do it with RunJar; the Java API is very good.

Here is how you can start a job from normal code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// create a configuration
Configuration conf = new Configuration();
// create a new job based on the configuration
Job job = new Job(conf);
// here you have to put your mapper class
job.setMapperClass(Mapper.class);
// here you have to put your reducer class
job.setReducerClass(Reducer.class);
// here you have to set the jar that contains your map/reduce classes,
// so the cluster nodes can load the mapper class
job.setJarByClass(Mapper.class);
// key/value types of your reducer output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// this sets the format of your input; it can also be TextInputFormat
job.setInputFormatClass(SequenceFileInputFormat.class);
// same for the output
job.setOutputFormatClass(TextOutputFormat.class);
// here you can set the path of your input
SequenceFileInputFormat.addInputPath(job, new Path("files/toMap/"));
// delete a possibly existing output path to prevent job failures
FileSystem fs = FileSystem.get(conf);
Path out = new Path("files/out/processed/");
fs.delete(out, true);
// finally set the (now empty) output path
TextOutputFormat.setOutputPath(job, out);

// this blocks until the job completes and prints progress to STDOUT, or to
// whatever has been configured in your log4j properties
job.waitForCompletion(true);

If you are using an external cluster, you have to add the following information to your configuration (these are the pre-YARN property names):

// should match the value defined in your mapred-site.xml
conf.set("mapred.job.tracker", "jobtracker.com:50001");
// should match the value defined in your core-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");

This should be no problem as long as hadoop-core.jar is on your application container's classpath.
But you should add some kind of progress indicator to your web page, because it may take minutes to hours to complete a Hadoop job ;)
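
Since the answer suggests a progress indicator, here is a minimal sketch of what it could poll, assuming you keep a reference to the Job: submit() returns immediately instead of blocking, and mapProgress()/reduceProgress() report fractions you can render in the page. The polling loop below is illustrative plumbing, not part of the original answer.

// non-blocking submission with progress polling (a sketch)
Job job = new Job(conf);
// ... mapper/reducer/input/output setup as shown above ...
job.submit(); // returns immediately, unlike waitForCompletion()

while (!job.isComplete()) {
    // both progress values are fractions between 0.0f and 1.0f
    System.out.printf("map: %.0f%%  reduce: %.0f%%%n",
            job.mapProgress() * 100, job.reduceProgress() * 100);
    Thread.sleep(5000);
}
System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");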

For YARN (Hadoop 2 and later)

On YARN, the following configuration properties need to be set.

// should match the value defined in your yarn-site.xml
conf.set("yarn.resourcemanager.address", "yarn-manager.com:50001");

// the framework is now "yarn"; it should be defined this way in mapred-site.xml
conf.set("mapreduce.framework.name", "yarn");

// as defined in your core-site.xml; fs.defaultFS is the
// non-deprecated name for this key in Hadoop 2
conf.set("fs.default.name", "hdfs://namenode.com:9000");

How to call a custom method after MapReduce job completion using the Hadoop Java API?

You're calling the code in Location 1 before you call jb.waitForCompletion(true). You need to call it after (and obviously not call System.exit()). So:

jb.waitForCompletion(true);
//Run your code
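
waitForCompletion(true) also returns whether the job succeeded, so you can guard the follow-up code on the result. A small sketch, where runPostProcessing() is a hypothetical stand-in for your own method:

// block until the job finishes; the return value tells you whether it succeeded
boolean success = jb.waitForCompletion(true);
if (success) {
    runPostProcessing(); // hypothetical: your custom post-job method
} else {
    System.err.println("job failed, skipping post-processing");
}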

Run MapReduce Job from a web application

For a map-reduce program to run, you need the jackson-mapper-asl-*.jar and jackson-core-asl-*.jar files on your map-reduce program's classpath. The actual jar file names will vary with the Hadoop distribution and version you are using.

These files are present under the $HADOOP_HOME/lib folder.
There are two ways to solve this problem:

  • Invoke the map-reduce program using the hadoop jar command. This ensures that all the required jar files are automatically included in your map-reduce program's classpath.

  • If you wish to trigger a map-reduce job from your own application, make sure you include these jar files (and any other necessary ones) in your application classpath, so that the spawned map-reduce program picks them up from there; a quick classpath sanity check is sketched after this list.
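
For the second option, a small fail-fast check at application startup can save a confusing failure at job-submission time. This is just a sketch, not part of the original answer; the class names are the ones shipped in the old Codehaus Jackson jars that Hadoop depends on:

// fail fast if Hadoop's Jackson dependencies are missing from the classpath
try {
    Class.forName("org.codehaus.jackson.map.ObjectMapper"); // jackson-mapper-asl
    Class.forName("org.codehaus.jackson.JsonFactory");      // jackson-core-asl
} catch (ClassNotFoundException e) {
    throw new IllegalStateException(
            "Hadoop's Jackson jars are not on the application classpath", e);
}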

org.apache.hadoop.ipc.RemoteException: User: kohtianan is not allowed to impersonate hadoopUser

This error indicates that the user kohtianan does not have access to HDFS. Create a directory on HDFS (as the HDFS superuser) and change the owner of that directory to kohtianan. This should resolve your issue.
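
If you would rather do that from Java than from the hdfs shell, here is a minimal sketch using the FileSystem API; the /user/kohtianan path and the supergroup group name are assumptions, and the code must run as the HDFS superuser:

// run as the HDFS superuser; path and group name are assumptions
FileSystem fs = FileSystem.get(conf);
Path home = new Path("/user/kohtianan");      // hypothetical home directory
fs.mkdirs(home);                              // create the directory
fs.setOwner(home, "kohtianan", "supergroup"); // hand it over to the user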

Running a Hadoop Job From another Java Program

If you don't have the mapper and reducer classes at compile time, then set the class names directly in the configuration, like this:

conf.set("mapreduce.map.class", "org.what.ever.ClassName");
conf.set("mapreduce.reduce.class", "org.what.ever.ClassName");

Start a MapReduce job on Hadoop 2.2 (YARN) from a Java application

The MR client API is the same for legacy MR and YARN. The properties can be set on the Configuration object instead of being specified in the XML configuration files. Check the Hadoop documentation for the required configuration to set up and execute a simple MR job on YARN.
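
Putting the pieces together, here is a minimal sketch of a YARN submission configured entirely in code; the host names, ports, and paths are the placeholders from the answers above, and MyMapper/MyReducer are hypothetical classes:

// everything the XML files would normally provide, set in code (a sketch)
Configuration conf = new Configuration();
conf.set("mapreduce.framework.name", "yarn");
conf.set("yarn.resourcemanager.address", "yarn-manager.com:50001");
conf.set("fs.default.name", "hdfs://namenode.com:9000");

Job job = Job.getInstance(conf); // Hadoop 2 replacement for new Job(conf)
job.setJarByClass(MyMapper.class);    // hypothetical mapper class
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class); // hypothetical reducer class
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
TextInputFormat.addInputPath(job, new Path("in/"));
TextOutputFormat.setOutputPath(job, new Path("out/"));
job.waitForCompletion(true);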


