Calling a mapreduce job from a simple java program
Oh please don't do it with runJar, the Java API is very good.
See how you can start a job from normal code:
// create a configuration
Configuration conf = new Configuration();
// create a new job based on the configuration
// (on Hadoop 2 and later this constructor is deprecated; prefer Job.getInstance(conf))
Job job = new Job(conf);
// here you have to put your mapper class
job.setMapperClass(Mapper.class);
// here you have to put your reducer class
job.setReducerClass(Reducer.class);
// here you have to set the jar which is containing your
// map/reduce class, so you can use the mapper class
job.setJarByClass(Mapper.class);
// key/value of your reducer output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// this is setting the format of your input, can be TextInputFormat
job.setInputFormatClass(SequenceFileInputFormat.class);
// same with output
job.setOutputFormatClass(TextOutputFormat.class);
// here you can set the path of your input
SequenceFileInputFormat.addInputPath(job, new Path("files/toMap/"));
// this deletes possible output paths to prevent job failures
FileSystem fs = FileSystem.get(conf);
Path out = new Path("files/out/processed/");
fs.delete(out, true);
// finally set the empty out path
TextOutputFormat.setOutputPath(job, out);
// this waits until the job completes and prints debug out to STDOUT or whatever
// has been configured in your log4j properties.
job.waitForCompletion(true);
If you are using an external cluster, you have to add the following settings to your configuration:
// this should be like defined in your mapred-site.xml
conf.set("mapred.job.tracker", "jobtracker.com:50001");
// like defined in core-site.xml
conf.set("fs.default.name", "hdfs://namenode.com:9000");
This should be no problem when the hadoop-core.jar is in your application container's classpath.
But I think you should add some kind of progress indicator to your web page, because a Hadoop job may take minutes to hours to complete ;)
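A minimal sketch of the polling loop behind such a progress indicator. With Hadoop on the classpath, the usual approach would be to call job.submit() (which returns immediately) instead of job.waitForCompletion(true), and then poll job.isComplete(), job.mapProgress() and job.reduceProgress(). The JobStatus interface and fakeJob helper below are hypothetical stand-ins, so the sketch runs without a cluster:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch of polling a long-running job so a web page can show progress. */
final class ProgressPoller {

    /** Hypothetical stand-in for the status calls of a submitted job. */
    interface JobStatus {
        boolean isComplete();
        float progress(); // fraction in [0, 1]
    }

    /**
     * Polls the job until it completes, reporting each progress reading.
     * Returns every value that was reported, in order.
     */
    static List<Float> pollUntilDone(JobStatus job, long intervalMillis,
                                     Consumer<Float> report) {
        List<Float> seen = new ArrayList<>();
        while (!job.isComplete()) {
            float p = job.progress();
            seen.add(p);
            report.accept(p);
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // Stop polling if interrupted, but keep the interrupt flag set.
                Thread.currentThread().interrupt();
                break;
            }
        }
        // Report one last reading after the loop ends.
        float last = job.progress();
        seen.add(last);
        report.accept(last);
        return seen;
    }

    /** Fake job that advances by one step on every progress reading. */
    static JobStatus fakeJob(int steps) {
        return new JobStatus() {
            private int done = 0;
            @Override public boolean isComplete() { return done >= steps; }
            @Override public float progress() {
                float p = (float) done / steps;
                if (done < steps) done++;
                return p;
            }
        };
    }

    public static void main(String[] args) {
        pollUntilDone(fakeJob(4), 10L,
                p -> System.out.printf("progress: %.0f%%%n", p * 100));
    }
}
```

In a web application the report callback would typically store the latest value somewhere the page can read it (for example a session attribute), and the polling itself would run on a background thread rather than the request thread.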
For YARN (Hadoop 2 and later)
For YARN, the following configurations need to be set.
// this should be like defined in your yarn-site.xml
conf.set("yarn.resourcemanager.address", "yarn-manager.com:50001");
// framework is now "yarn", should be defined like this in mapred-site.xml
conf.set("mapreduce.framework.name", "yarn");
// like defined in core-site.xml
// (on Hadoop 2 this key is deprecated in favor of fs.defaultFS, but it is still accepted)
conf.set("fs.default.name", "hdfs://namenode.com:9000");
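One way to keep the two variants apart is to build the property set in one place and apply it to the Configuration afterwards. The following is only a sketch: the ClusterSettings class and its forCluster method are hypothetical names, and the host names are the placeholders used above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: collects the cluster properties for classic MapReduce or YARN. */
final class ClusterSettings {

    /** Builds the property set; keys are the ones used in the snippets above. */
    static Map<String, String> forCluster(boolean yarn, String schedulerAddress,
                                          String hdfsUri) {
        Map<String, String> props = new LinkedHashMap<>();
        if (yarn) {
            props.put("yarn.resourcemanager.address", schedulerAddress);
            props.put("mapreduce.framework.name", "yarn");
        } else {
            props.put("mapred.job.tracker", schedulerAddress);
        }
        props.put("fs.default.name", hdfsUri);
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> props =
                forCluster(true, "yarn-manager.com:50001", "hdfs://namenode.com:9000");
        // With Hadoop on the classpath this would be: props.forEach(conf::set);
        props.forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```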
How to call a custom method after MapReduce job completion using the Hadoop Java API?
You're calling the code in Location 1 before you call jb.waitForCompletion(true). You need to call it after (and obviously not call System.exit()). So:
jb.waitForCompletion(true);
//Run your code
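Since waitForCompletion returns a boolean success flag, the follow-up code can be made conditional on it. A small sketch of that ordering, where AfterJob and runThen are hypothetical names and a Callable stands in for the blocking job call so the example runs without Hadoop:

```java
import java.util.concurrent.Callable;

/** Sketch: run custom code after a blocking job call, then decide the exit code. */
final class AfterJob {

    /**
     * Runs the blocking job call, then the follow-up only if the job succeeded.
     * Returns an exit code instead of calling System.exit() itself.
     */
    static int runThen(Callable<Boolean> blockingJob, Runnable onSuccess) {
        boolean ok;
        try {
            ok = blockingJob.call();
        } catch (Exception e) {
            // Treat any failure of the job call as an unsuccessful job.
            ok = false;
        }
        if (ok) {
            onSuccess.run();
        }
        return ok ? 0 : 1;
    }

    public static void main(String[] args) {
        // Stand-in for: () -> jb.waitForCompletion(true)
        int code = runThen(() -> true,
                () -> System.out.println("job finished, running custom code"));
        System.exit(code); // exit only after the follow-up has run
    }
}
```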
Run MapReduce Job from a web application
For a map-reduce program to run, you need to have the jackson-mapper-asl-*.jar and jackson-core-asl-*.jar files present on your map-reduce program's class-path. The actual jar file names will vary based on the Hadoop distribution and version you are using.
These files are present under the $HADOOP_HOME/lib folder.
Two ways to solve this problem:
1. Invoke the map-reduce program using the hadoop jar command. This will ensure that all the required jar files are automatically included in your map-reduce program's class-path.
2. If you wish to trigger a map-reduce job from your application, make sure you include these jar files (and other necessary jar files) in your application class-path, so that when you spawn a map-reduce program it automatically picks up the jar files from the application class-path.
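For the second option, a quick sanity check at startup can tell you whether the jars are visible at all. The sketch below scans a class-path string for jar files whose names start with the two prefixes mentioned above; ClasspathCheck and missing are hypothetical names. Note the assumption: inside a servlet container, jars under WEB-INF/lib are loaded by the container's class loader and usually do not appear in the java.class.path system property, so this check is mainly useful for standalone applications.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: reports which required jars are absent from a class-path string. */
final class ClasspathCheck {

    /** Returns the prefixes for which no class-path entry is a matching .jar file. */
    static List<String> missing(String classPath, List<String> requiredPrefixes) {
        List<String> result = new ArrayList<>();
        String[] entries = classPath.split(File.pathSeparator);
        for (String prefix : requiredPrefixes) {
            boolean found = false;
            for (String entry : entries) {
                String name = new File(entry).getName();
                if (name.startsWith(prefix) && name.endsWith(".jar")) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                result.add(prefix);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> required = Arrays.asList("jackson-mapper-asl-", "jackson-core-asl-");
        String classPath = System.getProperty("java.class.path");
        System.out.println("missing: " + missing(classPath, required));
    }
}
```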
org.apache.hadoop.ipc.RemoteException: User: kohtianan is not allowed to impersonate hadoopUser
This error indicates that the user kohtianan does not have access to Hadoop DFS. What you can do is create a directory on HDFS (as the hdfs superuser) and change the owner of that directory to kohtianan. This should resolve your issue.
Running a Hadoop Job From another Java Program
If you don't have the mapper and reducer classes at compile time, set their names directly in the configuration like this:
conf.set("mapreduce.map.class", "org.what.ever.ClassName");
conf.set("mapreduce.reduce.class", "org.what.ever.ClassName");
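Because the class names are plain strings here, a typo only shows up when the job runs. As an optional safeguard, you could check that the name resolves before setting it. This is a sketch with hypothetical names (ClassNameCheck, isLoadable), and it only checks the class loader of the submitting JVM; the tasks on the cluster still need the class in the job jar.

```java
/** Sketch: checks that a class name given as a string can actually be loaded. */
final class ClassNameCheck {

    /** True if the named class can be located, without initializing it. */
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClassNameCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String mapper = "org.what.ever.ClassName"; // placeholder name from the text above
        if (!isLoadable(mapper)) {
            System.err.println("not on the class path: " + mapper);
        }
    }
}
```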
Start mapreduce job on hadoop 2.2 (Yarn) from java application
The MR Client API is the same for legacy MR and YARN. The properties can be set on the Configuration object instead of being specified in the XML configuration files. Check the Hadoop documentation for the required configurations to set up and execute a simple MR job in YARN.