Let me explain how to integrate Spring with Hadoop.
This POC shows how to run a word count over a flat file in the Hadoop file system using Spring for Apache Hadoop. (Spring for Apache Hadoop provides extensions to Spring, Spring Batch, and Spring Integration to build manageable and robust pipeline solutions around Hadoop.)
Spring for Hadoop integrates with the Spring Framework to create and run Hadoop MapReduce, Hive, and Pig jobs, as well as to work with HDFS and HBase.
1. Software Requirements
JDK level 6.0
Spring Framework 3.0 and above
Apache Hadoop 1.2.1 and above
Cloudera CDH3 (cdh3u5), CDH4 (cdh4.1.3 MRv1) distributions
Hortonworks Data Platform 1.3
Greenplum HD (1.2)
Any distro compatible with Apache Hadoop 1.x should be supported.
Note:
Hadoop YARN support is only available in Spring for Apache Hadoop version 2.0 and later.
2. Spring Hadoop
This section explains how to run a Hadoop MapReduce job using Spring IoC against CDH3, from Eclipse on Windows.
The Spring Hadoop namespace is
http://www.springframework.org/schema/hadoop
and its schema is
http://www.springframework.org/schema/hadoop/spring-hadoop.xsd
Steps:
1. Download the spring-hadoop jar.
2. Create a Java project in Eclipse.
3. Create a lib folder and place the required jars (spring-hadoop, the Spring Framework jars, the Hadoop jars, and their dependencies) inside lib.
4. Add all these jars to the Build Path.
5. Create a hadoop.properties file in the src folder:
hd.fs=value of fs.default.name
hd.jt=value of mapred.job.tracker
wordcount.input.path=inputpath
wordcount.output.path=outputpath
For example:
hd.fs=hdfs://javapandit1:9030
hd.jt=javapandit1:9010
wordcount.input.path=/user/hadoop/test/input/
wordcount.output.path=/user/hadoop/test/output
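Optionally, before wiring anything into Spring, a quick sanity check of the property values can save debugging time later; trailing whitespace in hd.fs or hd.jt breaks URI parsing (see note 1 under step 6). The helper class below is hypothetical and not part of the POC; it only loads hadoop.properties from the classpath and flags suspicious values.

package com.sample;

import java.io.InputStream;
import java.util.Properties;

// Hypothetical helper: loads hadoop.properties from the classpath root (src)
// and warns about trailing whitespace, which later causes a URISyntaxException.
public class PropertiesCheck {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        InputStream in = PropertiesCheck.class.getResourceAsStream("/hadoop.properties");
        props.load(in);
        in.close();

        for (String name : props.stringPropertyNames()) {
            String value = props.getProperty(name);
            if (!value.equals(value.trim())) {
                System.out.println("WARNING: value of '" + name + "' has trailing whitespace");
            }
        }
        System.out.println("hd.fs = [" + props.getProperty("hd.fs") + "]");
        System.out.println("hd.jt = [" + props.getProperty("hd.jt") + "]");
    }
}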
6. Place applicationContext.xml inside the package that will load it (com.sample, created in step 7); the hadoop.properties file from step 5 stays at the root of src.
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:hdp="http://www.springframework.org/schema/hadoop"
       xmlns:context="http://www.springframework.org/schema/context"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
           http://www.springframework.org/schema/beans/spring-beans-3.0.xsd
           http://www.springframework.org/schema/context
           http://www.springframework.org/schema/context/spring-context-3.0.xsd
           http://www.springframework.org/schema/hadoop
           http://www.springframework.org/schema/hadoop/spring-hadoop.xsd">

    <context:property-placeholder location="hadoop.properties"/>

    <hdp:configuration>
        fs.default.name=${hd.fs}
        mapred.job.tracker=${hd.jt}
    </hdp:configuration>

    <hdp:job id="wordcountJob"
             input-path="${wordcount.input.path}"
             output-path="${wordcount.output.path}"
             libs="file:/home/hadoop/hadoop-0.20.2-cdh3u0/hadoop-examples-*.jar"
             jar-by-class="org.apache.hadoop.examples.WordCount"
             mapper="org.apache.hadoop.examples.WordCount.TokenizerMapper"
             reducer="org.apache.hadoop.examples.WordCount.IntSumReducer"/>

    <hdp:job-runner id="runner" run-at-startup="true"
                    job-ref="wordcountJob"/>
</beans>
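For reference, the <hdp:job> and <hdp:job-runner> elements above correspond roughly to the plain Hadoop 1.x driver code below. Spring for Apache Hadoop builds and submits the job for us, so this class (the name PlainWordCountDriver is made up here) is only a sketch of what the XML is wiring, with the class names and paths taken from the configuration above.

package com.sample;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.WordCount;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch of the plain Hadoop driver that the <hdp:job>/<hdp:job-runner>
// configuration replaces; for illustration only.
public class PlainWordCountDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://javapandit1:9030");   // hd.fs
        conf.set("mapred.job.tracker", "javapandit1:9010");       // hd.jt

        Job job = new Job(conf, "wordcountJob");
        job.setJarByClass(WordCount.class);                       // jar-by-class
        job.setMapperClass(WordCount.TokenizerMapper.class);      // mapper
        job.setReducerClass(WordCount.IntSumReducer.class);       // reducer
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/hadoop/test/input/"));    // input-path
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/test/output"));  // output-path

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}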
Note:
1. Make sure there is no space at the end of the fs.default.name and mapred.job.tracker values; otherwise you will get a URI exception (a small demo after these notes illustrates this):
Caused by: java.net.URISyntaxException: Illegal character in authority at index 7: hdfs://javapandit1:9030
2. When running from Eclipse on a Windows system, the job jar must be set via setJarByClass; in applicationContext.xml this is done with the jar-by-class attribute.
Otherwise we will get the following error:
WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.examples.WordCount$TokenizerMapper
3. Be careful while setting the value of jar-by-class: it expects the fully qualified class name (including the package); otherwise you will get the following error.
Example:
jar-by-class="WordCount"
Caused by: org.springframework.beans.TypeMismatchException: Failed to convert property value of type 'java.lang.String' to required type 'java.lang.Class' for property 'jarByClass'; nested exception is java.lang.IllegalArgumentException: Cannot find class [WordCount]
Caused by: java.lang.ClassNotFoundException: WordCount
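To see why the trailing space in note 1 matters, the tiny snippet below reproduces the parse failure outside Hadoop; it is only an illustration (the class name UriSpaceDemo is made up):

import java.net.URI;

// Demonstrates the URISyntaxException from note 1: a trailing space in the
// fs.default.name value makes the authority part of the URI invalid.
public class UriSpaceDemo {

    public static void main(String[] args) throws Exception {
        System.out.println(new URI("hdfs://javapandit1:9030"));  // parses fine
        // Throws java.net.URISyntaxException:
        // Illegal character in authority at index 7
        System.out.println(new URI("hdfs://javapandit1:9030 "));
    }
}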
7. Create a class to run the wordcount job
package com.sample;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.context.support.AbstractApplicationContext;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class WordCount {

    private static final Log log = LogFactory.getLog(WordCount.class);

    public static void main(String[] args) throws Exception {
        // Loading the context triggers the job-runner bean (run-at-startup="true"),
        // which submits wordcountJob to the cluster. The XML is resolved relative
        // to this class's package (com/sample/applicationContext.xml).
        AbstractApplicationContext context = new ClassPathXmlApplicationContext(
                "applicationContext.xml", WordCount.class);
        log.info("Wordcount with HDFS copy Application Running");
        context.registerShutdownHook();
    }
}
8. Run the WordCount class as a Java application (Run As > Java Application).
9. If it runs successfully, the output files appear in the specified output path.
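To confirm the result, either browse HDFS directly or read the output from Java. The sketch below is a hypothetical helper (the class name CheckOutput is made up; the URI and path mirror the values used above) and is not part of the POC:

package com.sample;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper that prints the word counts written by the job.
public class CheckOutput {

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://javapandit1:9030"), new Configuration());

        // The reducers write files named part-r-00000, part-r-00001, ...
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop/test/output"))) {
            if (!status.getPath().getName().startsWith("part-")) {
                continue; // skip _SUCCESS and _logs
            }
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // "<word>\t<count>"
            }
            reader.close();
        }
        fs.close();
    }
}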
The final project structure: the src folder holds hadoop.properties at its root and the com.sample package with applicationContext.xml and WordCount.java; the lib folder holds the jars added to the build path.