LAB: Map Reduce for Line Count
- Dataset: Stack_Overflow_Tags/final_stack_data.zip
- Unzip the file and see the size of the data.
- The dataset contains Stack Overflow questions. The goal is to find the total number of questions in the file (the number of rows).
- Move the data to HDFS.
- Write a line count program to count the total number of lines in the file.
- Save the final output to a text file.
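As a sanity check on the goal above, the row count can be computed without Hadoop on a small local sample. The following is a minimal sketch in plain Java and is not part of the lab solution; the class name LocalLineCount and the choice of ISO-8859-1 decoding (used only so that unusual bytes never abort the read) are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: counts the lines of a local file, which is the
// number the MapReduce job is expected to report for the same data.
class LocalLineCount {

    static long countLines(Path file) throws IOException {
        long count = 0;
        // Read line by line so that a large file is never held in memory.
        try (BufferedReader reader =
                 Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java LocalLineCount <local-file>");
            System.exit(2);
        }
        System.out.println("Total Lines\t" + countLines(Paths.get(args[0])));
    }
}
```

Running this on a small extract of the data gives a reference value to compare against the job output for the same extract.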
Solution
Is Hadoop running?
jps
Start Hadoop if it is not already running
start-all.sh
You can also use start-dfs.sh, which starts only the HDFS daemons; if your jobs run on YARN, also run start-yarn.sh
Is Hadoop running now?
jps
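The jps output lists one Java process per line as a process id followed by a name. The sketch below shows one way to read that output; it assumes a single-node Hadoop 2.x setup in which NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager are all expected. The class name JpsCheck and the sample process ids are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: reports which expected Hadoop daemons are absent
// from a captured jps listing.
class JpsCheck {

    // Assumed daemon set for a single-node Hadoop 2.x installation.
    static final List<String> EXPECTED = Arrays.asList(
            "NameNode", "DataNode", "SecondaryNameNode",
            "ResourceManager", "NodeManager");

    static List<String> missingDaemons(String jpsOutput) {
        Set<String> running = new HashSet<>();
        for (String line : jpsOutput.split("\\R")) {
            // Each line has the form "<pid> <process name>".
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2) {
                running.add(parts[1]);
            }
        }
        List<String> missing = new ArrayList<>();
        for (String daemon : EXPECTED) {
            if (!running.contains(daemon)) {
                missing.add(daemon);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up sample in which only the HDFS daemons are running.
        String sample = "2101 NameNode\n2250 DataNode\n2890 Jps\n";
        System.out.println("Missing daemons: " + missingDaemons(sample));
    }
}
```

An empty list means every daemon in the assumed set appears in the listing.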
Check your files on HDFS
hadoop fs -ls /
The dataset is /Stack_Overflow_Tags/final_stack_data.zip. Unzip it first; unzipping takes some time
sudo unzip /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.zip -d /home/hduser/datasets/Stack_Overflow_Tags/
Copy the data into HDFS. The file is large, so this step takes some time
hadoop fs -copyFromLocal /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.txt /stack_overflow_hdfs
Check the data file on HDFS
hadoop fs -ls /
Check your current working directory
pwd
Go to the Hadoop bin directory
cd /usr/local/hadoop/bin/
It is important to make $hadoop/bin your present working directory (PWD)
Open an editor on a new file named LineCount.java
sudo gedit LineCount.java
Copy the Java code below, paste it into the file, and save it
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class LineCount {
    // Mapper: emits ("Total Lines", 1) once for every input line.
    public static class LineCntMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text keyEmit = new Text("Total Lines");
        private final static IntWritable one = new IntWritable(1);

        // Exceptions are propagated so that Hadoop marks the task as failed,
        // instead of exiting the JVM with a success code.
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(keyEmit, one);
        }
    }
    // Reducer (also used as the combiner): sums the counts received for a key.
    public static class LineCntReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "line count2");
        job.setJarByClass(LineCount.class);
        job.setMapperClass(LineCntMapper.class);
        // Summing is associative, so the reducer can also serve as the combiner.
        job.setCombinerClass(LineCntReducer.class);
        job.setReducerClass(LineCntReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the HDFS input path, args[1] the HDFS output directory.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Exit with 0 on success and 1 on failure.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
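The data flow of the job above can be traced without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class LineCountSimulation, the two hard-coded input splits and the helper names are assumptions made for illustration. It mimics the mapper emitting a 1 for every line under the single key "Total Lines", the combiner summing within each split, and the reducer summing the partial counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the Hadoop job: same logic, no cluster needed.
class LineCountSimulation {

    // Map phase: one count of 1 per input line (the key is always "Total Lines").
    static List<Integer> map(List<String> split) {
        List<Integer> ones = new ArrayList<>();
        for (String ignoredLine : split) {
            ones.add(1);
        }
        return ones;
    }

    // The combiner and the reducer share this logic: sum the values of the key.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    static int run(List<List<String>> splits) {
        List<Integer> partials = new ArrayList<>();
        for (List<String> split : splits) {
            // Combiner step: collapse each split's ones into one partial count.
            partials.add(sum(map(split)));
        }
        // Reducer step: add up the partial counts from all splits.
        return sum(partials);
    }

    public static void main(String[] args) {
        // Two made-up splits standing in for blocks of the input file.
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("q1", "q2", "q3"),
                Arrays.asList("q4", "q5"));
        System.out.println("Total Lines\t" + run(splits));
    }
}
```

Reusing the reducer as the combiner is safe here because addition is associative and commutative, so the total is the same whether Hadoop applies the combiner to a split once, several times, or not at all.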
To compile the program, use the command below (on JDK 8, HADOOP_CLASSPATH must include ${JAVA_HOME}/lib/tools.jar for this to work)
hadoop com.sun.tools.javac.Main LineCount.java
Create a jar file named lc.jar
jar cf lc.jar LineCount*.class
Run the line count program. The output is written to the HDFS directory given as the last argument, which must not already exist
hadoop jar lc.jar LineCount /stack_overflow_hdfs /usr/stack_overflow_out
Browse the output in the NameNode web UI at http://localhost:50070/explorer.html#/
Have a look at the output
hadoop fs -cat /usr/stack_overflow_out/part-r-00000
The output can also be saved to a local text file (>> appends if the file already exists)
hadoop fs -cat /usr/stack_overflow_out/part-r-00000 >> /home/hduser/Output/stack_overflow_out.txt
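Hadoop's default TextOutputFormat writes each record as the key, a tab character, and the value, so the saved file is expected to hold a single line of the form "Total Lines", tab, count. The sketch below shows one way to read the count back from such a line; the class name OutputLineParser and the sample value are made up for illustration and are not results from the lab.

```java
// Hypothetical helper: extracts the count from one line of the job output.
class OutputLineParser {

    static long parseCount(String line) {
        // The key itself contains a space, so split on the last tab instead.
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("No tab separator in: " + line);
        }
        return Long.parseLong(line.substring(tab + 1).trim());
    }

    public static void main(String[] args) {
        // Made-up sample line, not a real result from the dataset.
        String sample = "Total Lines\t12345";
        System.out.println(parseCount(sample));
    }
}
```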