301.2.3-Map Reduce Code for line count

LAB: Map Reduce for Line Count

Dataset: Stack_Overflow_Tags/final_stack_data.zip
Unzip the file and see the size of the data.
The dataset contains Stack Overflow questions. The goal is to find the total number of questions in the file (the number of rows).
Move the data to HDFS.
Write a line count program to count the number of lines in the file.
Save the final output to a text file.

Solution

Check whether Hadoop is running

jps

Start Hadoop if it is not already running

start-all.sh

You can also use start-dfs.sh, but it starts only the HDFS daemons. If your jobs run on YARN, also run start-yarn.sh.

Check again whether Hadoop is running

jps

Check your files on HDFS

hadoop fs -ls /

The dataset is /Stack_Overflow_Tags/final_stack_data.zip. Unzip it first. Unzipping takes some time.

sudo unzip /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.zip -d /home/hduser/datasets/Stack_Overflow_Tags/
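The task list also asks for the size of the data. A minimal Java sketch for that, assuming the unzipped file is final_stack_data.txt in the same directory: it prints the file size and a local line count that can later be compared with the MapReduce result. The class name LocalFileStats and its method names are made up for this illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper, not part of the lab: local size and line count of a file.
class LocalFileStats {

    // Size of the file in bytes.
    static long sizeInBytes(Path file) throws IOException {
        return Files.size(file);
    }

    // Number of lines, streamed so the whole file is never held in memory.
    // ISO-8859-1 maps every byte to a character, so malformed input cannot
    // abort the count; line terminators are the same bytes in either charset.
    static long countLines(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.ISO_8859_1)) {
            return lines.count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Default path is an assumption based on the unzip command above.
        Path file = Paths.get(args.length > 0
                ? args[0]
                : "/home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.txt");
        System.out.println("Size (bytes): " + sizeInBytes(file));
        System.out.println("Lines: " + countLines(file));
    }
}
```

On a very large file the local count can take a while, but it gives a number that the MapReduce output should match.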

Copy the data to HDFS. The file is large, so this step takes some time.

hadoop fs -copyFromLocal /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.txt /stack_overflow_hdfs

Check the data file on HDFS

hadoop fs -ls /

Check your current working directory

pwd

Go to the Hadoop bin directory

cd /usr/local/hadoop/bin/

It is important to make $hadoop/bin your present working directory (PWD).

Open a new file named LineCount.java in an editor

sudo gedit LineCount.java

Copy the Java code below, paste it into the file, and save it

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineCount {

    // Mapper: emits ("Total Lines", 1) for every input line, ignoring its content.
    public static class LineCntMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Text keyEmit = new Text("Total Lines");
        private static final IntWritable one = new IntWritable(1);

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Let exceptions propagate so the framework marks the task as failed
            // instead of exiting the JVM with a success code.
            context.write(keyEmit, one);
        }
    }

    // Reducer (also used as the combiner): sums the counts for the single key.
    public static class LineCntReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "line count2");
        job.setJarByClass(LineCount.class);
        job.setMapperClass(LineCntMapper.class);
        // Summation is associative, so the reducer can also run as the combiner.
        job.setCombinerClass(LineCntReducer.class);
        job.setReducerClass(LineCntReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0] is the input path on HDFS, args[1] is the output directory.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
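To see why the same class can serve as both combiner and reducer in this job, here is a rough plain-Java simulation of the data flow, with no Hadoop classes involved. It is only a sketch: the class LineCountSimulation, its methods, and the two sample "splits" are invented for illustration and do not reflect how Hadoop actually schedules or shuffles data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the job: map emits 1 per line, reduce sums.
class LineCountSimulation {

    // Map step: emit a 1 for every input line, ignoring the line's content.
    static List<Integer> map(List<String> split) {
        List<Integer> ones = new ArrayList<>();
        for (String ignored : split) {
            ones.add(1);
        }
        return ones;
    }

    // Reduce step (also used as the combiner): sum the values for the key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run map + combine on each split, then reduce the partial sums.
    static int countLines(List<List<String>> splits) {
        List<Integer> partialSums = new ArrayList<>();
        for (List<String> split : splits) {
            partialSums.add(reduce(map(split))); // combiner output for one split
        }
        return reduce(partialSums);              // final reducer
    }

    public static void main(String[] args) {
        // Two made-up input splits with 3 and 2 lines.
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("q1", "q2", "q3"),
                Arrays.asList("q4", "q5"));
        // Prints the key, a tab, and 5.
        System.out.println("Total Lines\t" + countLines(splits));
    }
}
```

Because addition gives the same total whether the ones are summed all at once or as per-split partial sums, running the reducer as a combiner only reduces the data sent between tasks without changing the result.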

To compile the program, use the command below. It assumes HADOOP_CLASSPATH includes the JDK's tools.jar (for example ${JAVA_HOME}/lib/tools.jar).

hadoop com.sun.tools.javac.Main LineCount.java

Create a jar file named lc.jar

jar cf lc.jar LineCount*.class

Run the line count program. The first path is the input on HDFS and the second is the output directory, which must not already exist.

hadoop jar lc.jar LineCount /stack_overflow_hdfs /usr/stack_overflow_out

Check the output in the HDFS file browser at http://localhost:50070/explorer.html#/

Have a look at the output

hadoop fs -cat /usr/stack_overflow_out/part-r-00000

We can save the output to a text file. Note that >> appends to the file if it already exists.

hadoop fs -cat /usr/stack_overflow_out/part-r-00000 >> /home/hduser/Output/stack_overflow_out.txt
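If the count is needed as a number in another program, the saved file can be parsed. The sketch below assumes the default text output layout of one record per line, with key and value separated by a tab. The class ReadLineCount and its methods are hypothetical names for this example. It reads the last non-empty record, since the >> redirect appends a new record each time the command is repeated.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, not part of the lab: reads the count from the saved output.
class ReadLineCount {

    // Parse one "key<TAB>value" record and return the value as a number.
    static long parseCount(String record) {
        int tab = record.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("No tab separator in: " + record);
        }
        return Long.parseLong(record.substring(tab + 1).trim());
    }

    // Return the count from the last non-empty line of the saved file.
    static long readCount(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        for (int i = lines.size() - 1; i >= 0; i--) {
            if (!lines.get(i).trim().isEmpty()) {
                return parseCount(lines.get(i));
            }
        }
        throw new IllegalStateException("No records in " + file);
    }

    public static void main(String[] args) throws IOException {
        // Default path taken from the redirect command above.
        Path file = Paths.get(args.length > 0
                ? args[0]
                : "/home/hduser/Output/stack_overflow_out.txt");
        System.out.println("Total lines: " + readCount(file));
    }
}
```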

22nd March 2018