
301.2.3 - MapReduce Code for Line Count

 

LAB: Map Reduce for Line Count

  • Dataset: Stack_Overflow_Tags/final_stack_data.zip
  • Unzip the file and see the size of the data.
  • The dataset contains Stack Overflow questions. The goal is to find the total number of questions in the file (the number of rows).
  • Move the data to hdfs.
  • Write a line count program to count the number of rows in the file
  • Take the final output inside a text file
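Before building the MapReduce version, the core idea can be sketched in plain Java: the mapper emits a 1 for every line, and the total line count is just the sum of those 1s. A minimal local sketch (the class name and the local-file usage here are illustrative only, not part of the lab):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class LocalLineCount {
    // Same logic the MapReduce job will use: emit 1 per line, sum the 1s.
    static long countLines(List<String> lines) {
        long sum = 0;
        for (String line : lines) {
            sum += 1; // mapper side: one count per row
        }
        return sum;   // reducer side: the summed total
    }

    public static void main(String[] args) throws IOException {
        // args[0] is any local text file; on the cluster this role is
        // played by final_stack_data.txt sitting on HDFS.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        System.out.println("Total Lines\t" + countLines(lines));
    }
}
```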

Solution

Is Hadoop started?

jps

Start Hadoop if it is not already running

start-all.sh

You can also run start-dfs.sh and start-yarn.sh separately

Is Hadoop started now?

jps

Check your files on HDFS

hadoop fs -ls /

The dataset is /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.zip. Unzip the data first; unzipping takes some time.

sudo unzip /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.zip -d /home/hduser/datasets/Stack_Overflow_Tags/

Bring the data onto HDFS. The file size is huge, so this step takes some time.

hadoop fs -copyFromLocal /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.txt /stack_overflow_hdfs

Check the data file on HDFS

hadoop fs -ls /

Go back to your home directory

cd

Go to the Hadoop bin directory

cd /usr/local/hadoop/bin/

It is important to make your PWD (present working directory) $hadoop/bin

Open an editor with a file name LineCount.java

sudo gedit LineCount.java

Copy the Java code below, paste it into the file, and save it

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LineCount{
    public static class LineCntMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    Text keyEmit = new Text("Total Lines");
    private final static IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit one count per input line, all under the same key.
        context.write(keyEmit, one);
    }
}

    public static class LineCntReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s emitted by the mappers.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

    public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "line count");
    job.setJarByClass(LineCount.class);
    job.setMapperClass(LineCntMapper.class);
    job.setCombinerClass(LineCntReducer.class);
    job.setReducerClass(LineCntReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
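One detail in main() worth noting: the reducer is reused as the combiner (job.setCombinerClass). That is safe here because summation is associative, so partial sums computed on each mapper node combine into the same final total. A quick plain-Java check of that property (the class and method names below are illustrative, not part of the job):

```java
import java.util.Arrays;
import java.util.List;

public class CombinerCheck {
    // The reducer logic from LineCount: sum the incoming counts.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Six lines split across two mappers, each combined locally first...
        int partial1 = reduce(Arrays.asList(1, 1, 1, 1));
        int partial2 = reduce(Arrays.asList(1, 1));
        int combined = reduce(Arrays.asList(partial1, partial2));
        // ...must equal reducing all six counts in one pass.
        int direct = reduce(Arrays.asList(1, 1, 1, 1, 1, 1));
        System.out.println(combined + " == " + direct);
    }
}
```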

To compile this program, use the command below (if it fails with a class-not-found error, first run export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar)

hadoop com.sun.tools.javac.Main LineCount.java

Create the jar file, named lc.jar

jar cf lc.jar LineCount*.class

Run the LineCount program; the output will be written to the HDFS directory given as the second argument (/usr/stack_overflow_out)

hadoop jar lc.jar LineCount /stack_overflow_hdfs /usr/stack_overflow_out

Check the output here http://localhost:50070/explorer.html#/

Have a look at the output

hadoop fs -cat /usr/stack_overflow_out/part-r-00000

We can redirect the output to a local text file

hadoop fs -cat /usr/stack_overflow_out/part-r-00000 >> /home/hduser/Output/stack_overflow_out.txt

 

22nd March 2018

Statinfer
