Statinfer

301.1.8-Lab of Hadoop

LAB: Hadoop Sandbox

LAB: Demo Hadoop Sandbox

  • Install Oracle VM VirtualBox or VMware Player
  • Load the Hadoop VMware image

LAB: Starting Hadoop

  • Go to the home folder and list its contents
        cd ~
        ls

  • Start Hadoop
        start-all.sh

  • If the above command doesn’t work, use the following commands instead:
        start-dfs.sh
        start-yarn.sh

  • Check name node and data nodes
        jps
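A quick way to confirm the cluster came up is to check jps output for the standard daemon names. A minimal sketch, assuming a pseudo-distributed Hadoop 2.x setup (process IDs will differ from machine to machine):

```shell
# Check that each expected Hadoop daemon appears in the jps listing.
# These five names are standard for a single-node Hadoop 2.x install.
for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    if jps | grep -q "$daemon"; then
        echo "$daemon is running"
    else
        echo "$daemon is NOT running"
    fi
done
```

If any daemon is missing, re-run start-dfs.sh / start-yarn.sh and check the logs under $HADOOP_HOME/logs.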

LAB: HDFS Files

  • The files on HDFS
        hadoop fs -ls /

  • Check in the browser
        http://localhost:50070
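The same information the web UI shows is also exposed over HTTP, so the NameNode can be checked from the shell as well. A sketch, assuming the default web port 50070:

```shell
# The NameNode serves its metrics as JSON via the JMX servlet;
# this query returns filesystem state (capacity, live DataNodes, etc.).
curl -s "http://localhost:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"
```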

LAB: Copy from Local to HDFS

  • Move files to HDFS
        hadoop fs -copyFromLocal /home/hduser/datasets/Stock_Price_Data/stock_price.txt /test_on_dfs
  • The files on HDFS
        hadoop fs -ls /

  • Delete files from HDFS (-rmr is deprecated in Hadoop 2.x; -rm -r is the current form)
        hadoop fs -rm -r /test_on_dfs
  • The files on HDFS now
        hadoop fs -ls /

LAB: Move big data file to HDFS

  • Since this is pseudo-distributed (single-node) mode, we are still working on a computer with limited resources.
  • Thus, let us take a medium-sized dataset.
  • The dataset that we are going to work with is the Stack Overflow Tags data, which is already provided.
    hadoop fs -copyFromLocal  /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.txt /stack_data
  • This command copies the local file into HDFS.
  • It takes the path of the file on the local system, followed by the destination file name on HDFS, i.e., stack_data, onto which the local file is copied.
  • This command copies a roughly 7 GB file onto HDFS.
  • The copy takes some time, because the whole file has to be cut into smaller pieces, pointers to them have to be recorded, and 3 replicas of each 128 MB data chunk have to be created. We cannot access these chunks directly as ordinary files.
  • From the NameNode web UI, however, we can inspect each block of data, select the block that we need, and download it.
  • We can check the status of the copy from the browser directory listing. Every time the page is refreshed, the count of blocks copied onto HDFS is updated.
  • The browser also shows the number of blocks, the number of replicas created for each block, and the total size of the data. In this case there are 55 blocks in total.
    hadoop fs -ls /
  • This command lists all the files on HDFS. We can see that there is a file called “stack_data”.
  • As the data is stored as 55 blocks or chunks, we can apply MapReduce on it to compute the required results, such as counting the number of lines in the data.
  • Counting the lines over the whole 7 GB file in one pass would be slow, so applying MapReduce to count lines on the smaller data pieces in parallel gives faster results.
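The line count described above can be sketched as a Hadoop Streaming job: each mapper counts the lines in its own input split with wc -l, and a single reducer sums the partial counts with awk. The streaming jar path and the output directory /stack_line_count are assumptions; adjust them for your installation.

```shell
# Optional: first inspect how /stack_data is split into blocks and replicas.
hdfs fsck /stack_data -files -blocks

# Line count as a MapReduce job via Hadoop Streaming:
# mappers emit per-split line counts, the reducer sums them.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /stack_data \
    -output /stack_line_count \
    -mapper 'wc -l' \
    -reducer "awk '{s+=\$1} END {print s}'"

# Read the result from the job's output directory.
hadoop fs -cat /stack_line_count/part-00000
```

Because each mapper works on one block at a time, the 55 blocks are counted in parallel rather than scanning the whole 7 GB file sequentially.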

 

15th May 2017
