
• Install Oracle VM VirtualBox or VMware Player

• Go to the home folder
        ls
        cd ..

• Start all the Hadoop daemons
        start-all.sh

• If the above command doesn’t work (start-all.sh is deprecated in recent Hadoop versions), use the following commands instead:
        start-dfs.sh
        start-yarn.sh

• Check that the NameNode and DataNode daemons are running
        jps

### LAB: HDFS Files

• The files on HDFS
        hadoop fs -ls /

• Check in the browser (the NameNode web UI, on port 50070 in Hadoop 2.x)
http://localhost:50070

### LAB: Copy from Local to HDFS

• Copy files from the local file system to HDFS
        hadoop fs -copyFromLocal /home/hduser/datasets/Stock_Price_Data/stock_price.txt /test_on_dfs
• The files on HDFS
        hadoop fs -ls /

• Delete files from HDFS (-rmr is deprecated in newer Hadoop versions; hadoop fs -rm -r is the current equivalent)
        hadoop fs -rmr /test_on_dfs
• The files on HDFS now
        hadoop fs -ls /

### LAB: Move big data file to HDFS

• Since this is pseudo-distributed cluster mode, we are still working on a computer with limited resources.
• Thus, let us take a medium-sized dataset.
• The dataset that we are going to work with is the Stack Overflow Tags data, which is already provided.
    hadoop fs -copyFromLocal  /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.txt /stack_data
• This command copies the local file into HDFS.
• The command takes the path of the file on the local system, followed by the target name on HDFS, i.e., stack_data, to which the local file is copied.
• This command moves the 7 GB file onto HDFS.
• The copy takes some time, because the whole file must be cut into smaller pieces of 128 MB, pointers (metadata) for each piece must be recorded, and 3 replicas of each data chunk are created; the blocks cannot be accessed directly as ordinary files.
• Through the NameNode web UI, however, we can inspect each block of data, select the block we need, and download it.
• We can check the status of copying from the browser directory; every time the page is refreshed, the count of copied data is updated, so we can see the number of blocks already written to HDFS.
• From the browser we can also check the number of blocks, the number of replicas created for each block, and the total size of the data. There are 55 blocks of data in total.
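The block counting above can be sketched locally with the split command. This is a toy demo, not real HDFS: 1 KB "blocks" stand in for the 128 MB HDFS default, and the 3500-byte file is invented for illustration.

```shell
# Toy local sketch of HDFS block splitting -- NOT real HDFS.
# Assumption: 1 KB "blocks" stand in for the 128 MB HDFS default.
cd "$(mktemp -d)"
head -c 3500 /dev/zero > bigfile      # a 3500-byte stand-in for the big file
split -b 1024 -d bigfile block_       # cut into fixed-size 1024-byte chunks
ls block_* | wc -l                    # 4 chunks: ceil(3500 / 1024)
wc -c < block_03                      # the last chunk holds the 428-byte remainder
```

The same ceiling arithmetic explains the block count reported in the browser: a file is stored as ceil(file size / block size) blocks, with the final block holding whatever remains.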
    hadoop fs -ls /
• This command shows all the files on HDFS. We can see that there is a file called “stack_data”.
• As there are 55 blocks or chunks, we can apply MapReduce on the file and compute the required results, such as counting the number of lines in the data.
• Counting the lines over the whole 7 GB of data in one pass would be very slow, so applying MapReduce to find the line count on the smaller data pieces in parallel gives faster results.
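The split-then-count idea can be sketched with plain shell. This is a local toy, not Hadoop Streaming; the file name, line counts, and chunk size are invented for the demo.

```shell
# Toy map/reduce-style line count -- local shell, not actual MapReduce.
cd "$(mktemp -d)"
seq 1 1000 > data.txt                       # stand-in dataset with 1000 lines
split -l 300 -d data.txt part_              # "blocks" of 300 lines each
# map: count lines per chunk; reduce: sum the partial counts
for p in part_*; do wc -l < "$p"; done | awk '{s+=$1} END {print s}'
```

Each chunk is counted independently (the map step), and only the small partial counts are combined (the reduce step), which is exactly why the work parallelizes well across blocks.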

22nd March 2018


Statinfer Software Solutions LLP
