Statinfer

301.1.8-Lab of Hadoop

LAB: Hadoop Sandbox

LAB: Demo Hadoop Sandbox

  • Install Oracle VM VirtualBox or VMware Player
  • Load the Hadoop VMware image

LAB: Starting Hadoop

  • Go to the home folder and list its contents
        cd ~
        ls

  • Start Hadoop
        start-all.sh

  • If the above command doesn’t work, use the following commands instead:
        start-dfs.sh
        start-yarn.sh

  • Check name node and data nodes
        jps
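A quick way to confirm the cluster came up is to check jps output for the standard daemon names. A minimal sketch, assuming a pseudo-distributed Hadoop 2.x setup (process IDs will differ from machine to machine):

```shell
# Check that each expected Hadoop daemon appears in the jps listing.
# These five names are standard for a single-node Hadoop 2.x install.
for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    if jps | grep -q "$daemon"; then
        echo "$daemon is running"
    else
        echo "$daemon is NOT running"
    fi
done
```

If any daemon is missing, re-run start-dfs.sh / start-yarn.sh and check the logs under $HADOOP_HOME/logs.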

LAB: HDFS Files

  • The files on HDFS
        hadoop fs -ls /

  • Check in the browser
        http://localhost:50070
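The same information the web UI shows is also exposed over HTTP, so the NameNode can be checked from the shell as well. A sketch, assuming the default web port 50070:

```shell
# The NameNode serves its metrics as JSON via the JMX servlet;
# this query returns filesystem state (capacity, live DataNodes, etc.).
curl -s "http://localhost:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState"
```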

LAB: Copy from Local to HDFS

  • Move files to HDFS
        hadoop fs -copyFromLocal /home/hduser/datasets/Stock_Price_Data/stock_price.txt /test_on_dfs
  • The files on HDFS
        hadoop fs -ls /

  • Delete files from HDFS (-rmr is deprecated in Hadoop 2.x; -rm -r is the current form)
        hadoop fs -rm -r /test_on_dfs
  • The files on HDFS now
        hadoop fs -ls /

LAB: Move big data file to HDFS

  • Since this is pseudo-distributed (single-node) mode, we are still working on a computer with limited resources.
  • Thus, let us take a medium-sized dataset.
  • The dataset that we are going to work with is the Stack Overflow Tags data, which is already provided.
    hadoop fs -copyFromLocal  /home/hduser/datasets/Stack_Overflow_Tags/final_stack_data.txt /stack_data
  • This command copies the local file into HDFS.
  • It takes the path of the file on the local system, followed by the destination file name on HDFS, i.e., stack_data, onto which the local file is copied.
  • This command copies a roughly 7 GB file onto HDFS.
  • The copy takes some time, because the whole file has to be cut into smaller pieces, pointers to them have to be recorded, and 3 replicas of each 128 MB data chunk have to be created. We cannot access these chunks directly as ordinary files.
  • From the NameNode web UI, however, we can inspect each block of data, select the block that we need, and download it.
  • We can check the status of the copy from the browser directory listing. Every time the page is refreshed, the count of blocks copied onto HDFS is updated.
  • The browser also shows the number of blocks, the number of replicas created for each block, and the total size of the data. In this case there are 55 blocks in total.
    hadoop fs -ls /
  • This command lists all the files on HDFS. We can see that there is a file called “stack_data”.
  • As the data is stored as 55 blocks or chunks, we can apply MapReduce on it to compute the required results, such as counting the number of lines in the data.
  • Counting the lines over the whole 7 GB file in one pass would be slow, so applying MapReduce to count lines on the smaller data pieces in parallel gives faster results.
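The line count described above can be sketched as a Hadoop Streaming job: each mapper counts the lines in its own input split with wc -l, and a single reducer sums the partial counts with awk. The streaming jar path and the output directory /stack_line_count are assumptions; adjust them for your installation.

```shell
# Optional: first inspect how /stack_data is split into blocks and replicas.
hdfs fsck /stack_data -files -blocks

# Line count as a MapReduce job via Hadoop Streaming:
# mappers emit per-split line counts, the reducer sums them.
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input /stack_data \
    -output /stack_line_count \
    -mapper 'wc -l' \
    -reducer "awk '{s+=\$1} END {print s}'"

# Read the result from the job's output directory.
hadoop fs -cat /stack_line_count/part-00000
```

Because each mapper works on one block at a time, the 55 blocks are counted in parallel rather than scanning the whole 7 GB file sequentially.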

 

15th May 2017
