Thursday, February 28, 2013

Big data: The next generation of innovation, competition, and productivity


What is big data?

Big data usually refers to data sets whose size is beyond the ability of commonly used software tools to capture, curate, manage, and process within a tolerable elapsed time. What counts as "big" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes in a single data set. To cope with this, a new generation of "big data" tools has emerged for making sense of very large quantities of data, the Apache Hadoop platform being a prominent example.


Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.

Big data spans three dimensions: Volume, Velocity and Variety.


Volume:  
Enterprises are awash with ever-growing data of all types, easily amassing terabytes—even petabytes—of information.
  • Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
  • Convert 350 billion annual meter readings to better predict power consumption
Velocity: 
 Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
  • Scrutinize 5 million trade events created each day to identify potential fraud
  • Analyze 500 million daily call detail records in real-time to predict customer churn faster
Variety: 
Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
  • Monitor hundreds of live video feeds from surveillance cameras to target points of interest
  • Exploit the 80% data growth in images, video and documents to improve customer satisfaction
 

Big data is not only a matter of size; it is an opportunity to find insights in new and emerging types of data and content, to make your business more agile, and to answer questions that were previously considered beyond your reach. Until now, there was no practical way to harvest this opportunity.

Sunday, February 24, 2013

Starting with Hadoop: Installation

Purpose
This document describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes.
Prerequisites
1.      Make sure the required software is installed on every node in your cluster (JDK 1.6 or later).
2.      Download the Hadoop software.
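
For example, on each node you can check the Java version and fetch the Hadoop 1.0.4 tarball along these lines (the mirror URL shown here is just one option; any Apache archive mirror that carries the 1.0.4 release will do):

        $ java -version          # should report 1.6 or later
        $ wget https://archive.apache.org/dist/hadoop/core/hadoop-1.0.4/hadoop-1.0.4.tar.gz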

Installation Process
1.      Installing JAVA
     a.      Extract the JDK tarball
              tar -xvf jdk-7u2-linux-x64.tar.gz

     b.      Move the extracted JDK to /usr/lib/jvm (the extracted directory name depends on the JDK release you downloaded; the target path /usr/lib/jvm/jdk1.7.0_34 is referenced again in the steps below, so keep them consistent)
              sudo mkdir -p /usr/lib/jvm/jdk1.7.0_34/
              sudo mv jdk1.7.0_03/* /usr/lib/jvm/jdk1.7.0_34/

     c.       Update JAVA alternatives
              sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0_34/bin/java" 1
              sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0_34/bin/javac" 1
              sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.7.0_34/bin/javaws" 1
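
At this point it is worth a quick sanity check that the new JDK is the one being used; assuming the paths above, something like:

              java -version
              sudo update-alternatives --config java      # pick the /usr/lib/jvm/jdk1.7.0_34 entry if it is not already selected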

2.      Installing Hadoop
     a.      Extract the Hadoop tarball inside /home/<user>/hadoop
              tar xzf hadoop-1.0.4.tar.gz

3.      Set JAVA_HOME, HADOOP_COMMON_HOME and PATH
         export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_34
         export HADOOP_COMMON_HOME="/home/<user>/hadoop/hadoop-1.0.4"
         export PATH=$HADOOP_COMMON_HOME/bin/:$PATH
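
Note that these exports only last for the current shell. To make them permanent, you can append the same lines to ~/.bashrc (assuming a bash shell) and then confirm that the hadoop binary is on the PATH:

         echo 'export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_34' >> ~/.bashrc
         echo 'export HADOOP_COMMON_HOME=/home/<user>/hadoop/hadoop-1.0.4' >> ~/.bashrc
         echo 'export PATH=$HADOOP_COMMON_HOME/bin:$PATH' >> ~/.bashrc
         source ~/.bashrc
         hadoop version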

Configure Hadoop
 1.      conf/hadoop-env.sh
          Add or change these lines to specify JAVA_HOME and the directory where the logs are stored:
          export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_34
          export HADOOP_LOG_DIR=/home/<user>/hadoop/hadoop-1.0.4/hadoop_logs
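
The start scripts will normally create this log directory themselves, but creating it up front and making sure the Hadoop user can write to it avoids surprises later:

          mkdir -p /home/<user>/hadoop/hadoop-1.0.4/hadoop_logs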

2.      conf/core-site.xml
         Here the NameNode runs on 172.16.16.60; fs.default.name is the URI that clients and DataNodes use to reach it.
   
       <configuration>
            <property>
                <name>fs.default.name</name>
                <value>hdfs://172.16.16.60:9000</value>
            </property>
       </configuration>


3.      conf/hdfs-site.xml
     
        <configuration>
                <property>
                          <name>dfs.replication</name>
                          <value>3</value>
                </property>
                <property>
                          <name>dfs.name.dir</name>
                          <value>/lhome/hadoop/data/dfs/name/</value>
                </property>
                <property>
                          <name>dfs.data.dir</name>
                          <value>/lhome/hadoop/data/dfs/data/</value>
                </property>
        </configuration>

dfs.replication is the number of replicas kept for each block. dfs.name.dir is the path on the NameNode's local filesystem where it persistently stores the namespace and transaction logs. dfs.data.dir is a comma-separated list of paths on a DataNode's local filesystem where it stores its blocks.
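
The daemons need to be able to create and write these local directories (and mapred.local.dir below); since /lhome is not a standard location, a rough sketch of preparing them ahead of time on each node would be:

        sudo mkdir -p /lhome/hadoop/data/dfs/name /lhome/hadoop/data/dfs/data
        sudo chown -R <user>:<user> /lhome/hadoop/data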

4.      conf/mapred-site.xml
         Here the JobTracker runs on 172.16.16.60
     
         <configuration>
                <property>
                          <name>mapred.job.tracker</name>
                          <value>172.16.16.60:9001</value>
               </property>
               <property>
                          <name>mapred.system.dir</name>
                          <value>/hadoop/data/mapred/system/</value>
               </property>
               <property>
                         <name>mapred.local.dir</name>
                         <value>/lhome/hadoop/data/mapred/local/</value>
               </property>
       </configuration>

mapred.job.tracker is the host (or IP) and port of the JobTracker. mapred.system.dir is the path on HDFS where the Map/Reduce framework stores system files. mapred.local.dir is a comma-separated list of paths on the local filesystem where temporary MapReduce data is written.

5.      conf/masters
         Delete localhost and add the name (or IP) of the NameNode, each on its own line.
         For Example:
         <IP of Namenode>

6.      conf/slaves
         Delete localhost and add the names (or IPs) of all the TaskTracker nodes, each on its own line.
         For Example:
         <IP of Slave 1>
         <IP of Slave 2>
          ….
          ….
        <IP of Slave n>

7.      Configuring SSH
         In fully-distributed mode the daemons are started over SSH: the start scripts read the set of hosts in the cluster (defined by the slaves file), SSH to each host, and start the daemon process there. So we need SSH installed, and we need to be able to SSH to localhost (and later to every slave) and log in without having to enter a password.

       $ sudo apt-get install ssh

Then to enable password-less login, generate a new SSH key with an empty passphrase:

        $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

        $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Test this with:

         $ ssh localhost

You should be logged in without having to type a password.
Copy the master node's public key (the contents of /home/<user>/.ssh/id_rsa.pub) into ~/.ssh/authorized_keys on each slave node as well, so that the master has password-less access to the other nodes.
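
One convenient way to do this, assuming each slave is reachable over SSH and you can still log in with a password once, is ssh-copy-id, run from the master for every slave:

        $ ssh-copy-id -i ~/.ssh/id_rsa.pub <user>@<IP of Slave 1>
        $ ssh-copy-id -i ~/.ssh/id_rsa.pub <user>@<IP of Slave 2>

Afterwards, ssh <IP of Slave 1> from the master should log in without a password, just as for localhost.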

8.      Duplicate Hadoop configuration files to all nodes
Copy the configuration files under the conf directory to all nodes, so every node runs with the same configuration. With that, the Hadoop software is installed and configured on all nodes. Now let's have some fun with Hadoop.
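
For example, assuming Hadoop is unpacked at the same path on every node, a small loop over the slaves file with scp does the job when run from the Hadoop directory on the master:

        $ cd /home/<user>/hadoop/hadoop-1.0.4
        $ for host in $(cat conf/slaves); do scp conf/* <user>@$host:/home/<user>/hadoop/hadoop-1.0.4/conf/; done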

Hadoop Startup
To start a Hadoop cluster you will need to start both the HDFS and the Map/Reduce daemons. First, format a new distributed file system (run this once, on the NameNode; reformatting an existing file system erases its data):
       
            $ bin/hadoop namenode -format

Start the HDFS with the following command, run on the designated NameNode:                                            
       
            $ bin/start-dfs.sh

The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and
starts the DataNode daemon on all the listed slaves.
Start Map-Reduce with the following command, run on the designated JobTracker:    
       
            $ bin/start-mapred.sh

The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.
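
To confirm that everything came up, you can run jps (it ships with the JDK and lists the running Java processes) on each node:

             $ jps

On the master you should typically see NameNode and JobTracker (and SecondaryNameNode), and on each slave DataNode and TaskTracker. The web interfaces are also useful: by default in Hadoop 1.x the NameNode UI listens on port 50070 and the JobTracker UI on port 50030, e.g. http://172.16.16.60:50070/ and http://172.16.16.60:50030/.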

Hadoop Shutdown
Stop HDFS with the following command, run on the designated NameNode:
 
             $ bin/stop-dfs.sh

The bin/stop-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.
Stop Map/Reduce with the following command, run on the designated JobTracker:

            $ bin/stop-mapred.sh

The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.
