Thursday, August 27, 2015

What is Hadoop anyway?

Hadoop will change the way businesses think about storage, processing and the value of ‘big’ data.

Apache Hadoop is an open source project governed by the Apache Software Foundation (ASF). Hadoop enables the user to extract valuable business insight from massive amounts of structured and unstructured data quickly and cost-effectively through three main functions:

Processing – MapReduce. Computation in Hadoop is based on the MapReduce paradigm, which distributes tasks across a cluster of coordinated “nodes” (a minimal example follows this list).

Storage – HDFS. Storage is accomplished with the Hadoop Distributed File System (HDFS) – a reliable file system that allows large volumes of data to be stored and accessed across large clusters of commodity servers.

Resource Management – YARN. Introduced in Hadoop 2.0, YARN performs resource management, further increasing efficiency, and extends MapReduce's capabilities by supporting non-MapReduce workloads such as graph, streaming, in-memory, and MPI processing.
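
To make the MapReduce idea concrete, below is a minimal word-count sketch written for Hadoop Streaming, which lets the map and reduce steps be ordinary Python scripts that read stdin and write stdout. The file names, HDFS paths, and the location of the streaming jar are illustrative; the jar's exact path varies by distribution.

# mapper.py -- emit each word with a count of 1
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py -- sum the counts per word; Hadoop sorts map output by key,
# so all lines for a given word arrive consecutively
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    word, count = line.split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

You would copy the input into HDFS and submit the job with something like:

$ hadoop fs -put books.txt /wordcount/input/
$ hadoop jar /path/to/hadoop-streaming.jar \
      -input /wordcount/input -output /wordcount/output \
      -mapper "python mapper.py" -reducer "python reducer.py" \
      -file mapper.py -file reducer.py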

Hadoop is designed to scale up or down without system interruption, and it runs on commodity hardware, making the capture and processing of big data economically viable for the enterprise.

“By 2017, I believe that 50% of the world’s data will be stored and analyzed by Apache Hadoop.” 

Tuesday, April 14, 2015

Data Analysis from MongoDB using R

Most of us are aware of R: a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis. If we empower R with the proper datasets and sources, it is the icing on the cake. So in this post we are going to see how R can be connected to MongoDB, and how one can apply R's power to datasets from MongoDB.

Prerequisite for this demo: you should have the MongoDB daemon (mongod) up and running on a server or on your local machine (standalone mode).

Start your R session, then install and load the "rmongodb" package with the commands below:

        > install.packages("rmongodb")
        > library(rmongodb)

Connect R to the MongoDB instance:
   
        > mongo <- mongo.create(host = "127.0.0.1", name = "", username = "", password = "", db = "test", timeout = 0L)
        > mongo

Printing the connection object gives a response like the one below; with this configuration you are connecting to the mongo instance on 127.0.0.1, to the 'test' database, with an empty username and password. (The connection is assigned to a variable because the later calls need it.)

        [1] 0
        attr(,"mongo")
        <pointer: 0x0884f0a8>
        attr(,"class")
        [1] "mongo"
        attr(,"host")
        [1] "127.0.0.1"
        attr(,"name")
        [1] ""
        attr(,"username")
        [1] ""
        attr(,"password")
        [1] ""
        attr(,"db")
        [1] "test"
        attr(,"timeout")
        [1] 0   

You can check whether R is connected to MongoDB by issuing the command below:

        > mongo.is.connected(mongo)
        [1] TRUE

Now R is connected to the 'test' database on your MongoDB instance, so you can easily fire simple mongo queries and use R's power to compute analytics over MongoDB datasets.

For example, to fetch a single record from Mongo:

        > mongo.find.one(mongo, "test.zip", list())

We can also use filter queries to fetch records from MongoDB into R datasets, for example:

        > mongo.find(mongo, "test.zip", list(pop = list('$gt' = 21L)))
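
Once the matching documents are in R, ordinary R functions apply. Below is a small sketch (assuming test.zip is the standard 'zips' sample collection, where every document carries a numeric pop field) that pulls the matching records into a list with mongo.find.all and summarizes the population values:

        > res <- mongo.find.all(mongo, "test.zip", list(pop = list('$gt' = 21L)))
        > pops <- sapply(res, function(doc) doc$pop)   # extract the 'pop' field
        > mean(pops)                                   # average population
        > summary(pops)                                # quartiles, min, max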

So, this is just the beginning; stay tuned for the next updates.
Thanks for visiting. I'd appreciate your thoughts and comments.

Saturday, March 28, 2015

Data Scraper in Python

Hello All,

Nowadays we know that data is the most valuable thing in the world; whoever has more data has more power, and more command over the market. The market is totally data driven, and I'm sure that in the next couple of decades data will also decide the future. Just kidding :)
But trust me, with the right data we can power our recommendation systems to predict far more accurate results. Data is directly proportional to value.

Since data is important, its collection is just as important. There are a number of data sources available over the net; one just needs to find them and fetch the required information.

So in this post we are going to learn one of the most popular data collection methods: scraping data from the world wide web. Today we are going to write a data scraper in Python (3.4.3).

#Import the required libraries
import re
import urllib.request

#Stock symbol list; you may also read it from a file
symbolslist = ["suzlon.bo", "unitech.bo", "spicejet.bo", "idfc6.bo", "powergrid6.bo"]

for symbol in symbolslist:
    #URL of the quote page to scrape
    urlstr = "https://in.finance.yahoo.com/q?s=" + symbol
    htmfile = urllib.request.urlopen(urlstr)
    htmtext = htmfile.read().decode('utf-8')
    #The last-traded price sits in a span whose id embeds the symbol
    regex = '<span id="yfs_l84_' + symbol + '">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmtext)
    #Print the scraped data
    print("The price of", symbol, "is", price)

This is just a basic program; you can modify and extend it as per your requirements.
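
As one possible extension, here is a sketch that saves the scraped prices to a CSV file instead of only printing them (prices.csv is just an example file name; only the standard library is used):

import csv
import re
import urllib.request

symbolslist = ["suzlon.bo", "unitech.bo", "spicejet.bo"]

with open("prices.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["symbol", "price"])
    for symbol in symbolslist:
        urlstr = "https://in.finance.yahoo.com/q?s=" + symbol
        htmtext = urllib.request.urlopen(urlstr).read().decode('utf-8')
        price = re.findall('<span id="yfs_l84_' + symbol + '">(.+?)</span>', htmtext)
        # re.findall returns a list; keep the first match if present
        writer.writerow([symbol, price[0] if price else ""])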

Thanks for visiting, stay tuned for more!!!

Thursday, March 19, 2015

Apache Storm Setup and Deployment


Please follow the steps below to set up and deploy Apache Storm and Zookeeper:


1. Set up a Zookeeper cluster
2. Download and extract a Storm package to Nimbus and the worker machines
3. Install dependencies on Nimbus and the worker machines
4. Fill in the mandatory configuration in storm.yaml
5. Launch the daemons under supervision using the “storm” script and a supervisor of your choice

(Figure: overall Zookeeper and Storm cluster components)

Set up a Zookeeper cluster

Storm uses Zookeeper for coordinating the cluster. Zookeeper is not used for message passing, so the load Storm places on Zookeeper is quite low. Single node Zookeeper clusters should be sufficient for most cases, but if you want failover or are deploying large Storm clusters you may want larger Zookeeper clusters.
Install the Java JDK. You can use the native packaging system for your system, or download the JDK from:

http://java.sun.com/javase/downloads/index.jsp

Set the Java heap size. This is very important to avoid swapping, which will seriously degrade Zookeeper performance. To determine the correct value, use load tests, and make sure you are well below the usage limit that would cause you to swap. Be conservative - use a maximum heap size of 3GB for a 4GB machine.
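
For example, with the stock ZooKeeper scripts, one common way to cap the heap is a conf/java.env file, which bin/zkServer.sh picks up on startup (the 3GB figure matches the guidance above; adjust it for your machine):

$ echo 'export JVMFLAGS="-Xmx3g"' > conf/java.env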
Install the Zookeeper Server Package. It can be downloaded from:

http://hadoop.apache.org/zookeeper/releases.html

Create a configuration file. This file can be called anything. Use the following settings as a starting point:

tickTime=2000
dataDir=/var/zookeeper/
clientPort=2181
initLimit=5
syncLimit=2
server.1=zoo1:2888:3888
server.2=zoo2:2888:3888
server.3=zoo3:2888:3888

You can find the meanings of these and other configuration settings in the Configuration Parameters section of the ZooKeeper documentation. A word, though, about a few of them here:

Every machine that is part of the Zookeeper ensemble should know about every other machine in the ensemble. You accomplish this with the series of lines of the form server.id=host:port:port. The parameters host and port are straightforward. You attribute the server id to each machine by creating a file named myid, one for each server, which resides in that server's data directory, as specified by the configuration file parameter dataDir.

The myid file consists of a single line containing only the text of that machine's id. So myid of server 1 would contain the text "1" and nothing else. The id must be unique within the ensemble and should have a value between 1 and 255.
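
For example, with dataDir=/var/zookeeper/ as in the configuration above, on the machine listed as server.1 you would create the file like this (repeating on each server with its own id):

$ echo "1" > /var/zookeeper/myid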

If your configuration file is set up, you can start a Zookeeper server:

$ java -cp zookeeper.jar:lib/log4j-1.2.15.jar:conf \
      org.apache.zookeeper.server.quorum.QuorumPeerMain zoo.cfg


QuorumPeerMain starts a Zookeeper server. JMX management beans are also registered, which allows management through a JMX console. The ZooKeeper JMX document contains details on managing ZooKeeper with JMX. See the script bin/zkServer.sh, included in the release, for an example of starting server instances.
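
In day-to-day use, the bundled script is the easiest way to start a server and check that it is serving (run from the ZooKeeper installation directory; it reads conf/zoo.cfg by default):

$ bin/zkServer.sh start
$ bin/zkServer.sh status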

Test your deployment by connecting to the hosts:

In Java, you can run the following command to execute simple operations:

$ java -cp zookeeper.jar:src/java/lib/log4j-1.2.15.jar:conf:src/java/lib/jline-0.9.94.jar \
      org.apache.zookeeper.ZooKeeperMain -server 127.0.0.1:2181

In C, you can compile either the single-threaded client or the multithreaded client in the src/c subdirectory of the Zookeeper sources. This compiles the single-threaded client:

$ make cli_st

And this compiles the multithreaded client:

$ make cli_mt

Running either program gives you a shell in which to execute simple file-system-like operations. To connect to Zookeeper with the multithreaded client, for example, you would run:

$ cli_mt 127.0.0.1:2181

Set up a Storm cluster

Environment
* OS: CentOS 6.X
* CPU Arch: x64
* Middleware: JDK 6 or later (Oracle JDK or OpenJDK)

Installing the Storm package

Download the zip archive from the URL below and unzip it:
https://github.com/acromusashi/storm-installer/wiki/Download

Install the ZeroMQ RPMs. If the install fails with a missing uuid dependency, download uuid-1.6.1-10.el6.x86_64.rpm from
http://zid-lux1.uibk.ac.at/linux/rpm2html/centos/6/os/x86_64/Packages/uuid-1.6.1-10.el6.x86_64.html
and install it first.

# su -
# rpm -ivh zeromq-2.1.7-1.el6.x86_64.rpm
# rpm -ivh zeromq-devel-2.1.7-1.el6.x86_64.rpm
# rpm -ivh jzmq-2.1.0-1.el6.x86_64.rpm
# rpm -ivh jzmq-devel-2.1.0-1.el6.x86_64.rpm

Install the Storm RPM:

# su -
# rpm -ivh storm-0.9.0-1.el6.x86_64.rpm
# rpm -ivh storm-service-0.9.0-1.el6.x86_64.rpm

Set the zookeeper host, nimbus host, and other required properties in the Storm configuration file.
(Reference: http://nathanmarz.github.com/storm/doc/backtype/storm/Config.html )

* storm.zookeeper.servers (STORM_ZOOKEEPER_SERVERS)
* nimbus.host (NIMBUS_HOST)
# vi /opt/storm/conf/storm.yaml

Settings example (a default storm.yaml):

########### These MUST be filled in for a storm configuration ###########
storm.zookeeper.servers:
    - "111.222.333.444"
    - "555.666.777.888"    ## zookeeper hosts
storm.zookeeper.port: 2181
nimbus.host: "111.222.333.444"    ## nimbus host
storm.local.dir: "/mnt/storm"
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703

Start or stop the Storm cluster with the following commands:

Start

# service storm-nimbus start
# service storm-ui start
# service storm-drpc start
# service storm-logviewer start
# service storm-supervisor start

Stop

# service storm-supervisor stop
# service storm-logviewer stop
# service storm-drpc stop
# service storm-ui stop
# service storm-nimbus stop

Storm dependency libraries

Project : Storm
Version : 0.9.0
License : Eclipse Public License 1.0
Source URL : http://storm-project.net/

Project : ZeroMQ
Version : 2.1.7
License : LGPLv3
Source URL : http://www.zeromq.org/

Project : JZMQ
Version : 2.1.0
License : LGPLv3
Source URL : https://github.com/zeromq/jzmq 
