Thursday, April 25, 2013

Job Scheduling for Hadoop.

As we know, Hadoop processes and analyses large amounts of data, of different varieties and at high speed; but to achieve this performance at its maximum level, with a high rate of efficiency, job scheduling is very important.

Hadoop supports three types of scheduling:
1. FIFO Scheduler - First In First Out
2. Fair Scheduler - Each job gets, on average, an equal share of cluster resources.
3. Capacity Scheduler - Queues with guaranteed capacities (supports priorities within queues).

FIFO Scheduler :  
This is the default scheduler. The original scheduling algorithm integrated into the JobTracker was FIFO. In FIFO scheduling, the JobTracker pulls jobs from a work queue, oldest job first. This schedule has no concept of the priority or size of a job, but the approach is simple to implement and efficient.
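
Here is a minimal sketch, assuming classic MapReduce (Hadoop 1.x), of how a client can check which scheduler the cluster is configured to use; if the property is unset, the FIFO scheduler (JobQueueTaskScheduler) is what the JobTracker falls back to:

import org.apache.hadoop.mapred.JobConf;

public class ShowScheduler {
    public static void main(String[] args) {
        // JobConf loads mapred-site.xml from the client classpath; the JobTracker
        // reads the same property to decide which scheduler to plug in.
        JobConf conf = new JobConf();
        String scheduler = conf.get("mapred.jobtracker.taskScheduler",
                "org.apache.hadoop.mapred.JobQueueTaskScheduler"); // FIFO is the default
        System.out.println("Configured task scheduler: " + scheduler);
    }
}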

Fair Scheduler :
Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in a reasonable time while not starving long jobs. It is also an easy way to share a cluster among multiple users. Fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job gets.
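
As a client-side sketch, assuming Hadoop 1.x with the Fair Scheduler enabled on the JobTracker (mapred.jobtracker.taskScheduler set to org.apache.hadoop.mapred.FairScheduler), a job can tag itself with a pool; the pool name "reporting" here is hypothetical and would normally be defined in the scheduler's allocation file:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class FairPoolDemo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(FairPoolDemo.class);
        conf.setJobName("nightly-report");
        // Route this job to a pool; the Fair Scheduler divides free task slots
        // between pools so each gets roughly its fair share over time.
        conf.set("mapred.fairscheduler.pool", "reporting"); // hypothetical pool name
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // The default identity mapper and reducer are enough for this sketch;
        // the point is only how the job is assigned to a pool.
        JobClient.runJob(conf);
    }
}

If no pool is set, the Fair Scheduler typically falls back to a per-user pool, so every user automatically gets a fair share of the cluster.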

Capacity Scheduler : 
The capacity scheduler shares some of the principles of the fair scheduler but has distinct differences, too. First, capacity scheduling was defined for large clusters, which may have multiple, independent consumers and target applications. For this reason, capacity scheduling provides greater control as well as the ability to provide a minimum capacity guarantee and share excess capacity among users.
In capacity scheduling, instead of pools, several queues are created, each with a configurable number of map and reduce slots. Each queue is also assigned a guaranteed capacity (where the overall capacity of the cluster is the sum of each queue's capacity).
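
A quick sketch, again assuming Hadoop 1.x with the Capacity Scheduler plugged into the JobTracker: the client picks a queue per job (the queue name "research" is hypothetical and would be declared by the administrator in capacity-scheduler.xml), and can also list the queues the cluster exposes:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobQueueInfo;

public class QueueDemo {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        // Equivalent to setting mapred.job.queue.name; any job submitted with this
        // conf is charged against the "research" queue's guaranteed capacity.
        conf.setQueueName("research"); // hypothetical queue name
        // List the queues (and their scheduling info) configured on the cluster.
        JobClient client = new JobClient(conf);
        for (JobQueueInfo queue : client.getQueues()) {
            System.out.println(queue.getQueueName() + " : " + queue.getSchedulingInfo());
        }
    }
}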

This scheduler was developed by Yahoo!.

References:
http://hadoop.apache.org/docs/stable/capacity_scheduler.html
http://hadoop.apache.org/docs/stable/fair_scheduler.html

Wednesday, April 24, 2013

Will Hadoop really replace the traditional data warehousing domain?


If someone asks you, 'Will Hadoop be the future of data warehousing? Will it replace traditional data warehousing systems?', what would your reaction be? You would think they must be kidding, and that there is no point in discussing such a question, because you know how important data warehousing is. But that is only if you are unaware of Hadoop.

Yes, a traditional Data Warehouse can in fact address this specific use case reasonably well from an architectural standpoint. But given that the most cutting-edge cloud analytics is happening in Hadoop clusters, it is just a matter of time, one to two years at most, before all data warehouse vendors bring Hadoop into the heart of their architectures. For those vendors who haven't yet committed to full Hadoop integration, the growing real-world adoption of this open-source approach will force their hand.

Where the next-generation Data Warehouse is concerned, the petabyte staging cloud is merely Hadoop's initial footprint. Enterprises are moving rapidly toward the Data Warehouse as the hub for all future analytics. Again, the impressive growth of MapReduce for predictive modeling, data mining (Mahout), and content analytics will practically compel Data Warehouse vendors to optimize their platforms for MapReduce.

Yes, handling Big Data is not just a matter of volume; it means variety of data and velocity as well.
