What is Big Data

Big Data is not a replacement for existing analytical systems such as cubes and data warehouses. Big Data processes both remake and complement existing analytic workflows by simplifying the production of structured information from emerging “ambient” data sources.

When you have non-traditional data sources such as social media, IoT devices, or automated robotics, Big Data lets you turn this unstructured or semi-structured data into usable analytical data. Here are the key points:

  • Enabling rapid sense-making over un-enriched and un-modeled data
  • Enabling analytics at scale over ambient data
  • Enabling creation of ambient data driven models
  • Existing systems enable sense-making over modeled data
  • There is tremendous potential value in making sense of ambient data

 

Comparison Chart between an RDBMS System and a Big Data based MapReduce

As you process more and more data and still want interactive response times, you typically need more expensive hardware to support the infrastructure. Failures at the disk and network level can be quite problematic, and maintaining ACID (atomicity, consistency, isolation, durability) guarantees can be a challenge.

You can work around this problem with more expensive hardware and systems, such as database appliances from Oracle or Microsoft (Essbase, PDW), but adoption tends to be limited because of the high cost.

With Big Data and Hadoop, we use commodity hardware, without the need for specialized and expensive network and disk equipment. We give up strict ACID guarantees, but we get BASE (basically available, soft state, eventually consistent).

Map Reduce (Split, Shuffle)
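
To make the split, map, shuffle, and reduce phases concrete, here is a minimal in-memory sketch in Python. It is not Hadoop code: the function names (`split_input`, `map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for illustration, and a real job would run each phase across many machines.

```python
from collections import defaultdict


def split_input(text, num_splits):
    """Split: divide the input lines into roughly equal chunks."""
    lines = text.splitlines()
    return [lines[i::num_splits] for i in range(num_splits)]


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(text, num_splits=2):
    """Run the four phases in sequence (hypothetical helper)."""
    chunks = split_input(text, num_splits)
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    sample = "big data is big\nhadoop processes big data"
    print(word_count(sample))
```

The shuffle step is the one that moves data between machines in a real cluster, because all values for a given key have to end up at the same reducer.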

The Hadoop Ecosystem

What is NoSQL?

Broadly put, NoSQL is analogous to OLTP if you think of Hadoop as a BI system. NoSQL covers many different data stores:

  • HBase
  • Cassandra
  • MongoDB
  • Couchbase
  • Memcached and more.

Some of these, notably HBase, are implementations of Google’s Bigtable design, a distributed storage system for managing structured data at very large sizes.

A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
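
The definition above can be illustrated with a toy in-memory model. The sketch below is only an illustration of the data model, not how Bigtable or HBase is implemented: the `SparseTable` class, its method names, and the sample row and column keys are all hypothetical, and for simplicity it sorts timestamps in ascending order.

```python
class SparseTable:
    """Toy model of a map keyed by (row key, column key, timestamp)."""

    def __init__(self):
        # Only cells that were written are stored, which makes the map sparse.
        self._cells = {}  # (row, column, timestamp) -> bytes

    def put(self, row, column, timestamp, value):
        """Store one version of a cell; values are uninterpreted bytes."""
        if not isinstance(value, bytes):
            raise TypeError("values must be bytes")
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = [
            (ts, value)
            for (r, c, ts), value in self._cells.items()
            if r == row and c == column and (timestamp is None or ts <= timestamp)
        ]
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan(self, start_row, end_row):
        """Yield (key, value) in sorted key order for start_row <= row < end_row."""
        for key in sorted(self._cells):
            if start_row <= key[0] < end_row:
                yield key, self._cells[key]


if __name__ == "__main__":
    table = SparseTable()
    table.put("com.example", "contents:html", 1, b"<html>v1</html>")
    table.put("com.example", "contents:html", 2, b"<html>v2</html>")
    table.put("com.example", "anchor:home", 1, b"Home")
    print(table.get("com.example", "contents:html"))
    print(list(table.scan("com.a", "com.z")))
```

Because keys are kept in sorted order, rows with nearby keys can be read together with a range scan, which is the access pattern this kind of store is designed around.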

What is HBase?

  • Efficient at Random Reads/Writes
  • Distributed, large scale data store
  • Utilizes Hadoop for persistence
  • Both HBase and Hadoop are distributed

Cassandra implementation at Netflix

Source: http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global-cassandra

Where did Cassandra originate?

Cassandra was originally created at Facebook. For Facebook messaging, however, Facebook decided to use HBase instead.

Source: http://www.slideshare.net/danirayan/streaming-map-reduce

What is Hive, and what are Hive queries?

This is the favourite method for most SQL professionals because it uses SQL-like queries to query data in Hadoop. It is a “data warehouse” system for Hadoop.

With Hive, you can do the following:

  • Analysis of large datasets stored in HDFS
  • SQL-like interface
  • No Java programming needed
  • Ad-hoc queries via HiveQL (translated into MapReduce)

You can connect from Power Query, Power BI, or PowerPivot for Excel using the Hive ODBC driver or native connectors.

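Example Hive queries:

-- Managed table: Hive owns the data and moves the file into its warehouse directory.
CREATE TABLE indro_managed (bar INT);
LOAD DATA INPATH '/user/larar/data.txt'
INTO TABLE indro_managed;

-- External table: Hive only records the location; dropping the table leaves the files in place.
-- LOAD DATA INPATH moves the source file, so this assumes data.txt is present at the path again.
CREATE EXTERNAL TABLE indro_external (bar INT)
LOCATION '/user/larar/indro_external';
LOAD DATA INPATH '/user/larar/data.txt'
INTO TABLE indro_external;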

Comparison Table for RDBMS and Hive

What is Mahout?

It is a scalable machine learning library that leverages the Hadoop infrastructure.

Key use cases:

  • Recommendation mining: Examine user behavior, build recommendation model
  • Clustering: Grouping data into related topics
  • Classification: Learn from classified documents to assign categories to unlabeled data

What is R Programming?

  • Statistical computing and graphing programming language
  • RHIPE: R and Hadoop Integration
  • Open source GNU Project

What is Sqoop?

It is a data connector system for Hadoop and RDBMS. It supports:

  • Importing RDBMS data to files (delimited or sequence) in HDFS, or tables in Hive
  • Importing RDBMS query results to files (delimited or sequence) in HDFS, or tables in Hive
  • Exporting files and Hive tables to RDBMS tables
  • Executing MapReduce jobs to transfer data in parallel with fault tolerance
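
The parallel transfer in the last bullet can be pictured as dividing a numeric key range among several map tasks, each of which copies its own slice. The sketch below shows that idea in plain Python. It is not the tool's own code: `split_key_range` and its parameters are hypothetical names, and it assumes the table has an integer key column.

```python
def split_key_range(min_key, max_key, num_mappers):
    """Divide the inclusive range [min_key, max_key] into contiguous slices.

    Returns a list of (low, high) inclusive bounds, at most one per mapper,
    so that every key falls in exactly one slice.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_key < min_key:
        raise ValueError("max_key must not be smaller than min_key")

    total = max_key - min_key + 1
    # Never create more slices than there are keys.
    num_slices = min(num_mappers, total)
    base, extra = divmod(total, num_slices)

    ranges = []
    low = min_key
    for i in range(num_slices):
        # The first `extra` slices take one additional key each.
        size = base + (1 if i < extra else 0)
        high = low + size - 1
        ranges.append((low, high))
        low = high + 1
    return ranges


if __name__ == "__main__":
    for low, high in split_key_range(1, 10, 4):
        print(f"SELECT * FROM source_table WHERE id >= {low} AND id <= {high}")
```

Each slice maps to an independent query, so a failed task can be retried on its own without repeating the whole transfer.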

What is Pig?

It is a data-flow platform for transforming and analyzing HDFS data. It has the following benefits:

  • Scripting – No Java Programming Needed!
  • Focus on semantics, not on implementation
  • Extensible through user defined functions and methods
  • Pig can operate on data whether it has metadata or not.
  • Pig is not tied to one particular parallel framework.
  • Pig is designed to be easily controlled and modified by its users.
  • Pig processes data quickly.

Read more about Pig here: http://pig.apache.org/philosophy.html

Workflow, Management & Monitoring with Oozie, Ambari, & ZooKeeper

What is Oozie?

It is a workflow processing system. Users define a series of jobs written in multiple languages and link them to one another. For example, a particular query is initiated only after the earlier jobs it relies on for data have completed.
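
The "run this job only after its inputs are ready" rule amounts to ordering jobs by their dependencies. Here is a minimal Python sketch of that ordering using the standard library's `graphlib` module (Python 3.9 or later). It does not use Oozie or its workflow format: `execution_order` and the job names are made up for illustration.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return job names in an order that respects every dependency.

    `dependencies` maps each job to the set of jobs that must finish first.
    A circular dependency raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    # Hypothetical workflow: the report query needs both upstream jobs.
    workflow = {
        "import_sales": set(),
        "clean_sales": {"import_sales"},
        "import_customers": set(),
        "join_report": {"clean_sales", "import_customers"},
    }
    print(execution_order(workflow))
```

Jobs with no dependency between them, such as the two imports here, may appear in either order and could run in parallel.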

What is Ambari?

It is a management system for monitoring a Hadoop system. With Ambari you can:

  • Install: Wizard for installing Hadoop services across any number of nodes
  • Manage: Central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster
  • Monitor: Dashboard for monitoring health and status of the Hadoop cluster. Sends email alerts when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.)

What is ZooKeeper?

It is a centralized service for maintaining configuration information and naming, as well as:

  • Providing distributed synchronization
  • Providing group services
  • High throughput, low latency, highly available, strictly ordered access

What is HCatalog?

It provides centralized metadata management for shared schemas and data types, and for table storage.

  • Notifications via Java Message Service (JMS)
  • Works across Pig, MapReduce, and Hive

I hope this introductory post gave you a good understanding of what Big Data is all about. If this was helpful, do not forget to give your feedback in the comments section. Cheers!
