BA 574 – Data Management
Lecture 8: Big Data and NoSQL
Agenda
• Big Data Concepts
• Big Data Management Strategies
• Hadoop for Big Data Management
• NoSQL vs Relational
Big Data is Everywhere
Data Sizes

Name      | Example(s) of Size
Byte      | A single letter, like "A."
Kilobyte  | 1,024 bytes. An e-mail.
Megabyte  | 1,024 kilobytes. A good-sized book.
Gigabyte  | 1,024 megabytes. A DVD holds about 1-5 gigabytes.
Terabyte  | 1,024 gigabytes. The capacity of a personal computer.
Petabyte  | 1,024 terabytes. The data available on the web in the year 2000 is thought to have occupied 8 petabytes.
Exabyte   | 1,024 petabytes. 161 exabytes of data were created in 2006.
Zettabyte | 1,024 exabytes. In 2010, we created about 1 zettabyte of data.
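Since each binary unit in the table is 1,024 (2^10) times the previous one, the sizes can be computed rather than memorized. A small sketch (the function name is illustrative, not from any library):

```python
# Binary storage units, each 1024 (2**10) times the previous one.
UNITS = ["Byte", "Kilobyte", "Megabyte", "Gigabyte", "Terabyte",
         "Petabyte", "Exabyte", "Zettabyte"]

def size_in_bytes(unit):
    """Return how many bytes one of the given unit holds."""
    return 1024 ** UNITS.index(unit)

print(size_in_bytes("Kilobyte"))  # 1024
print(size_in_bytes("Gigabyte"))  # 1073741824
```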
Big Data Characteristics
• Volume: Quantity of data to be stored
• Velocity: Speed at which data enters the system and must be processed
• Variety: Variations in the structure of the data to be stored
• Other characteristics:
▫ Variability: Changes in the meaning of data based on context
  Sentiment analysis, for example, attempts to determine attitude
▫ Veracity: Trustworthiness of the data
▫ Value: Degree to which the data can be analyzed for meaningful insight
▫ Visualization: Ability to graphically represent data to make it understandable to users
Current View of Big Data
Big Data Management Strategies
• The overall goal is to enable parallel and cost-effective computation.
• Three strategies:
▫ Employ redundant but relatively inexpensive components to control costs
▫ Minimize joins to compartmentalize data requests
▫ Partition datasets to spread the workload
Multiple Inexpensive Components
• Systems divide and conquer big jobs; many computers share the load
• Save money by using many cheap computers in place of one very expensive machine
• … But more computers = more failures
[Diagram: a cluster of master nodes and data nodes, with a "?" marking a failed node that the cluster must tolerate]
Minimize Joining Data
9
Products
ID | Price | Desc
1  | $8    | iPhone Cable
2  | $15   | Wireless Keyboard
3  | $750  | iPhone 6S

Carts
ID   | Customer
1001 | S. Green
1002 | J. Sousa
2134 | J. Kim

PIC (products-in-carts)
ProdId | CartID | Qty
3      | 1001   | 1
2      | 1002   | 1
1      | 1001   | 1

Cart Document
CartId: 1001
Customer: "S. Green"
Items:
  PId: "1", Dsc: "iPhone Cable", $: "8"
  PId: "3", Dsc: "iPhone 6S", $: "750"

Relational approach, GetCartData(1001):
1) GetCart(1001)
2) GetCProds(1001)
3) GetProd(1, 3)
4) SendResult()

Document approach:
GetCart(1001)
SendResult()
The hierarchical data structure reduces the need for joins in high-volume, high-velocity transactions
Spread Workload
[Diagram: three web servers, each backed by a master node and data node, holding the Jan-Apr, May-Aug, and Sep-Dec partitions of the data]
Goal: balance the load across the servers.
Throughput is limited if all the data in use at one time is on one server.
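The partitioning in the diagram is range-based: each server owns one third of the year. A minimal sketch of such a routing function (the partition names and function are hypothetical, mirroring the three-server example):

```python
# Hypothetical range partitioning: each server owns four months of data,
# so requests for different parts of the year hit different servers.
PARTITIONS = [("Jan-Apr", range(1, 5)),
              ("May-Aug", range(5, 9)),
              ("Sep-Dec", range(9, 13))]

def server_for_month(month):
    """Route a request to the server holding that month's partition."""
    for server, months in PARTITIONS:
        if month in months:
            return server
    raise ValueError("month must be 1-12")

print(server_for_month(2))   # Jan-Apr
print(server_for_month(11))  # Sep-Dec
```

Range partitioning keeps related data together but can create hot spots (e.g. everyone querying the current month); hash partitioning is a common alternative when the access pattern is uniform.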
Hadoop: Implementing Big Data Management Strategies
• De facto standard for most Big Data storage and processing
• Java-based framework for distributing and processing very large data
sets across clusters of computers
• Most important components:
▫ Hadoop Distributed File System (HDFS): Store large data on a cluster of
inexpensive nodes.
▫ MapReduce: Programming model that supports processing large data sets
on a cluster of inexpensive nodes.
Hadoop Distributed File System (HDFS)
• Uses several types of nodes (computers):
▫ Data nodes store the actual file data
▫ The name node contains the file system metadata
▫ Client nodes make requests to the file system as needed to support user applications
▫ Data nodes communicate with the name node by regularly sending block reports and heartbeats
Figure 14.4 – Hadoop Distributed File System
(HDFS)
MapReduce
• Framework used to process large data sets across clusters
▫ Breaks down complex tasks into smaller subtasks, performs the subtasks, and produces a final result
▫ The map function takes a collection of data and sorts and filters it into a set of key-value pairs
  A mapper program performs the map function
▫ The reduce function summarizes the results of the map function to produce a single result
  A reducer program performs the reduce function
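The classic illustration of this model is word counting. The sketch below runs the map, shuffle, and reduce phases in a single process; a real Hadoop job would distribute them across the cluster's nodes:

```python
from collections import defaultdict

def map_fn(document):
    """Map: emit a (word, 1) key-value pair for every word."""
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    """Reduce: collapse all counts for one word into a single total."""
    return word, sum(counts)

def map_reduce(documents):
    """Run map over each document, shuffle by key, then reduce."""
    grouped = defaultdict(list)
    for doc in documents:                 # map phase
        for word, count in map_fn(doc):
            grouped[word].append(count)   # shuffle: group by key
    return dict(reduce_fn(w, c) for w, c in grouped.items())  # reduce

print(map_reduce(["big data", "big deal"]))
# {'big': 2, 'data': 1, 'deal': 1}
```

Because each document maps independently and each key reduces independently, both phases parallelize naturally across inexpensive nodes, which is exactly the strategy Hadoop implements at scale.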