Data-Intensive Super Computing (DISC)
System Challenges:
  Data distributed over many disks
  Compute using many processors
  Connected by gigabit Ethernet (or equivalent)
System Requirements:
  Lots of disks
  Lots of processors
  Located in close proximity
System Comparison:
(i) Data
  Conventional Supercomputers:
    Data stored in separate repository
      No support for collection or management
    Brought into system for computation
      Time consuming
      Limits interactivity
  DISC:
    System collects and maintains data
      Shared, active data set
    Computation colocated with storage
      Faster access
(ii) Programming Models
  Conventional Supercomputers:
    Programs described at very low level
      Specify detailed control of processing & communications
    Rely on small number of software packages
      Written by specialists
      Limits classes of problems & solution methods
  DISC:
    Application programs written in terms of high-level operations on data
    Runtime system controls scheduling, load balancing, …
(iii) Interaction
Conventional Supercomputers
|
DISC
|
Main Machine: Batch Access
Priority is to conserve machine resources
User submits job with specific resource requirements
Run in batch mode when resources available
Offline Visualization
Move results to separate facility for interactive use
|
Interactive Access
Priority is to conserve human resources
User action can range from simple query to complex computation
System supports many simultaneous users
Requires flexible programming and runtime environment
|
(iv) Reliability
  Conventional Supercomputers:
    “Brittle” systems
      Main recovery mechanism is to recompute from most recent checkpoint
      Must bring down system for diagnosis, repair, or upgrades
  DISC:
    Flexible error detection and recovery
      Runtime system detects and diagnoses errors
      Selective use of redundancy and dynamic recomputation
      Replace or upgrade components while system is running
      Requires flexible programming model & runtime environment
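The dynamic-recomputation idea can be sketched in a few lines: rather than restarting the whole job from a checkpoint, the runtime reschedules only the task that failed. A toy Python illustration (function names invented for this sketch; node failures are simulated with a seeded random number generator):

```python
import random

random.seed(1)

def flaky_square(x):
    """A task that fails transiently about half the time."""
    if random.random() < 0.5:
        raise RuntimeError("simulated node failure")
    return x * x

def run_with_recovery(task, inputs, max_retries=5):
    """Re-run only the failed task, not the whole job."""
    results = []
    for x in inputs:
        for attempt in range(max_retries):
            try:
                results.append(task(x))
                break
            except RuntimeError:
                continue  # reschedule just this task
        else:
            raise RuntimeError(f"task {x} failed {max_retries} times")
    return results

results = run_with_recovery(flaky_square, [1, 2, 3])
# with this seed: results == [1, 4, 9]
```

A checkpoint-based system would instead roll the entire computation back, discarding the work of every task that had already succeeded.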
Comparing with Grid Computing:
  Grid: distribute computing and data
    (i) Computation: distribute problem across many machines
      Generally only those with easy partitioning into independent subproblems
    (ii) Data: support shared access to large-scale data sets
  DISC: centralize computing and data
    (i) Enables more demanding computational tasks
    (ii) Reduces time required to get data to machines
    (iii) Enables more flexible resource management
A Commercial DISC:
  Netezza Performance Server (NPS)
    Designed for “data warehouse” applications
      Heavy-duty analysis of large databases
    Data distributed over up to 500 Snippet Processing Units
      Each with disk storage, a dedicated processor, and an FPGA controller
    User “programs” expressed in SQL
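To illustrate what “programs expressed in SQL” means, here is a standalone sketch using Python's built-in sqlite3 module as a single-node stand-in (the table and query are invented examples, and NPS is not SQLite): the user states the aggregation declaratively, and the engine decides how to execute it, which is what lets a system like NPS spread the work across its storage units.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# The whole "program": a declarative aggregation that the engine
# is free to plan and parallelize however it likes.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()
# rows contains ('east', 150.0) and ('west', 250.0)
```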
Constructing DISC:
  Hardware: rent from Amazon
    Elastic Compute Cloud (EC2)
      Generic Linux cycles for $0.10 / hour ($877 / year)
    Simple Storage Service (S3)
      Network-accessible storage for $0.15 / GB / month ($1800 / TB / year)
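The yearly figures follow directly from the quoted rates; a quick arithmetic check (the $877 EC2 figure appears to assume a 365.25-day year):

```python
# EC2: $0.10 per instance-hour, running continuously
ec2_per_year = 0.10 * 24 * 365        # 876.0 (877 with a 365.25-day year)

# S3: $0.15 per GB per month, for 1 TB (1000 GB)
s3_per_tb_year = 0.15 * 1000 * 12     # 1800.0 = $1800 / TB / year

print(round(ec2_per_year), round(s3_per_tb_year))
```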
  Software: use open source
    Hadoop Project
      Open-source project providing a distributed file system and MapReduce
      Supported and used by Yahoo
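Hadoop exposes its MapReduce model to scripting languages through Hadoop Streaming, where the mapper and reducer are ordinary programs reading lines on stdin and writing tab-separated key/value pairs on stdout. The sketch below simulates that pipeline in-process with StringIO; in a real job, Hadoop itself performs the sort-based shuffle between the two stages:

```python
from io import StringIO

def mapper(stream, out):
    """Streaming-style mapper: for each input line, emit 'word<TAB>1'."""
    for line in stream:
        for word in line.split():
            out.write(f"{word}\t1\n")

def reducer(stream, out):
    """Streaming-style reducer: input arrives sorted by key;
    sum each run of lines sharing the same key."""
    current, total = None, 0
    for line in stream:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                out.write(f"{current}\t{total}\n")
            current, total = word, 0
        total += int(count)
    if current is not None:
        out.write(f"{current}\t{total}\n")

# Simulate the framework: map, sort (the shuffle), then reduce.
mapped = StringIO()
mapper(StringIO("a b a\nb a\n"), mapped)
shuffled = StringIO("".join(sorted(mapped.getvalue().splitlines(keepends=True))))
reduced = StringIO()
reducer(shuffled, reduced)
# reduced.getvalue() == "a\t3\nb\t2\n"
```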
Implementing System Software:
  Programming Support
    Abstractions for computation & data representation
      E.g., Google: MapReduce & BigTable
    Usage models
  Runtime Support
    Allocating processing and storage
    Scheduling multiple users
    Implementing the programming model
  Error Handling
    Detecting errors
    Dynamic recovery
    Identifying failed components