- horizontal scaling vs vertical scaling. Hadoop is multiple machines being one thing.
- Premise: Shipping data is expensive so let's put it in a "data ecosystem" & ship the computation closer to the data
- HDFS: distributed file cluster that acts like one hard drive
- YARN: yet-another-resource negotiator; like MESOS but for hadoop
- MapReduce: A mapper distributes data to multiple computers across the cluster; a reducer put it back together (even if a few servers fail)
- Pig: A procedural language that looks like SQL script, but translates that automatically to mappers and reducers
- Hive: A database that reads data with mappers and reducers
- Fault-tolerant, economical, higher throughput than just 1 FS.
- Files are replicated in a fault-tolerant way
- Master-Slave Architecture
- Master = Name Node. Slaves = data nodes
- NameNode is what the client contacts. Data is chunked & split among data nodes.
- 2 name nodes b/c resilency
- Records any change to metadata like rename,create,remove,move.
- 2 files: EditLog & FSImage. One for edits the other for heirarchy.
- Metadata is stored in-memory for fast lookups
- Manage names & locations of file-blocks.
- config replication @ cluster/file lvl
- Yet Another Resource Negotiator
- Used to be part of Map/Reduce but was split out
- Think Mesos. It's a scheduling layer that allows you to delegate jobs to free nodes w/i your hadoop cluster.
- Works with HDFS to run computations based on data locality
- Mesos vs Yarn: A node is allowed to accept/reject a job in Mesos.
- Solves the same problem as Map-Reduce, but optimized for interactive queries
- Map Reduce is meant for batch workloads
- MR can be slow because it's always contacting HDFS & doing disk-reads + disk-writes
- Simplified Data Transfer from HDFS to RDBMS
- Simplified Data Transfer from RDBMS to HDFS
- PRO: takes care of Hadoop provisioning, schema maintenance, with no code
- CON: Supported connectors ONLY