Skip to content

Latest commit

 

History

History
44 lines (39 loc) · 1.98 KB

hadoop.md

File metadata and controls

44 lines (39 loc) · 1.98 KB

Hadoop

  • horizontal scaling vs vertical scaling. Hadoop is multiple machines being one thing.
  • Premise: Shipping data is expensive so let's put it in a "data ecosystem" & ship the computation closer to the data
  • HDFS: distributed file cluster that acts like one hard drive
  • YARN: yet-another-resource negotiator; like MESOS but for hadoop
  • MapReduce: A mapper distributes data to multiple computers across the cluster; a reducer put it back together (even if a few servers fail)
  • Pig: A procedural language that looks like SQL script, but translates that automatically to mappers and reducers
  • Hive: A database that reads data with mappers and reducers

HDFS

  • Fault-tolerant, economical, higher throughput than just 1 FS.
  • Files are replicated in a fault-tolerant way
  • Master-Slave Architecture
  • Master = Name Node. Slaves = data nodes

Name Node

  • NameNode is what the client contacts. Data is chunked & split among data nodes.
  • 2 name nodes b/c resilency
  • Records any change to metadata like rename,create,remove,move.
  • 2 files: EditLog & FSImage. One for edits the other for heirarchy.
  • Metadata is stored in-memory for fast lookups

Data Nodes

  • Manage names & locations of file-blocks.
  • config replication @ cluster/file lvl

YARN

  • Yet Another Resource Negotiator
  • Used to be part of Map/Reduce but was split out
  • Think Mesos. It's a scheduling layer that allows you to delegate jobs to free nodes w/i your hadoop cluster.
  • Works with HDFS to run computations based on data locality
  • Mesos vs Yarn: A node is allowed to accept/reject a job in Mesos.

Map-Reduce

Tez

  • Solves the same problem as Map-Reduce, but optimized for interactive queries
  • Map Reduce is meant for batch workloads
  • MR can be slow because it's always contacting HDFS & doing disk-reads + disk-writes

Sqoop

  • Simplified Data Transfer from HDFS to RDBMS
  • Simplified Data Transfer from RDBMS to HDFS
  • PRO: takes care of Hadoop provisioning, schema maintenance, with no code
  • CON: Supported connectors ONLY