What would Google do?

Web search for a planet: The Google cluster architecture (2003)

Basic assumptions

Search can be parallelized
High-end hardware fails too
Throughput > peak per-server performance
Queries per second > Seconds per query

↓

Rely on software, not hardware
Use commodity hardware

Parallelize

Focus on parallel queries
Achieve high throughput
Parallelize at CPU or cluster level
Ideal case: a 10 s query parallelized across 10 nodes takes about 1 s

Notes:
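
The arithmetic on this slide can be sketched in a few lines. This is an idealized model that assumes perfectly linear speedup and ignores merge and network overhead; the function names `partitioned` and `replicated` are made up for illustration.

```python
def partitioned(serial_seconds: float, nodes: int) -> tuple[float, float]:
    """One query is split across all nodes (assumes ideal linear speedup)."""
    latency = serial_seconds / nodes   # seconds per query
    throughput = 1 / latency           # queries per second, one query at a time
    return latency, throughput


def replicated(serial_seconds: float, nodes: int) -> tuple[float, float]:
    """Each node answers its own query independently of the others."""
    latency = serial_seconds               # a single query is not sped up
    throughput = nodes / serial_seconds    # but `nodes` queries run at once
    return latency, throughput


print(partitioned(10.0, 10))  # latency and throughput when one query uses all nodes
print(replicated(10.0, 10))   # latency and throughput when each node works alone
```

In this model both layouts deliver the same queries per second on the same hardware; splitting a query across nodes only changes how long a single user waits.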

Query time

Index servers: Inverted Index, document-partitioned
Document servers: Copy of the entire crawled web

Notes:
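
A minimal sketch of what "document-partitioned" means: each shard holds a complete inverted index, but only for its own subset of documents. The toy corpus, the modulo assignment of documents to shards, and the name `build_shards` are assumptions for illustration, not details from the paper.

```python
from collections import defaultdict


def build_shards(docs: dict[int, str], num_shards: int) -> list[dict[str, set[int]]]:
    """Build one inverted index (term -> doc IDs) per shard."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shard = shards[doc_id % num_shards]  # assumed assignment: doc ID modulo shard count
        for term in text.lower().split():
            shard[term].add(doc_id)
    return [dict(shard) for shard in shards]


# Toy corpus (hypothetical).
DOCS = {
    0: "google cluster architecture",
    1: "commodity cluster hardware",
    2: "web search architecture",
    3: "search cluster",
}
SHARDS = build_shards(DOCS, num_shards=2)

for number, shard in enumerate(SHARDS):
    print(number, shard)
```

Because every document lives entirely in one shard, each shard can answer a multi-term query on its own, and the partial results only need to be merged afterwards.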

Query time

Query index servers → doc IDs
Merge doc IDs → result set
Fetch doc IDs from doc servers → (Title, URL, result snippet)
Generate search result page

Notes:
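
The four steps above can be sketched as a small in-memory pipeline. Everything here is a stand-in: the two hard-coded index shards, the `DOC_STORE` dictionary playing the role of the doc servers, the example.com URLs, and the function names are all hypothetical, and ranking is left out entirely.

```python
from concurrent.futures import ThreadPoolExecutor

# Two document-partitioned index shards: term -> doc IDs (toy data).
INDEX_SHARDS = [
    {"cluster": {0}, "architecture": {0, 2}, "search": {2}},
    {"cluster": {1, 3}, "search": {3}},
]

# Stand-in for the doc servers: doc ID -> (title, URL, snippet).
DOC_STORE = {
    0: ("Google cluster architecture", "http://example.com/0", "cluster architecture ..."),
    1: ("Commodity cluster hardware", "http://example.com/1", "commodity hardware ..."),
    2: ("Web search architecture", "http://example.com/2", "web search ..."),
    3: ("Search cluster", "http://example.com/3", "search cluster ..."),
}


def query_shard(shard: dict[str, set[int]], terms: list[str]) -> set[int]:
    """Return the doc IDs in one shard that contain every query term."""
    postings = [shard.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search(query: str) -> list[tuple[str, str, str]]:
    terms = query.lower().split()
    # Step 1: query all index shards in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda shard: query_shard(shard, terms), INDEX_SHARDS))
    # Step 2: merge the partial doc ID sets into one result set.
    doc_ids = sorted(set().union(*partials))
    # Step 3: fetch (title, URL, snippet) for each doc ID.
    return [DOC_STORE[doc_id] for doc_id in doc_ids]


def render_page(results: list[tuple[str, str, str]]) -> str:
    """Step 4: generate a plain-text result page."""
    return "\n\n".join(f"{title}\n{url}\n{snippet}" for title, url, snippet in results)


print(render_page(search("search cluster")))
```

A real deployment would rank the merged doc IDs before fetching and would talk to remote servers rather than local dictionaries; the sketch only shows the order of the steps.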