Web search for a planet: The Google cluster architecture (2003)
- Search can be parallelized
- High-end hardware fails too
- Throughput > peak single-machine performance
- Queries per second > seconds per query
↓
- Rely on software, not hardware
- Use commodity hardware
- Focus on parallel queries
- Achieve high throughput
- Parallelize at CPU or Cluster level
- Ideally, a 10 s query parallelized across 10 nodes takes about 1 s
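
The 10-node arithmetic can be sketched as a tiny helper. It assumes the work splits perfectly evenly; the `overhead_seconds` parameter is a hypothetical addition (not from the slides) standing in for the cost of fanning out and merging, which is why real speedups fall somewhat short of the ideal.

```python
def ideal_parallel_time(serial_seconds: float, nodes: int,
                        overhead_seconds: float = 0.0) -> float:
    """Wall-clock time when work divides evenly across nodes.

    overhead_seconds is a hypothetical fixed coordination cost
    (fan-out plus merge); it is an assumption for illustration.
    """
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    return serial_seconds / nodes + overhead_seconds


# The slide's example: a 10 s query on 10 nodes, no overhead.
print(ideal_parallel_time(10.0, 10))
# Same query with an assumed 0.2 s of coordination overhead.
print(ideal_parallel_time(10.0, 10, overhead_seconds=0.2))
```

Latency per query improves only as far as the split allows, but adding nodes also raises total queries per second, which is the metric the design optimizes for.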
Notes:
- Index servers: inverted index, partitioned by document into shards
- Document servers: copy of the entire web
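
A minimal sketch of a document-partitioned inverted index, to make the term concrete: each shard holds a complete inverted index, but only for its own subset of documents. The function name `build_shards`, the whitespace tokenizer, and the modulo shard assignment are all assumptions chosen for a deterministic toy example, not details from the paper.

```python
from collections import defaultdict


def build_shards(docs: dict[int, str], num_shards: int) -> list[dict[str, set[int]]]:
    """Build one inverted index (term -> doc IDs) per shard.

    Assumption: documents are assigned to shards by doc_id modulo
    num_shards so the example is reproducible.
    """
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shard = shards[doc_id % num_shards]
        for term in text.lower().split():  # naive tokenizer, illustration only
            shard[term].add(doc_id)
    return [dict(s) for s in shards]


docs = {
    0: "cheap commodity hardware",
    1: "commodity clusters scale",
    2: "hardware fails",
    3: "software handles failures",
}
shards = build_shards(docs, num_shards=2)
print(shards[0]["hardware"])   # doc IDs on shard 0 containing "hardware"
print(shards[1]["commodity"])  # doc IDs on shard 1 containing "commodity"
```

Because every shard can answer any term for its own documents, a query must be sent to all shards, and each shard does roughly 1/N of the work.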
Notes:
- Query index servers → doc IDs
- Merge doc IDs → result set
- Look up doc IDs on doc servers → (title, URL, query-specific snippet)
- Generate search result page
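
The four steps above can be sketched as a scatter-gather over in-memory stand-ins. Everything named here (`INDEX_SHARDS`, `DOC_STORE`, `query_shard`, `search`, the example.com URLs) is hypothetical; relevance ranking is omitted, and results are simply ordered by doc ID as a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for index servers: one inverted index per shard.
INDEX_SHARDS = [
    {"commodity": {0}, "hardware": {0, 2}, "fails": {2}},
    {"commodity": {1}, "clusters": {1}, "software": {3}},
]

# Hypothetical stand-in for document servers: doc ID -> (title, URL, snippet).
DOC_STORE = {
    0: ("Cheap hardware", "http://example.com/0", "cheap commodity hardware"),
    1: ("Clusters", "http://example.com/1", "commodity clusters scale"),
    2: ("Failure", "http://example.com/2", "hardware fails"),
    3: ("Software", "http://example.com/3", "software handles failures"),
}


def query_shard(shard: dict[str, set[int]], terms: list[str]) -> set[int]:
    """Step 1: one shard returns the doc IDs that contain every term."""
    postings = [shard.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def search(query: str) -> str:
    terms = query.lower().split()
    # Step 1: query all index shards in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda s: query_shard(s, terms), INDEX_SHARDS))
    # Step 2: merge per-shard doc IDs into one result set (no ranking here).
    doc_ids = sorted(set().union(*partials))
    # Step 3: look up title, URL, and snippet for each doc ID.
    results = [DOC_STORE[doc_id] for doc_id in doc_ids]
    # Step 4: render a plain-text result page.
    return "\n".join(f"{title} <{url}>: {snippet}" for title, url, snippet in results)


print(search("commodity"))
```

Each shard works independently, so the index lookup in step 1 is where the per-query parallelism comes from; the merge and document lookups are comparatively cheap.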
Notes: