Please see the following blog posts for the latest updates:
- ETL Language Showdown - Sept. 2014
- ETL Language Showdown Part 2 - Now with Python - May 2015
This repo implements the same map-reduce ETL (Extract-Transform-Load) task in multiple languages to compare language productivity, terseness, and readability. The performance comparisons should not be taken seriously. If anything, they say more about my skill in each language than about the languages' performance capabilities.
Count the number of tweets that mention 'knicks' in their message and bucket the counts by neighborhood of origin. The ~1GB dataset for this task, sampled below, contains each tweet's message and its NYC neighborhood. It can be downloaded here.
91 west-brighton Brooklyn Uhhh
121 turtle-bay-east-midtown Manhattan Say anything
175 morningside-heights Manhattan It feels half-cheating half-fulfilling to cite myself.
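
The task above can be sketched as a map step and a reduce step. This is a minimal illustration, not the code from any implementation in this repo. It assumes each record is tab-separated as id, neighborhood, borough, message, and that matching is case-insensitive; neither detail is stated above. The `map_chunk` and `reduce_counts` names and the rows with ids 900 and up are made up for the example.

```python
import re
from collections import Counter

# Assumption: case-insensitive match on the word "knicks".
KNICKS = re.compile(r"knicks", re.IGNORECASE)


def map_chunk(lines):
    """Map step: count matching tweets per neighborhood for one chunk of lines."""
    counts = Counter()
    for line in lines:
        # Assumption: fields are id, neighborhood, borough, message, tab-separated.
        fields = line.rstrip("\n").split("\t", 3)
        if len(fields) < 4:
            continue  # skip malformed rows
        _id, neighborhood, _borough, message = fields
        if KNICKS.search(message):
            counts[neighborhood] += 1
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counters into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Rows 91 and 121 follow the dataset sample; rows 900-902 are hypothetical.
chunk_a = [
    "91\twest-brighton\tBrooklyn\tUhhh",
    "900\twest-brighton\tBrooklyn\tGo Knicks!",
]
chunk_b = [
    "121\tturtle-bay-east-midtown\tManhattan\tSay anything",
    "901\twest-brighton\tBrooklyn\tknicks game tonight",
    "902\tmorningside-heights\tManhattan\tKNICKS",
]

totals = reduce_counts(map(map_chunk, [chunk_a, chunk_b]))
for neighborhood, count in totals.most_common():
    print(neighborhood, count)
```

Because each chunk is mapped independently, the built-in `map` call could be swapped for `multiprocessing.Pool.map` to run the map step across processes, with the reduce step left unchanged.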
- These tasks are not run on Hadoop, but they do run concurrently. Performance numbers are moot since the CPU mostly sits idle waiting on disk IO.
- **UPDATE:** Boy, was the IO-bound assumption wrong.
- Ruby 2.2.2
- Golang 1.4.2 - Imperative
- Scala 2.11.4 - Both Imperative and Functional
- Elixir 1.0.4 - Functional
- Python 3
- The Scala implementation uses Akka (Supervisors and Actors)
| Implementation | Time |
| --- | --- |
| Ruby w/ Celluloid (Global Interpreter Lock bound, single core) | 43.7s |
| JRuby w/ Celluloid | 15.8s |
| Ruby w/ grosser/parallel (not GNU Parallel) | 10.9s |
| Python w/ Pool | 11.7s |
| Elixir | 21.8s |
| Scala | 8.8s |
| Scala w/ Substring (skipped regex for performance analysis) | 8.3s |
| Golang | 32.8s |
| Golang w/ Substring (skipped regex for performance analysis) | 7.8s |
| Node w/ Cluster | TODO |
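
The "w/ Substring" rows replace the regex match with a plain substring check. The difference between the two predicates can be sketched in Python as below. This is an illustration only: the function names and messages are made up, the match here is case-sensitive for simplicity, and timings from Python say nothing about how regex and substring matching compare in Go or Scala.

```python
import re
import timeit

KNICKS = re.compile("knicks")


def matches_regex(message):
    """Regex-based predicate."""
    return KNICKS.search(message) is not None


def matches_substring(message):
    """Plain substring predicate; no regex engine involved."""
    return "knicks" in message


# Hypothetical messages, repeated to give the timer something to measure.
messages = ["Say anything", "knicks game tonight", "Uhhh"] * 1000

# Both predicates should agree on which messages match.
regex_hits = sum(matches_regex(m) for m in messages)
substring_hits = sum(matches_substring(m) for m in messages)

regex_seconds = timeit.timeit(
    lambda: [matches_regex(m) for m in messages], number=5
)
substring_seconds = timeit.timeit(
    lambda: [matches_substring(m) for m in messages], number=5
)
print(f"regex: {regex_seconds:.4f}s, substring: {substring_seconds:.4f}s")
```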