
Count the number of times certain words were said in a particular neighborhood. Performed as a basic MapReduce job against 25M tweets. Implemented in different programming languages as an educational exercise.


maxgrenderjones/etl-language-comparison

 
 


Update

Please see the following blog posts for the latest updates:

  1. ETL Language Showdown - Sept. 2014
  2. ETL Language Showdown Part 2 - Now with Python - May 2015

ETL Language Showdown

This repo implements the same MapReduce ETL (Extract-Transform-Load) task in multiple languages in an effort to compare language productivity, terseness and readability. The performance comparisons should not be taken seriously. If anything, they say more about my skill in each language than about the languages' performance capabilities.

The Task

Count the number of tweets that mention 'knicks' in their message and bucket the counts by neighborhood of origin. The ~1GB dataset for this task, sampled below, contains each tweet's message and its NYC neighborhood. It can be downloaded here.

91	west-brighton	Brooklyn	Uhhh
121	turtle-bay-east-midtown	Manhattan	Say anything
175	morningside-heights	Manhattan	It feels half-cheating half-fulfilling to cite myself.
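For a single chunk of input, the task can be sketched in Python as follows. This is a minimal illustration, not the code from this repo. It assumes the columns seen in the sample above (id, neighborhood, borough, message, separated by tabs) and a case-insensitive match, which the task description does not specify. The function name `count_mentions` and the third sample row are invented for the example.

```python
import re
from collections import Counter

# Assumption: matching is case-insensitive; the task only says "mention 'knicks'".
KNICKS = re.compile(r"knicks", re.IGNORECASE)


def count_mentions(lines, pattern=KNICKS):
    """Map each matching tweet to its neighborhood, then reduce by counting."""
    counts = Counter()
    for line in lines:
        # Assumed columns: id, neighborhood, borough, message (tab-separated).
        # maxsplit=3 keeps any tabs inside the message itself intact.
        fields = line.rstrip("\n").split("\t", 3)
        if len(fields) < 4:
            continue  # skip malformed rows
        _id, neighborhood, _borough, message = fields
        if pattern.search(message):
            counts[neighborhood] += 1
    return counts


sample = [
    "91\twest-brighton\tBrooklyn\tUhhh",
    "121\tturtle-bay-east-midtown\tManhattan\tSay anything",
    "200\twest-brighton\tBrooklyn\tGo Knicks!",  # invented row so one tweet matches
]
print(count_mentions(sample))
```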

Initial Assumption

  • These tasks are not run on Hadoop but do run concurrently. Performance numbers are moot since the CPU mostly sits idle waiting on disk I/O.
  • UPDATE: Boy, was the I/O-bound assumption wrong.
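Running concurrently without Hadoop can be sketched with a process pool: one map task per input file, followed by a reduce step that merges the partial counts. This is a rough sketch rather than the repo's implementation. It assumes the dataset is split across several tab-separated files, and the names `map_file`, `reduce_counts` and `run` and the default of 4 workers are hypothetical. Since the work turned out not to be I/O bound, spreading the parsing and matching over several cores is where the gain would come from.

```python
import sys
from collections import Counter
from multiprocessing import Pool


def map_file(path):
    """Map step: count 'knicks' tweets per neighborhood in one file."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            # Assumed columns: id, neighborhood, borough, message.
            fields = line.rstrip("\n").split("\t", 3)
            if len(fields) == 4 and "knicks" in fields[3].lower():
                counts[fields[1]] += 1
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-file counters into one."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


def run(paths, workers=4):
    # One task per file; the worker count is an arbitrary illustrative default.
    with Pool(workers) as pool:
        return reduce_counts(pool.map(map_file, paths))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python knicks.py FILE [FILE ...]
    print(run(sys.argv[1:]).most_common(10))
```

Processes rather than threads are used here because CPython's Global Interpreter Lock would otherwise keep the matching on a single core.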

The Languages

  1. Ruby 2.2.2
  2. Golang 1.4.2 - Imperative
  3. Scala 2.11.4 - Both Imperative and Functional
  4. Elixir 1.0.4 - Functional
  5. Python 3

Scala

  • Uses Akka (Supervisors and Actors)

Results

Implementation                                                  Time
Ruby w/ Celluloid (Global Interpreter Lock Bound, single core)  43.7s
JRuby w/ Celluloid                                              15.8s
Ruby w/ grosser/parallel (not GNU Parallel)                     10.9s
Python w/ Pool                                                  11.7s
Elixir                                                          21.8s
Scala                                                           8.8s
Scala w/ Substring (Skipped regex for performance analysis)     8.3s
Golang                                                          32.8s
Golang w/ Substring (Skipped regex for performance analysis)    7.8s
Node w/ Cluster                                                 TODO
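The "w/ Substring" rows replace the regex match with a plain substring check. The difference between the two strategies can be illustrated in Python with the sketch below. It is not taken from the repo, the messages are synthetic, and the timings it prints depend on the machine and say nothing about the Scala or Golang numbers above.

```python
import re
import timeit

PATTERN = re.compile(r"knicks", re.IGNORECASE)


def matches_regex(message):
    """Regex strategy: case-insensitive search with a compiled pattern."""
    return PATTERN.search(message) is not None


def matches_substring(message):
    """Substring strategy: lowercase the message, then use the `in` operator."""
    return "knicks" in message.lower()


# Synthetic input: two messages from the sample data plus one invented match.
messages = ["Say anything", "Go Knicks!", "Uhhh"] * 10_000

if __name__ == "__main__":
    for fn in (matches_regex, matches_substring):
        seconds = timeit.timeit(lambda: sum(map(fn, messages)), number=5)
        print(f"{fn.__name__}: {seconds:.3f}s")
```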

Languages

  • Scala 26.1%
  • Elixir 20.8%
  • Go 17.9%
  • Ruby 14.6%
  • Python 10.4%
  • Nim 5.7%
  • Shell 4.5%