Added streaming and an example in Python and Ruby

greglu · greglu · commit d9c0497f1ba6 · 2011-06-22T19:13:31.000-04:00
diff --git a/README.md b/README.md
@@ -13,6 +13,14 @@ Prerequisites
 * Git
 
 
+Datasets
+--------
+
+https://github.com/hoppertravel/HackReduce/wiki/Datasets
+
+Take a look at the datasets/ folder to see sample subsets of these datasets.
+
+
 Run an example job locally
 --------------------------
 
@@ -80,12 +88,16 @@ Run any of the following commands in your CLI, and after the job's completed, ch
 Note: The jobs are made for the specific datasets, so pairing them up properly is important. The second argument (/tmp/*) is just a made up output path for the results of the job, and can be modified to anything you want.
 
 
-Datasets
---------
+Streaming example
+-----------------
 
-https://github.com/hoppertravel/HackReduce/wiki/Datasets
+* Python
 
-Take a look at the datasets/ folder to see samples subsets of these datasets.
+    $ java -classpath ".:lib/*" org.apache.hadoop.streaming.HadoopStreaming -input datasets/nasdaq/daily_prices/ -output /tmp/py_streaming_count -mapper streaming/nasdaq_counter.py -reducer aggregate
+
+* Ruby
+
+    $ java -classpath ".:lib/*" org.apache.hadoop.streaming.HadoopStreaming -input datasets/nasdaq/daily_prices/ -output /tmp/rb_streaming_count -mapper streaming/nasdaq_counter.rb -reducer aggregate
 
 
 Running on a Hadoop cluster
diff --git a/lib/hadoop-0.20.2-streaming.jar b/lib/hadoop-0.20.2-streaming.jar
diff --git a/streaming/nasdaq_counter.py b/streaming/nasdaq_counter.py
@@ -0,0 +1,21 @@
+#!/usr/bin/python
+
+import sys
+
+
+def generateLongCountToken(key):
+    # The "LongValueSum:" prefix asks the aggregate reducer to sum the values per key
+    return "LongValueSum:" + key + "\t" + "1"
+
+
+def main():
+    # Iterating over stdin stops cleanly at end of input
+    for line in sys.stdin:
+        fields = line.rstrip("\n").split(",")
+        # Count only records whose first field is NASDAQ
+        if fields[0] == "NASDAQ":
+            print(generateLongCountToken(fields[0]))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/streaming/nasdaq_counter.rb b/streaming/nasdaq_counter.rb
@@ -0,0 +1,10 @@
+#!/usr/bin/env ruby
+ 
+STDIN.each_line do |line|
+  fields = line.split(",")
+
+  # Emit one count token per record whose first field is NASDAQ
+  if fields[0] == "NASDAQ"
+    puts "LongValueSum:#{fields[0]}\t1"
+  end
+end
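
Note on the streaming jobs above: both mappers emit lines of the form `LongValueSum:NASDAQ<TAB>1`, and `-reducer aggregate` sums those values per key. The sketch below is a rough local stand-in for that pipeline, useful for checking a mapper's output without starting Hadoop. It is not Hadoop's implementation; the helper names (`map_line`, `sum_long_values`, `count_records`) and the sample rows are hypothetical and not part of this commit.

```python
from collections import defaultdict

PREFIX = "LongValueSum:"


def map_line(line):
    """Mirror the streaming mappers: emit a count token for NASDAQ records."""
    fields = line.rstrip("\n").split(",")
    if fields[0] == "NASDAQ":
        return PREFIX + fields[0] + "\t1"
    return None


def sum_long_values(mapper_lines):
    """Local approximation of summing LongValueSum tokens per key."""
    totals = defaultdict(int)
    for line in mapper_lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key.startswith(PREFIX):
            totals[key[len(PREFIX):]] += int(value)
    return dict(totals)


def count_records(lines):
    """Run the mapper over every line, then sum the emitted tokens."""
    tokens = [t for t in map(map_line, lines) if t is not None]
    return sum_long_values(tokens)


if __name__ == "__main__":
    # Hypothetical sample rows, not taken from the real dataset
    sample = [
        "exchange,stock_symbol,date\n",  # header-like row, ignored by the mapper
        "NASDAQ,AAAA,2010-01-04\n",
        "NASDAQ,BBBB,2010-01-04\n",
    ]
    print(count_records(sample))
```

With the hypothetical sample, two rows pass the filter, so the result maps `NASDAQ` to 2. The real job writes its totals to the `-output` directory instead of printing them.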