Commit d9c0497

Added streaming and an example in Python and Ruby
1 parent 4ac4035 commit d9c0497

4 files changed, +47 -4 lines changed

README.md

+16 -4
@@ -13,6 +13,14 @@ Prerequisites
 * Git
 
 
+Datasets
+--------
+
+https://github.com/hoppertravel/HackReduce/wiki/Datasets
+
+Take a look at the datasets/ folder to see samples subsets of these datasets.
+
+
 Run an example job locally
 --------------------------
 
@@ -80,12 +88,16 @@ Run any of the following commands in your CLI, and after the job's completed, ch
 Note: The jobs are made for the specific datasets, so pairing them up properly is important. The second argument (/tmp/*) is just a made up output path for the results of the job, and can be modified to anything you want.
 
 
-Datasets
---------
+Streaming example
+-----------------
 
-https://github.com/hoppertravel/HackReduce/wiki/Datasets
+* Python
 
-Take a look at the datasets/ folder to see samples subsets of these datasets.
+$ java -classpath ".:lib/*" org.apache.hadoop.streaming.HadoopStreaming -input datasets/nasdaq/daily_prices/ -output /tmp/py_streaming_count -mapper streaming/nasdaq_counter.py -reducer aggregate
+
+* Ruby
+
+$ java -classpath ".:lib/*" org.apache.hadoop.streaming.HadoopStreaming -input datasets/nasdaq/daily_prices/ -output /tmp/rb_streaming_count -mapper streaming/nasdaq_counter.rb -reducer aggregate
 
 
 Running on a Hadoop cluster

lib/hadoop-0.20.2-streaming.jar

63 KB
Binary file not shown.

streaming/nasdaq_counter.py

+21
@@ -0,0 +1,21 @@
+#!/usr/bin/python
+
+import sys;
+
+def generateLongCountToken(id):
+    return "LongValueSum:" + id + "\t" + "1"
+
+def main(argv):
+    line = sys.stdin.readline();
+    try:
+        while line:
+            line = line[:-1];
+            fields = line.split(",");
+            # Anything starting with NASDAQ is a valid record
+            if fields[0] == "NASDAQ":
+                print generateLongCountToken(fields[0]);
+            line = sys.stdin.readline();
+    except "end of file":
+        return None
+if __name__ == "__main__":
+    main(sys.argv)
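
The committed mapper targets Python 2: it uses the `print` statement and an `except` clause with a string, neither of which is accepted by Python 3. The following is a minimal sketch of the same logic for Python 3, not part of this commit. The names `long_count_token` and `map_lines` are illustrative, and the sketch assumes the input is comma-separated with the exchange name in the first field, as the original code does.

```python
#!/usr/bin/env python3
import sys


def long_count_token(key):
    """Format one record for the streaming `aggregate` reducer: add 1 to `key`."""
    return "LongValueSum:" + key + "\t1"


def map_lines(lines):
    """Yield one count token per input line whose first CSV field is NASDAQ."""
    for line in lines:
        # Strip only the trailing newline, then split on commas like the original.
        fields = line.rstrip("\n").split(",")
        if fields[0] == "NASDAQ":
            yield long_count_token(fields[0])


if __name__ == "__main__":
    # Iterating over sys.stdin stops at end of input, so no explicit EOF handling is needed.
    for token in map_lines(sys.stdin):
        print(token)
```

Saved as an executable script, it could presumably be passed to `-mapper` in the same way as `streaming/nasdaq_counter.py` in the README command.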

streaming/nasdaq_counter.rb

+10
@@ -0,0 +1,10 @@
+#!/usr/bin/env ruby
+
+STDIN.each_line do |line|
+  word_count = {}
+  fields = line.split(",")
+
+  if fields[0] == "NASDAQ"
+    puts "LongValueSum:#{fields[0]}\t1"
+  end
+end
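
Both mappers emit lines of the form `LongValueSum:<key>\t1` and leave the summing to `-reducer aggregate`. To check mapper output locally without Hadoop, a rough stand-in for that summing step can be sketched in Python as below. This is an approximation written for illustration, not Hadoop's implementation: the function name `aggregate_long_value_sum` is hypothetical, and the sketch assumes the reducer adds up the integer values per key and reports the key without the `LongValueSum:` prefix.

```python
from collections import defaultdict

# Prefix used by both mappers in this commit.
PREFIX = "LongValueSum:"


def aggregate_long_value_sum(mapper_lines):
    """Sum the integer values of `LongValueSum:<key>\\t<value>` lines per key."""
    totals = defaultdict(int)
    for line in mapper_lines:
        key, _, value = line.rstrip("\n").partition("\t")
        # Lines without the prefix are ignored in this sketch.
        if key.startswith(PREFIX):
            totals[key[len(PREFIX):]] += int(value)
    return dict(totals)
```

For example, piping a sample file through either mapper and feeding the resulting lines to `aggregate_long_value_sum` should give a single `NASDAQ` entry whose value is the number of matching records.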
