add-indices.txt
# ./pyspark
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.4.0
/_/
Using Python version 2.6.9 (unknown, Sep 9 2014 15:05:12)
SparkContext available as sc, SQLContext available as sqlContext.
>>> a = [('g1', 2), ('g2', 4), ('g3', 3), ('g4', 8)]
>>> a
[('g1', 2), ('g2', 4), ('g3', 3), ('g4', 8)]
>>> rdd = sc.parallelize(a)
>>> rdd.collect()
[('g1', 2), ('g2', 4), ('g3', 3), ('g4', 8)]
>>> sorted = rdd.sortByKey()
>>> sorted.collect()
[('g1', 2), ('g2', 4), ('g3', 3), ('g4', 8)]
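sortByKey() orders the pairs by their first element, the key; since the keys 'g1'..'g4' are already in order here, the output is unchanged. A minimal local sketch of the same behaviour, using the standard library rather than Spark:

```python
from operator import itemgetter

# Same data as the transcript; sorting by the first element of each
# pair mirrors what rdd.sortByKey() does across partitions.
pairs = [('g1', 2), ('g2', 4), ('g3', 3), ('g4', 8)]
by_key = sorted(pairs, key=itemgetter(0))
print(by_key)  # [('g1', 2), ('g2', 4), ('g3', 3), ('g4', 8)]
```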
>>> rdd2 = rdd.map(lambda (x,y) : (y,x))
>>> rdd2.collect()
[(2, 'g1'), (4, 'g2'), (3, 'g3'), (8, 'g4')]
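Note that the tuple-parameter form `lambda (x,y): (y,x)` works only in Python 2, which this session runs; Python 3 removed tuple parameter unpacking (PEP 3113). A Python 3 equivalent of the same key/value swap, sketched on a plain list:

```python
pairs = [('g1', 2), ('g2', 4), ('g3', 3), ('g4', 8)]

# Python 3 spelling of rdd.map(lambda (x, y): (y, x)):
# index the single tuple argument instead of unpacking it.
swapped = list(map(lambda kv: (kv[1], kv[0]), pairs))
print(swapped)  # [(2, 'g1'), (4, 'g2'), (3, 'g3'), (8, 'g4')]
```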
>>> sorted = rdd2.sortByKey()
>>> sorted.collect()
[(2, 'g1'), (3, 'g3'), (4, 'g2'), (8, 'g4')]
>>> sorted = rdd2.sortByKey(False)
>>> sorted.collect()
[(8, 'g4'), (4, 'g2'), (3, 'g3'), (2, 'g1')]
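Passing False as the first argument (ascending) to sortByKey() sorts in descending key order. The local equivalent, as a sketch, is the reverse flag of the built-in sort:

```python
# After the swap the keys are the integer values; reverse=True gives
# the descending order that sortByKey(False) produces.
pairs = [(2, 'g1'), (4, 'g2'), (3, 'g3'), (8, 'g4')]
desc = sorted(pairs, key=lambda kv: kv[0], reverse=True)
print(desc)  # [(8, 'g4'), (4, 'g2'), (3, 'g3'), (2, 'g1')]
```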
>>> sorted = rdd2.sortByKey()
>>> sorted.collect()
[(2, 'g1'), (3, 'g3'), (4, 'g2'), (8, 'g4')]
>>> indices = sorted.zipWithIndex()
>>> indices.collect()
[((2, 'g1'), 0), ((3, 'g3'), 1), ((4, 'g2'), 2), ((8, 'g4'), 3)]
>>>
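zipWithIndex() pairs each element with its ordinal position in the RDD, so after the sort each pair carries its rank. As a local sketch, enumerate() produces the same (element, index) structure:

```python
# Sorted (value, name) pairs, as produced by sortByKey() above.
ranked = [(2, 'g1'), (3, 'g3'), (4, 'g2'), (8, 'g4')]

# enumerate() yields (index, element); flip it to match the
# (element, index) shape that zipWithIndex() returns.
indexed = [(pair, i) for i, pair in enumerate(ranked)]
print(indexed)
# [((2, 'g1'), 0), ((3, 'g3'), 1), ((4, 'g2'), 2), ((8, 'g4'), 3)]
```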