@@ -6,9 +6,176 @@ Tensor Parser
6
6
A package for constructing sparse tensors from CSV-like data sources.
7
7
8
8
9
+ ## CSV Files
10
+ We support CSV files stored in text, gzip (` .gz ` ), or bzip2 (` .bz2 ` ) formats.
11
+ By default, we attempt to auto-detect the header and delimiter of the CSV file
12
+ via Python's supplied CSV parsing library. The ` --query ` option will query the
13
+ detected CSV metadata and print to ` STDOUT ` :
14
+
15
+ $ ./build/build_tensor.py traffic.csv.gz out.tns --query
16
+ Found delimiter: ","
17
+ Found fields:
18
+ ['Date Of Stop', 'Time Of Stop', 'Latitude', 'Longitude', 'Description']
19
+
20
+ Note that ` out.tns ` is not touched when querying a CSV file.
21
+
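The detection step can be approximated with `csv.Sniffer` from the standard library. This is only a sketch of the idea, not necessarily how `build_tensor.py` does it: the `query_csv` helper, the candidate delimiter set, and the inlined sample text are all assumptions made for illustration.

```python
import csv

# Hypothetical sample: the first lines of a CSV file, already read as text.
SAMPLE = (
    "Date Of Stop,Time Of Stop,Latitude,Longitude,Description\n"
    "01/01/2013,02:23:00,39.0584153167,-77.0480714833,DUI\n"
    "01/01/2013,01:45:00,38.9907737666667,-77.1545810833333,SPEEDING\n"
    "01/01/2013,05:15:00,39.288735,-77.20448,DWI\n"
)


def query_csv(sample, candidates=",;\t|"):
    """Guess the delimiter and field names of a CSV text sample."""
    sniffer = csv.Sniffer()
    # Restrict the guess to a few common delimiters (an assumption).
    dialect = sniffer.sniff(sample, delimiters=candidates)
    has_header = sniffer.has_header(sample)
    first_row = next(csv.reader(sample.splitlines(), dialect))
    if has_header:
        fields = first_row
    else:
        # No header found: fall back to 1-based column numbers.
        fields = [str(i + 1) for i in range(len(first_row))]
    return dialect.delimiter, fields


delimiter, fields = query_csv(SAMPLE)
print('Found delimiter: "{}"'.format(delimiter))
print("Found fields:")
print(fields)
```

`csv.Sniffer` is heuristic, which is why explicit overrides for the delimiter and header are useful when the guess is wrong.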
22
+ Any number of CSV files can be provided as input, so long as the fields used
23
+ to construct the sparse tensor are found in each file.
24
+
25
+ If no header is detected, a default of `['1', '2', ...]` is used.
26
+
27
+ If you wish to use something other than the detected delimiter or field names,
28
+ they can be modified with ` --field-sep= ` and ` --has-header=<yes,no> ` .
29
+
30
+
31
+ ## Tensor Files
32
+ Two types of files are created:
33
+ * Sparse tensor (` .tns ` ): the actual tensor data. Each line is a list of
34
+ indices and a value. For example, `1 1 1 1.38` would be one non-zero in a
35
+ third-order tensor.
36
+ * Mode mappings (` .map ` ): map the tensor indices to the original values in
37
+ the source data. Line ` i ` is the source data that was mapped to index ` i `
38
+ in the tensor.
39
+ For more information on file formats, see
40
+ [FROSTT](http://frostt.io/tensors/file-formats.html).
41
+
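As a rough illustration of how the two file types fit together, the sketch below parses one `.tns` line and looks up an index in the lines of a `.map` file. The helper names and the sample map contents are hypothetical, and indices are assumed to be 1-based.

```python
def parse_tns_line(line):
    """Split a .tns line into a tuple of integer indices and a float value."""
    *indices, value = line.split()
    return tuple(int(i) for i in indices), float(value)


def lookup(map_lines, index):
    """Return the source value for a 1-based tensor index (assumed 1-based)."""
    return map_lines[index - 1]


indices, value = parse_tns_line("1 1 1 1.38")
print(len(indices), indices, value)  # order of the tensor, indices, value

# Hypothetical contents of a .map file, one source value per line.
description_map = ["dui", "dwi", "speeding"]
print(lookup(description_map, 3))
```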
42
+
43
+ ## Tensor Construction
44
+ ### Mode selection
45
+ Columns of the CSV file (referred to as "fields") are selected using the
46
+ ` --field= ` flag. If the CSV file has a header, the supplied parameter must
47
+ match a field in the header (but is **not** case sensitive). If the field has
48
+ spaces in the name, simply enclose it in quotes: `--field="time of day"`.
49
+
50
+ ### Tensor values
51
+ A field of the CSV file can be selected to be used as the values of the tensor
52
+ using the ` --vals= ` flag. If no field is selected to use as the tensor values,
53
+ ` 1 ` is used and the resulting tensor is one of count data.
54
+
55
+
9
56
## Mode Types
57
+ A critical step when constructing a sparse tensor is to select the datatype of
58
+ the CSV columns. When the CSV is parsed, the fields are read and sorted as
59
+ strings. Thus, values with the same string representation are mapped to the
60
+ same index in the tensor mode. In practice, however, columns often should be
61
+ treated as integers, floats, dates, or other types.
62
+
63
+ In addition to affecting the ordering of the resulting indices, the type of a
64
+ column affects the mapping of CSV entries to unique indices. For example, one
65
+ may wish to round floats such that ` 1.38 ` and ` 1.38111 ` map to the same value.
66
+
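To make the effect of the type concrete, here is a minimal sketch of how a column could be mapped to sorted, 1-based indices under different types. `build_mode_map` and `roundf` are hypothetical helpers written for this illustration; they are not the package's API.

```python
def build_mode_map(raw_values, type_fn=str):
    """Map raw CSV strings to 1-based indices ordered by their typed value.

    Returns (indices, keys): the index of each input entry, and the sorted
    unique typed values (what a mode mapping would list, one per line).
    """
    keys = sorted({type_fn(v) for v in raw_values})
    index_of = {key: i + 1 for i, key in enumerate(keys)}
    return [index_of[type_fn(v)] for v in raw_values], keys


def roundf(places):
    """Hypothetical stand-in for a rounding type: parse, then round."""
    return lambda text: round(float(text), places)


column = ["9", "10", "1.38", "1.38111"]

# As strings, "10" sorts before "9" and the two 1.38... entries stay distinct.
print(build_mode_map(column, str))
# As floats, the order is numeric but the two 1.38... entries still differ.
print(build_mode_map(column, float))
# Rounded to two places, "1.38" and "1.38111" share one index.
print(build_mode_map(column, roundf(2)))
```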
67
+ We provide several types which can be specified with the ` --type= ` flag:
68
+ * ` str ` => String (default)
69
+ * ` int ` => Integer
70
+ * ` float ` => Floating-point number
71
+ * ` roundf-X ` => Floating-point numbers rounded to ` X ` decimal places
72
+ * ` date ` => A ` datetime ` object that encodes year, month, day, hour,
73
+ minute, second, and millisecond
74
+ * ` year ` => A year (integer extracted from ` date ` )
75
+ * ` month ` => A month (integer in range [ 0,11] extracted from ` date ` )
76
+ * ` day ` => A day (integer in range [ 0,30] extracted from ` date ` )
77
+ * `hour` => An hour (integer in range [0,23] extracted from `date`)
78
+ * `min` => A minute (integer in range [0,60] extracted from `date`)
79
+ * `sec` => A second (integer in range [0,60] extracted from `date`)
80
+
81
+ Smart date matching is provided by the
82
+ [dateutil](https://pypi.python.org/pypi/python-dateutil) package. For example,
83
+ ` "Aug 20" ` and ` "08/20/92" ` will map to the same index if the type is either
84
+ ` month ` or ` day ` . However, the package maps to the current year if none is
85
+ specified, and thus they will map to different indices if the type is ` year ` .
86
+
87
+ You can specify multiple fields in the same `--type` instance. For example,
88
+ `--type=userid,itemid,int` would treat both the `userid` and `itemid` fields
89
+ as integers.
90
+
91
+
92
+ ### Advanced mode types
93
+ A "type" in our context is any object which supports:
94
+ * construction: ` type("X") ` should return some representation of "X"
95
+ * comparison: ` __le__() ` is required to sort indices. If no comparison is
96
+ possible, be sure to disable sorting of the mode with ` --no-sort `
97
+ * printing: ` __str__() ` is required to construct ` .map ` files
98
+ The specification of a type is as simple as providing a function which maps a
99
+ string to some object. Conveniently, most builtin types already support this
100
+ interface via their constructors. Functions such as ` int() ` and ` float() `
101
+ work well.
102
+
103
+ Many types can be specified with a short anonymous function. If the specified
104
+ type is not found in the list of builtin types (above), then it is treated
105
+ as source code and specifies a custom type. For example,
106
+
107
+ `--type=cost,"lambda x : float(x) * 1.06"`
108
+
109
+ may be a method of scaling all costs by 6% to account for sales tax. Note that
110
+ all types should take a single parameter which will be a `str` object.
111
+
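A sketch of what such a type might look like is given below. The `Dollars` class is hypothetical, and turning the command-line string into a callable with `eval` is an assumption about one possible mechanism, not a statement of how the tool does it. Python's built-in `sorted` relies on `__lt__`, so `functools.total_ordering` is used here to derive `__le__` and the other comparisons from `__eq__` and `__lt__`.

```python
from functools import total_ordering


@total_ordering
class Dollars:
    """Hypothetical custom type: a price stored as whole cents."""

    def __init__(self, text):
        # Construction: accept the raw CSV string, e.g. "$3.50" or "3.5".
        self.cents = round(float(text.strip().lstrip("$")) * 100)

    def __eq__(self, other):
        return self.cents == other.cents

    def __lt__(self, other):
        # Comparison: total_ordering fills in __le__, __gt__, and __ge__.
        return self.cents < other.cents

    def __hash__(self):
        # Equal prices hash alike, so they can share one index.
        return hash(self.cents)

    def __str__(self):
        # Printing: the text that would be written to a .map file.
        return "{:.2f}".format(self.cents / 100)


prices = sorted({Dollars(s) for s in ["$3.50", "12", "3.5"]})
print([str(p) for p in prices])

# Assumed mechanism: the type string is evaluated as Python source code.
with_tax = eval('lambda x : float(x) * 1.06')
print(with_tax("100"))
```

Evaluating user-supplied source runs arbitrary code, so a type string should only come from a trusted command line.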
112
+
10
113
11
114
## Handling Duplicates
115
+ By default, duplicate non-zeros are merged and their values are summed.
116
+ This behavior can be changed with ` --merge= ` , which takes one of the following
117
+ options:
118
+
119
+ * ` none ` (do not remove duplicate non-zeros)
120
+ * ` sum `
121
+ * ` min `
122
+ * ` max `
123
+ * ` avg `
124
+ * ` count ` (use the number of duplicates)
125
+
126
+ Note that merging duplicates requires the tensor to be sorted. A disk-based
127
+ sort is provided by the ` csvsorter ` library.
128
+
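The merge options can be sketched in a few lines. This version sorts in memory with `sorted` and groups with `itertools.groupby`, whereas the text above describes a disk-based sort; the `merge_duplicates` function and the `(indices, value)` representation are assumptions made for illustration.

```python
from itertools import groupby

# One reducer per merge option; each takes the list of duplicate values.
MERGERS = {
    "sum": sum,
    "min": min,
    "max": max,
    "avg": lambda values: sum(values) / len(values),
    "count": len,
}


def merge_duplicates(nonzeros, how="sum"):
    """Merge non-zeros that share the same indices.

    nonzeros: list of (indices_tuple, value) pairs.
    how: "none", "sum", "min", "max", "avg", or "count".
    """
    if how == "none":
        return list(nonzeros)
    # Sorting places duplicates next to each other so groupby can find them.
    ordered = sorted(nonzeros, key=lambda nz: nz[0])
    merged = []
    for indices, group in groupby(ordered, key=lambda nz: nz[0]):
        values = [value for _, value in group]
        merged.append((indices, MERGERS[how](values)))
    return merged


entries = [((1, 2, 1), 2.0), ((1, 1, 1), 1.0), ((1, 2, 1), 4.0)]
for option in ["none", "sum", "min", "max", "avg", "count"]:
    print(option, merge_duplicates(entries, option))
```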
129
+
130
+ ## Example
131
+ Suppose you have the following CSV file:
132
+
133
+ $ zcat traffic.csv.gz
134
+ Date Of Stop,Time Of Stop,Latitude,Longitude,Description
135
+ 01/01/2013,02:23:00,39.0584153167,-77.0480714833,DUI
136
+ 01/01/2013,01:45:00,38.9907737666667,-77.1545810833333,SPEEDING
137
+ 01/01/2013,05:15:00,39.288735,-77.20448,DWI
138
+
139
+ We want to keep the dates, extract the hour of each violation, lower-case the
140
+ descriptions, and round the geolocations to three decimal places:
141
+
142
+ $ ./scripts/build_tensor.py traffic.csv.gz traffic.tns \
143
+ -f "date of stop" --type="date of stop",date \
144
+ -f "time of stop" --type="time of stop",hour \
145
+ -f latitude -f longitude --type=latitude,longitude,roundf-3 \
146
+ -f description --type=description,"lambda x : x.lower()"
147
+
148
+ The resulting tensor is built:
149
+
150
+ $ cat mode-1-dateofstop.map
151
+ 2013-01-01 00:00:00
152
+
153
+ $ cat mode-2-timeofstop.map
154
+ 1
155
+ 2
156
+ 5
157
+
158
+ $ cat mode-3-latitude.map
159
+ 38.991
160
+ 39.058
161
+ 39.289
162
+
163
+ $ cat mode-4-longitude.map
164
+ -77.204
165
+ -77.155
166
+ -77.048
167
+
168
+ $ cat mode-5-description.map
169
+ dui
170
+ dwi
171
+ speeding
172
+
173
+ $ cat traffic.tns
174
+ 1 1 1 2 3 1
175
+ 1 2 2 3 1 1
176
+ 1 3 3 1 2 1
177
+
178
+
12
179
13
180
## Testing
14
181
This project uses the builtin ` unittest ` library provided by Python. You can