Commit 4d17177

README.md documentation (#8)
1 parent 9169b9b commit 4d17177

2 files changed: +173 -12 lines changed

README.md

+167
@@ -6,9 +6,176 @@ Tensor Parser
A package for constructing sparse tensors from CSV-like data sources.


## CSV Files
We support CSV files stored in text, gzip (`.gz`), or bzip2 (`.bz2`) formats.
By default, we attempt to auto-detect the header and delimiter of the CSV file
via Python's supplied CSV parsing library. The `--query` option will query the
detected CSV metadata and print to `STDOUT`:

    $ ./scripts/build_tensor.py traffic.csv.gz out.tns --query
    Found delimiter: ","
    Found fields:
    ['Date Of Stop', 'Time Of Stop', 'Latitude', 'Longitude', 'Description']

Note that `out.tns` is not touched when querying a CSV file.

Any number of CSV files can be provided as input, so long as the fields used
to construct the sparse tensor are found in each file.

If no header is detected, a default of `['1', '2', ...]` is used.

If you wish to use something other than the detected delimiter or field names,
they can be modified with `--field-sep=` and `--has-header=<yes,no>`.

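The detection described above can be approximated with the standard library's `csv.Sniffer`. The following is a minimal sketch under that assumption, not the code used by `build_tensor.py`; the helper name `query_csv` is hypothetical, and the numbered fallback follows the default field names described above.

```python
import csv

def query_csv(sample):
    """Guess the delimiter and field names of a CSV text sample."""
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample)
    rows = list(csv.reader(sample.splitlines(), dialect))
    if sniffer.has_header(sample):
        fields = rows[0]
    else:
        # No header detected: fall back to 1-based column numbers as names.
        fields = [str(i + 1) for i in range(len(rows[0]))]
    return dialect.delimiter, fields

sample = "\n".join([
    "Date Of Stop,Time Of Stop,Latitude,Longitude,Description",
    "01/01/2013,02:23:00,39.0584153167,-77.0480714833,DUI",
    "01/01/2013,01:45:00,38.9907737666667,-77.1545810833333,SPEEDING",
    "01/01/2013,05:15:00,39.288735,-77.20448,DWI",
])
delimiter, fields = query_csv(sample)
print('Found delimiter: "{}"'.format(delimiter))
print("Found fields:")
print(fields)
```

`csv.Sniffer` works heuristically on a sample of the file, which is one reason explicit overrides such as `--field-sep=` and `--has-header=` are useful.
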
## Tensor Files
Two types of files are created:
* Sparse tensor (`.tns`): the actual tensor data. Each line is a list of
  indices and a value. For example, `1 1 1 1.38` would be one non-zero in a
  third-order tensor.
* Mode mappings (`.map`): map the tensor indices to the original values in
  the source data. Line `i` is the source data that was mapped to index `i`
  in the tensor.

For more information on file formats, see
[FROSTT](http://frostt.io/tensors/file-formats.html).

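To make the two formats concrete, here is a small sketch of reading them back. The helper names `parse_tns_line` and `lookup` are hypothetical, and the sketch assumes whitespace-separated fields and 1-based indices, as the examples in this README suggest.

```python
def parse_tns_line(line):
    """Split one `.tns` line into a tuple of indices and a value."""
    *indices, value = line.split()
    return tuple(int(i) for i in indices), float(value)

def lookup(map_lines, index):
    """Return the source value for a 1-based tensor index.

    `map_lines` holds the lines of a `.map` file, so line i (1-based)
    is the source data that was mapped to index i.
    """
    return map_lines[index - 1]

indices, value = parse_tns_line("1 1 1 1.38")
print(len(indices), indices, value)  # a third-order non-zero

description_map = ["dui", "dwi", "speeding"]
print(lookup(description_map, 3))
```
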
## Tensor Construction
### Mode selection
Columns of the CSV file (referred to as "fields") are selected using the
`--field=` flag. If the CSV file has a header, the supplied parameter must
match a field in the header (but is **not** case sensitive). If the field has
spaces in the name, simply enclose it in quotes: `--field="time of day"`.

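A case-insensitive header lookup of this kind could be sketched as follows. `find_field` is a hypothetical helper written for illustration; it is not taken from the package.

```python
def find_field(header, name):
    """Return the column position of `name` in `header`, ignoring case."""
    lowered = [field.lower() for field in header]
    try:
        return lowered.index(name.lower())
    except ValueError:
        raise KeyError("field {!r} not found in header".format(name))

header = ['Date Of Stop', 'Time Of Stop', 'Latitude', 'Longitude', 'Description']
print(find_field(header, "time of stop"))
print(find_field(header, "LATITUDE"))
```
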
### Tensor values
A field of the CSV file can be selected to be used as the values of the tensor
using the `--vals=` flag. If no field is selected to use as the tensor values,
`1` is used and the resulting tensor is one of count data.


## Mode Types
A critical step when constructing a sparse tensor is to select the datatype of
the CSV columns. When the CSV is parsed, the fields are read and sorted as
strings. Thus, values with the same string representation are mapped to the
same index in the tensor mode. In practice, however, columns often should be
treated as integers, floats, dates, or other types.

In addition to affecting the ordering of the resulting indices, the type of a
column affects the mapping of CSV entries to unique indices. For example, one
may wish to round floats such that `1.38` and `1.38111` map to the same value.

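The effect of the column type on both ordering and uniqueness can be illustrated with a short sketch. `build_mode_index` is a hypothetical helper, and the 1-based indexing is an assumption based on the examples in this README; the real tool may organize this differently.

```python
def build_mode_index(values, type_fn=str):
    """Convert raw CSV strings with `type_fn`, then assign sorted 1-based indices.

    Returns the index of every input value and the sorted unique keys.
    """
    converted = [type_fn(v) for v in values]
    ordered = sorted(set(converted))
    index_of = {key: i + 1 for i, key in enumerate(ordered)}
    return [index_of[c] for c in converted], ordered

raw = ["10", "9", "1.38", "1.38111"]

# As strings, "10" sorts before "9" and all four values are distinct.
print(build_mode_index(raw, str))
# As floats, the order is numeric but 1.38 and 1.38111 stay distinct.
print(build_mode_index(raw, float))
# Rounded to two places, 1.38 and 1.38111 share one index.
print(build_mode_index(raw, lambda x: round(float(x), 2)))
```
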
We provide several types which can be specified with the `--type=` flag:
* `str` => String (default)
* `int` => Integer
* `float` => Floating-point number
* `roundf-X` => Floating-point number rounded to `X` decimal places
* `date` => A `datetime` object that encodes year, month, day, hour,
  minute, second, and millisecond
* `year` => A year (integer extracted from `date`)
* `month` => A month (integer in range [0,11] extracted from `date`)
* `day` => A day (integer in range [0,30] extracted from `date`)
* `hour` => An hour (integer in range [0,23] extracted from `date`)
* `min` => A minute (integer in range [0,59] extracted from `date`)
* `sec` => A second (integer in range [0,59] extracted from `date`)

Smart date matching is provided by the
[dateutil](https://pypi.python.org/pypi/python-dateutil) package. For example,
`"Aug 20"` and `"08/20/92"` will map to the same index if the type is either
`month` or `day`. However, the package assumes the current year if none is
specified, and thus they will map to different indices if the type is `year`.

You can specify multiple fields in the same `--type` instance. For example,
`--type=userid,itemid,int` would treat the fields `userid` and `itemid` both
as integers.


### Advanced mode types
A "type" in our context is any object which supports:
* construction: `type("X")` should return some representation of "X"
* comparison: `__le__()` is required to sort indices. If no comparison is
  possible, be sure to disable sorting of the mode with `--no-sort`
* printing: `__str__()` is required to construct `.map` files

The specification of a type is as simple as providing a function which maps a
string to some object. Conveniently, most builtin types already support this
interface via their constructors. Functions such as `int()` and `float()`
work well.

Many types can be specified with a short anonymous function. If the specified
type is not found in the list of builtin types (above), then it is treated
as source code and specifies a custom type. For example,

`--type=cost,"lambda x : float(x) * 1.06"`

may be a method of scaling all costs by 6% to account for sales tax. Note that
all types should take a single parameter which will be a `str` object.

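As an illustration of the three requirements (construction from a string, comparison, printing), here is a sketch of a custom type. The `Weekday` class is hypothetical and not part of the package. It uses `functools.total_ordering` so that `__le__()` is derived from `__eq__()` and `__lt__()`, and it defines `__hash__()` on the assumption that unique values are collected in a set or dictionary.

```python
import functools

@functools.total_ordering
class Weekday:
    """Hypothetical custom type that orders day names Monday..Sunday."""
    _ORDER = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

    def __init__(self, text):
        # Construction from the raw CSV string.
        self.rank = self._ORDER.index(text.strip().lower()[:3])

    def __eq__(self, other):
        return self.rank == other.rank

    def __lt__(self, other):
        return self.rank < other.rank

    def __hash__(self):
        return hash(self.rank)

    def __str__(self):
        # Printing, as needed for a `.map` file.
        return self._ORDER[self.rank]

days = sorted({Weekday(s) for s in ["Friday", "monday", "FRI", "Wed"]})
print([str(d) for d in days])
```

Here `"Friday"` and `"FRI"` collapse to one entry, and the result is ordered by weekday rather than alphabetically.
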

## Handling Duplicates
By default, duplicate non-zeros are merged and their values are summed.
This behavior can be changed with `--merge=`, which takes one of the following
options:

* `none` (do not remove duplicate non-zeros)
* `sum`
* `min`
* `max`
* `avg`
* `count` (use the number of duplicates)

Note that merging duplicates requires the tensor to be sorted. A disk-based
sort is provided by the `csvsorter` library.

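The merge options can be sketched in a few lines. This is an in-memory illustration only: it sorts with Python's `sorted()` rather than the disk-based `csvsorter` sort mentioned above, and `merge_duplicates` and `MERGE_FUNCS` are hypothetical names.

```python
from itertools import groupby

# One reduction per `--merge=` option (other than `none`).
MERGE_FUNCS = {
    "sum": sum,
    "min": min,
    "max": max,
    "avg": lambda vals: sum(vals) / len(vals),
    "count": len,
}

def merge_duplicates(nonzeros, how="sum"):
    """Merge non-zeros that share coordinates.

    `nonzeros` is an iterable of (indices_tuple, value) pairs.
    """
    if how == "none":
        return list(nonzeros)
    # Sorting places duplicates next to each other so groupby can find them.
    ordered = sorted(nonzeros, key=lambda nz: nz[0])
    merged = []
    for coords, group in groupby(ordered, key=lambda nz: nz[0]):
        vals = [v for _, v in group]
        merged.append((coords, MERGE_FUNCS[how](vals)))
    return merged

entries = [((1, 2), 1.0), ((1, 1), 2.0), ((1, 2), 3.0)]
for how in ["none", "sum", "min", "max", "avg", "count"]:
    print(how, merge_duplicates(entries, how))
```
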

## Example
Suppose you have the following CSV file:

    $ zcat traffic.csv.gz
    Date Of Stop,Time Of Stop,Latitude,Longitude,Description
    01/01/2013,02:23:00,39.0584153167,-77.0480714833,DUI
    01/01/2013,01:45:00,38.9907737666667,-77.1545810833333,SPEEDING
    01/01/2013,05:15:00,39.288735,-77.20448,DWI

We want to keep the dates, extract the hour of each violation, lower-case the
descriptions, and round the geolocations to three decimal places:

    $ ./scripts/build_tensor.py traffic.csv.gz traffic.tns \
        -f "date of stop" --type="date of stop",date \
        -f "time of stop" --type="time of stop",hour \
        -f latitude -f longitude --type=latitude,longitude,roundf-3 \
        -f description --type=description,"lambda x : x.lower()"

The resulting tensor is built:

    $ cat mode-1-dateofstop.map
    2013-01-01 00:00:00

    $ cat mode-2-timeofstop.map
    1
    2
    5

    $ cat mode-3-latitude.map
    38.991
    39.058
    39.289

    $ cat mode-4-longitude.map
    -77.204
    -77.155
    -77.048

    $ cat mode-5-description.map
    dui
    dwi
    speeding

    $ cat traffic.tns
    1 1 1 2 3 1
    1 2 2 3 1 1
    1 3 3 1 2 1

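The index assignment in this example can be reproduced by hand. The sketch below is an illustration of the mapping, not the tool's implementation: it uses `datetime.strptime` with fixed formats in place of dateutil, keeps everything in memory, and assumes sorted 1-based indices and a value of `1` per row (count data).

```python
import csv
from datetime import datetime

CSV_TEXT = """\
Date Of Stop,Time Of Stop,Latitude,Longitude,Description
01/01/2013,02:23:00,39.0584153167,-77.0480714833,DUI
01/01/2013,01:45:00,38.9907737666667,-77.1545810833333,SPEEDING
01/01/2013,05:15:00,39.288735,-77.20448,DWI
"""

# (field name, type function) per mode; strptime stands in for dateutil here.
MODES = [
    ("Date Of Stop", lambda x: datetime.strptime(x, "%m/%d/%Y")),
    ("Time Of Stop", lambda x: datetime.strptime(x, "%H:%M:%S").hour),
    ("Latitude", lambda x: round(float(x), 3)),
    ("Longitude", lambda x: round(float(x), 3)),
    ("Description", lambda x: x.lower()),
]

rows = list(csv.DictReader(CSV_TEXT.splitlines()))
# Convert each selected column, then build one sorted map per mode.
columns = [[fn(row[name]) for row in rows] for name, fn in MODES]
maps = [sorted(set(col)) for col in columns]
lookups = [{key: i + 1 for i, key in enumerate(m)} for m in maps]

# Count rows per coordinate (the default when no value field is chosen).
counts = {}
for r in range(len(rows)):
    coords = tuple(lookups[m][columns[m][r]] for m in range(len(MODES)))
    counts[coords] = counts.get(coords, 0) + 1

tns_lines = [" ".join(map(str, coords + (val,)))
             for coords, val in sorted(counts.items())]
print("\n".join(tns_lines))
```

The printed lines match the `traffic.tns` listing above, and `maps` holds the contents of the five `.map` files in order.
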

## Testing
This project uses the builtin `unittest` library provided by Python. You can

scripts/build_tensor.py

+6 -12
@@ -147,9 +147,8 @@ def parse_args(cmd_args=None):
     parser.add_argument('--has-header', choices=['yes', 'no'],
         help='Indicate whether CSV has a header row (default: auto)')
 
-    parser.add_argument('-q', '--query', action='append',
-        choices=['field-sep', 'header'],
-        help='query a component of the CSV file and exit')
+    parser.add_argument('-q', '--query', action='store_true',
+        help='query metadata of the CSV file and exit')
 
     parser.add_argument('--merge', type=str, default='sum',
         choices=['none', 'sum', 'min', 'max', 'avg', 'count'],
@@ -172,17 +171,12 @@ def parse_args(cmd_args=None):
     else:
         args.has_header = False
 
-    #
-    # Check for file query.no_sort
-    #
+    # Check for file query
     if args.query:
         parser = csv_parser(args.csv[0])
-        for query in args.query:
-            if query == 'field-sep':
-                print('Found delimiter: "{}"'.format(parser.get_delimiter()))
-            if query == 'header':
-                print('Found fields:')
-                pprint.pprint(parser.get_header())
+        print('Found delimiter: "{}"'.format(parser.get_delimiter()))
+        print('Found fields:')
+        pprint.pprint(parser.get_header())
         sys.exit(0)
 
 
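The effect of switching `--query` from `action='append'` to `action='store_true'` can be seen in isolation. This is a minimal sketch with two standalone parsers; `old_parser` and `new_parser` are illustrative names, and only the `--query` argument definitions are copied from the diff.

```python
import argparse

# Before: each -q took a value and the values were collected in a list.
old_parser = argparse.ArgumentParser()
old_parser.add_argument('-q', '--query', action='append',
    choices=['field-sep', 'header'],
    help='query a component of the CSV file and exit')

# After: -q is a plain boolean switch.
new_parser = argparse.ArgumentParser()
new_parser.add_argument('-q', '--query', action='store_true',
    help='query metadata of the CSV file and exit')

old_args = old_parser.parse_args(['-q', 'header', '-q', 'field-sep'])
new_args = new_parser.parse_args(['--query'])
default_args = new_parser.parse_args([])

print(old_args.query)      # ['header', 'field-sep']
print(new_args.query)      # True
print(default_args.query)  # False
```

With the boolean form, `if args.query:` no longer needs to loop over requested components, which is why the second hunk prints the delimiter and the fields unconditionally.
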
