README.md documentation (#8)

ShadenSmith · web-flow · commit 4d17177a323d · 2017-12-17T13:25:56.000-06:00
diff --git a/README.md b/README.md
@@ -6,9 +6,176 @@ Tensor Parser
 A package for constructing sparse tensors from CSV-like data sources.
 
 
+## CSV Files
+We support CSV files stored in text, gzip (`.gz`), or bzip2 (`.bz2`) formats.
+By default, we attempt to auto-detect the header and delimiter of the CSV file
+via Python's standard CSV parsing library. The `--query` option will query the
+detected CSV metadata and print it to `STDOUT`:
+
+    $ ./scripts/build_tensor.py traffic.csv.gz out.tns --query
+    Found delimiter: ","
+    Found fields:
+    ['Date Of Stop', 'Time Of Stop', 'Latitude', 'Longitude', 'Description']
+
+Note that `out.tns` is not touched when querying a CSV file.
+
+Any number of CSV files can be provided as input, so long as the fields used
+to construct the sparse tensor are found in each file.
+
+If no header is detected, a default of `['1', '2', ...]` is used.
+
+If you wish to use something other than the detected delimiter or field names,
+they can be modified with `--field-sep=` and `--has-header=<yes,no>`.
+
+
+## Tensor Files
+Two types of files are created:
+  * Sparse tensor (`.tns`): the actual tensor data. Each line is a list of
+    indices and a value. For example, `1 1 1 1.38` would be one non-zero in a
+    third-order tensor.
+  * Mode mappings (`.map`): map the tensor indices to the original values in
+    the source data. Line `i` is the source data that was mapped to index `i`
+    in the tensor.
+For more information on file formats, see
+[FROSTT](http://frostt.io/tensors/file-formats.html).
+
+
+## Tensor Construction
+### Mode selection
+Columns of the CSV file (referred to as "fields") are selected using the
+`--field=` flag. If the CSV file has a header, the supplied parameter must
+match a field in the header (but is **not** case sensitive). If the field has
+spaces in the name, simply enclose it in quotes: `--field="time of day"`.
+
+### Tensor values
+A field of the CSV file can be selected to be used as the values of the tensor
+using the `--vals=` flag. If no field is selected to use as the tensor values,
+`1` is used and the resulting tensor is one of count data.
+
+
 ## Mode Types
+A critical step when constructing a sparse tensor is to select the datatype of
+the CSV columns. When the CSV is parsed, the fields are read and sorted as
+strings. Thus, values with the same string representation are mapped to the
+same index in the tensor mode. In practice, however, columns often should be
+treated as integers, floats, dates, or other types.
+
+In addition to affecting the ordering of the resulting indices, the type of a
+column affects the mapping of CSV entries to unique indices. For example, one
+may wish to round floats such that `1.38` and `1.38111` map to the same value.
+
+We provide several types which can be specified with the `--type=` flag:
+  * `str` => String (default)
+  * `int` => Integer
+  * `float` => Floating-point number
+  * `roundf-X` => Floating-point numbers rounded to `X` decimal places
+  * `date` => A `datetime` object that encodes year, month, day, hour,
+    minute, second, and millisecond
+  * `year` => A year (integer extracted from `date`)
+  * `month` => A month (integer in range [0,11] extracted from `date`)
+  * `day` => A day (integer in range [0,30] extracted from `date`)
+  * `hour` => An hour (integer in range [0,23] extracted from `date`)
+  * `min` => A minute (integer in range [0,59] extracted from `date`)
+  * `sec` => A second (integer in range [0,59] extracted from `date`)
+
+Smart date matching is provided by the
+[dateutil](https://pypi.python.org/pypi/python-dateutil) package. For example,
+`"Aug 20"` and `"08/20/92"` will map to the same index if the type is either
+`month` or `day`. However, the package maps to the current year if none is
+specified, and thus they will map to different indices if the type is `year`.
+
+You can specify multiple fields in the same `--type` instance. For example:
+`--type=userid,itemid,int` would treat the fields `userid` and `itemid` both
+as integers.
+
+
+### Advanced mode types
+A "type" in our context is any object which supports:
+  * construction: `type("X")` should return some representation of "X"
+  * comparison: `__le__()` is required to sort indices. If no comparison is
+    possible, be sure to disable sorting of the mode with `--no-sort`
+  * printing: `__str__()` is required to construct `.map` files
+The specification of a type is as simple as providing a function which maps a
+string to some object. Conveniently, most builtin types already support this
+interface via their constructors. Functions such as `int()` and `float()`
+work well.
+
+Many types can be specified with a short anonymous function. If the specified
+type is not found in the list of builtin types (above), then it is treated
+as source code and specifies a custom type. For example,
+
+    --type=cost,"lambda x : float(x) * 1.06"
+
+may be a method of scaling all costs by 6% to account for sales tax. Note that
+all types should take a single parameter which will be a `str` object.
+
+
 
 ## Handling Duplicates
+By default, duplicate non-zeros are merged and their values are summed.
+This behavior can be changed with `--merge=`, which takes one of the following
+options:
+
+  * `none` (do not remove duplicate non-zeros)
+  * `sum`
+  * `min`
+  * `max`
+  * `avg`
+  * `count` (use the number of duplicates)
+
+Note that merging duplicates requires the tensor to be sorted. A disk-based
+sort is provided by the `csvsorter` library.
+
+
+## Example
+Suppose you have the following CSV file:
+
+    $ zcat traffic.csv.gz
+    Date Of Stop,Time Of Stop,Latitude,Longitude,Description
+    01/01/2013,02:23:00,39.0584153167,-77.0480714833,DUI
+    01/01/2013,01:45:00,38.9907737666667,-77.1545810833333,SPEEDING
+    01/01/2013,05:15:00,39.288735,-77.20448,DWI
+
+We want to keep the dates, extract the hour of each violation, lower-case the
+descriptions, and round the geolocations to three decimal places:
+
+    $ ./scripts/build_tensor.py traffic.csv.gz traffic.tns \
+        -f "date of stop" --type="date of stop",date \
+        -f "time of stop" --type="time of stop",hour \
+        -f latitude -f longitude --type=latitude,longitude,roundf-3 \
+        -f description --type=description,"lambda x : x.lower()"
+
+The resulting tensor is built:
+
+    $ cat mode-1-dateofstop.map
+    2013-01-01 00:00:00
+
+    $ cat mode-2-timeofstop.map
+    1
+    2
+    5
+
+    $ cat mode-3-latitude.map
+    38.991
+    39.058
+    39.289
+
+    $ cat mode-4-longitude.map
+    -77.204
+    -77.155
+    -77.048
+
+    $ cat mode-5-description.map
+    dui
+    dwi
+    speeding
+
+    $ cat traffic.tns
+    1 1 1 2 3 1
+    1 2 2 3 1 1
+    1 3 3 1 2 1
+
+
 
 ## Testing
 This project uses the builtin `unittest` library provided by Python. You can
diff --git a/scripts/build_tensor.py b/scripts/build_tensor.py
@@ -147,9 +147,8 @@ def parse_args(cmd_args=None):
   parser.add_argument('--has-header', choices=['yes', 'no'],
       help='Indicate whether CSV has a header row (default: auto)')
 
-  parser.add_argument('-q', '--query', action='append',
-      choices=['field-sep', 'header'],
-      help='query a component of the CSV file and exit')
+  parser.add_argument('-q', '--query', action='store_true',
+      help='query metadata of the CSV file and exit')
 
   parser.add_argument('--merge', type=str, default='sum',
       choices=['none', 'sum', 'min', 'max', 'avg', 'count'],
@@ -172,17 +171,12 @@ def parse_args(cmd_args=None):
     else:
       args.has_header = False
 
-  #
-  # Check for file query.no_sort
-  #
+  # Check for file query
   if args.query:
     parser = csv_parser(args.csv[0])
-    for query in args.query:
-      if query == 'field-sep':
-        print('Found delimiter: "{}"'.format(parser.get_delimiter()))
-      if query == 'header':
-        print('Found fields:')
-        pprint.pprint(parser.get_header())
+    print('Found delimiter: "{}"'.format(parser.get_delimiter()))
+    print('Found fields:')
+    pprint.pprint(parser.get_header())
     sys.exit(0)
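For readers of the README changes above, the idea behind mode types and duplicate merging can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in `scripts/build_tensor.py`: the helpers `build_mode_map` and `merge_sum` are hypothetical names, `roundf-3` is approximated with `round(float(x), 3)`, and indices are assumed to be 1-based as in the README example.

```python
from collections import defaultdict


def build_mode_map(raw_values, type_fn=str):
    """Hypothetical helper: apply a 'type' to CSV strings and index the results.

    Returns (indices, ordered), where ordered[i - 1] is the typed value
    that was mapped to 1-based index i (the contents of a .map file).
    """
    typed = [type_fn(v) for v in raw_values]
    ordered = sorted(set(typed))  # requires the type to support comparison
    index_of = {value: i + 1 for i, value in enumerate(ordered)}
    return [index_of[v] for v in typed], ordered


def merge_sum(nonzeros):
    """Hypothetical helper: sum the values of non-zeros sharing coordinates."""
    totals = defaultdict(float)
    for coords, val in nonzeros:
        totals[tuple(coords)] += val
    return sorted(totals.items())


# Columns taken from the README's traffic.csv.gz example.
longitudes = ["-77.0480714833", "-77.1545810833333", "-77.20448"]
descriptions = ["DUI", "SPEEDING", "DWI"]

lon_idx, lon_map = build_mode_map(longitudes, lambda x: round(float(x), 3))
desc_idx, desc_map = build_mode_map(descriptions, lambda x: x.lower())

print(lon_map, lon_idx)
print(desc_map, desc_idx)
print(merge_sum([((1, 1), 1.0), ((1, 1), 1.0), ((2, 1), 1.0)]))
```

Under these assumptions the two maps come out in the same order as `mode-4-longitude.map` and `mode-5-description.map` in the README example, and the two coincident non-zeros collapse into a single entry with value `2.0`.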