Commit 3048746

add csv_in_depth.md (#127)
1 parent 0d53074 commit 3048746

1 file changed

+81
-0
lines changed

doc/csv_in_depth.md

@@ -0,0 +1,81 @@
# CSV Schema in Depth

The CSV (comma-separated values) schema covers any delimiter-based flat file format, not just
comma-delimited ones. A complete `"omni.2.1"` CSV schema has 3 parts: `parser_settings`,
`file_declaration`, and `transform_declarations`. We've covered `parser_settings` in
[Getting Started](./gettingstarted.md) and `transform_declarations` in depth in
[All About Transforms](./transforms.md). We've also covered some parts of the CSV `file_declaration`
in [Getting Started](./gettingstarted.md); we'll go into more detail about it here.

## CSV `file_declaration`

Full `file_declaration` schema looks as follows:
```
"file_declaration": {
    "delimiter": "<delimiter>",                     <= required
    "replace_double_quotes": true/false,            <= optional
    "header_row_index": integer >= 1,               <= optional
    "data_row_index": integer >= 1,                 <= required
    "columns": [                                    <= required, must not be empty array.
        {
            "name": "<column name>",                <= required
            "alias": "<alias name>"                 <= optional
        },
        ...
    ]
}
```
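
To make the required/optional rules concrete, here is a minimal Go sketch that decodes a sample `file_declaration` and checks it against the constraints listed in the schema. The `FileDecl` and `Column` types, the `parseFileDecl` helper, and the sample values are all hypothetical, made up for illustration; they are not omniparser's own types or validation code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Column and FileDecl are hypothetical types for this sketch. They mirror the
// schema fields but are not omniparser's own types.
type Column struct {
	Name  string  `json:"name"`
	Alias *string `json:"alias,omitempty"` // optional
}

type FileDecl struct {
	Delimiter           string   `json:"delimiter"`
	ReplaceDoubleQuotes bool     `json:"replace_double_quotes"` // optional
	HeaderRowIndex      *int     `json:"header_row_index"`      // optional, nil when absent
	DataRowIndex        *int     `json:"data_row_index"`
	Columns             []Column `json:"columns"`
}

// parseFileDecl decodes raw JSON and applies the constraints from the schema:
// required fields must be present, row indexes are 1-based, columns is non-empty.
func parseFileDecl(raw string) (*FileDecl, error) {
	var wrapper struct {
		FileDecl *FileDecl `json:"file_declaration"`
	}
	if err := json.Unmarshal([]byte(raw), &wrapper); err != nil {
		return nil, err
	}
	d := wrapper.FileDecl
	switch {
	case d == nil:
		return nil, errors.New("file_declaration is missing")
	case d.Delimiter == "":
		return nil, errors.New("delimiter is required")
	case d.DataRowIndex == nil || *d.DataRowIndex < 1:
		return nil, errors.New("data_row_index is required and must be >= 1")
	case d.HeaderRowIndex != nil && *d.HeaderRowIndex < 1:
		return nil, errors.New("header_row_index must be >= 1 when specified")
	case len(d.Columns) == 0:
		return nil, errors.New("columns must not be empty")
	}
	for i, c := range d.Columns {
		if c.Name == "" {
			return nil, fmt.Errorf("columns[%d].name is required", i)
		}
	}
	return d, nil
}

// sample is an illustrative declaration, not taken from a real schema.
const sample = `{
  "file_declaration": {
    "delimiter": "|",
    "header_row_index": 1,
    "data_row_index": 2,
    "columns": [
      { "name": "COLUMN_1" },
      { "name": "COLUMN 2", "alias": "column_2" },
      { "name": "COLUMN_3" }
    ]
  }
}`

func main() {
	d, err := parseFileDecl(sample)
	if err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Printf("delimiter=%q data_row_index=%d columns=%d\n",
		d.Delimiter, *d.DataRowIndex, len(d.Columns))
}
```

Pointers are used for the two row indexes so that "absent" can be told apart from an explicit value, which matters because `data_row_index` is required while `header_row_index` is not.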

- `delimiter`: the string that separates column values on each line.
    - Note 1: the delimiter doesn't have to be a single character.
    - Note 2: the delimiter isn't limited to ASCII characters; UTF-8 is supported.

- `replace_double_quotes`: if `true`, omniparser replaces every occurrence of a double quote `"`
with a single quote `'`.

While the [CSV RFC](https://tools.ietf.org/html/rfc4180) clearly specifies the rules for using double
quotes and how to escape them, (un)surprisingly many CSV data providers have bad implementations,
and unescaped double quotes are seen quite frequently and break CSV parsing. The worst offender
is this:

```
COLUMN_1|COLUMN_2|COLUMN_3
...
data 1|"data 2 has a leading double quote" then some|data 3
...
```

Unfortunately Golang's [CSV parser](https://golang.org/pkg/encoding/csv/#Reader) cannot (nor should
it) deal with this situation, even if you set `LazyQuotes` to `true`: it starts gobbling everything
after the leading double quote in `COLUMN_2` as a string literal, even past the delimiter `|`.
Usually it consumes **many many** lines until it finally, by luck, hits another double quote that
closes the field.

Given that asking data providers/partners to fix double quote escaping according to the CSV RFC is
usually mission impossible in our experience, we've added the `replace_double_quotes` flag to do a
(frankly quite harsh) double quote to single quote replacement. Yes, the content is altered, but at
least the parsing and transform will succeed, and a minor difference in the result is the lesser evil.
**Use it only as a last resort**.
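
To see the failure mode in isolation, here is a minimal Go sketch that feeds the sample above (plus one extra, made-up data line) to `encoding/csv`, first as-is and then after a blunt quote replacement. The helper names `readPipeDelimited` and `replaceDoubleQuotes` are ours, and the `strings.ReplaceAll` call only stands in for what `replace_double_quotes` is described as doing; it is not omniparser's implementation.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// input is the sample from the text plus one extra illustrative data line.
const input = "COLUMN_1|COLUMN_2|COLUMN_3\n" +
	"data 1|\"data 2 has a leading double quote\" then some|data 3\n" +
	"data 4|data 5|data 6\n"

// readPipeDelimited parses s with Go's encoding/csv using '|' as the delimiter.
func readPipeDelimited(s string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(s))
	r.Comma = '|'
	r.LazyQuotes = true
	// Do not enforce a fixed field count, so the mangled record is returned
	// instead of being reported as a field-count error.
	r.FieldsPerRecord = -1
	return r.ReadAll()
}

// replaceDoubleQuotes is a stand-in for the replacement described above.
func replaceDoubleQuotes(s string) string {
	return strings.ReplaceAll(s, `"`, `'`)
}

func main() {
	raw, err := readPipeDelimited(input)
	fmt.Println("as-is:   ", len(raw), "records, err:", err)

	fixed, err := readPipeDelimited(replaceDoubleQuotes(input))
	fmt.Println("replaced:", len(fixed), "records, err:", err)
	if err == nil && len(fixed) > 1 {
		fmt.Printf("%q\n", fixed[1])
	}
}
```

With `LazyQuotes` on, the as-is run should report fewer records than there are input lines, because the field that opens with a double quote absorbs the rest of its line and the line after it. After the replacement, each line parses into its own three-field record, at the cost of the altered quote characters.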

- `header_row_index`: line number (1-based) where the column header declaration sits in the input.

If specified, omniparser reads the column headers from the given line and compares them against the
column names declared in the `columns` section. If the header read from the input has fewer values
than the declared columns, omniparser will fail; if it has more values than the declared columns,
the extra columns are ignored; if any column value mismatches, omniparser will fail. All header
verification failures are considered fatal, i.e. the entire ingestion and transform operation
against the input will fail.

If not specified, no header verification is done and omniparser will assume the column names and order
declared in the `columns` section for the input data.
71+
- `data_row_index`: line number (1-based) where the first actual data line starts in the input. Required.
72+
73+
- `columns.name`: the name of a column.
74+
- Note 1: it must match the corresponding column header value if `header_row_index` is specified.
75+
- Note 2: if name contains white space, then `alias` use is advised (to make XPath query possible).
76+
77+
- `columns.alias`: the alias of a column. Optional.
78+
79+
If a column's name contains space, while it's completely legitimate in CSV, it would make the XPath
80+
based transform hard/impossible later. In situations like this, we strongly advise schema writer to
81+
use `alias` to assign an alias to the column that is XPath friendly, such as containing no spaces.
