|
| 1 | +# CSV Schema in Depth |
| 2 | + |
| 3 | +The CSV (comma-separated values) schema covers any delimiter-based flat file format, not just |
| 4 | +comma-delimited ones. A complete `"omni.2.1"` CSV schema has 3 parts: `parser_settings`, `file_declaration`, and |
| 5 | +`transform_declarations`. We've covered `parser_settings` in [Getting Started](./gettingstarted.md) |
| 6 | +and `transform_declarations` in depth in [All About Transforms](./transforms.md). We've also covered |
| 7 | +some parts of the CSV `file_declaration` in [Getting Started](./gettingstarted.md); we'll go into more detail |
| 8 | +about it here. |
| 9 | + |
| 10 | +## CSV `file_declaration` |
| 11 | + |
| 12 | +The full `file_declaration` schema looks as follows: |
| 13 | +``` |
| 14 | +"file_declaration": { |
| 15 | + "delimiter": "<delimiter>", <= required |
| 16 | + "replace_double_quotes": true/false, <= optional |
| 17 | + "header_row_index": integer >= 1, <= optional |
| 18 | + "data_row_index": integer >= 1, <= required |
| 19 | + "columns": [ <= required, must not be empty array. |
| 20 | + { |
| 21 | + "name": "<column name>", <= required |
| 22 | + "alias": "<alias name>" <= optional |
| 23 | + }, |
| 24 | + ... |
| 25 | + ] |
| 26 | +} |
| 27 | +``` |
| 28 | + |
| 29 | +- `delimiter`: the string that separates column values in each line. Required. |
| 30 | + - Note 1: the delimiter doesn't have to be a single character. |
| 31 | + - Note 2: the delimiter isn't limited to ASCII characters; UTF-8 is supported. |
| 32 | + |
| 33 | +- `replace_double_quotes`: if `true`, omniparser will replace all occurrences of double quotes `"` with single |
| 34 | +quotes `'`. Optional. |
| 35 | + |
| 36 | + While the [CSV RFC](https://tools.ietf.org/html/rfc4180) clearly specifies the rules for using double |
| 37 | + quotes and how to escape them, (un)surprisingly many CSV data providers have bad implementations, |
| 38 | + and unescaped double quotes are seen quite frequently and break CSV parsing. The worst offender |
| 39 | + is this: |
| 40 | + |
| 41 | + ``` |
| 42 | + COLUMN_1|COLUMN_2|COLUMN_3 |
| 43 | + ... |
| 44 | + data 1|"data 2 has a leading double quote" then some|data 3 |
| 45 | + ... |
| 46 | + ``` |
| 47 | + |
| 48 | + Unfortunately Golang's [CSV parser](https://golang.org/pkg/encoding/csv/#Reader) cannot (nor should |
| 49 | + it) deal with this situation (even if you set `LazyQuotes` to `true`): it would start gobbling |
| 50 | + everything after the leading double quote in `COLUMN_2` as a string literal, even past the delimiter |
| 51 | + `|`. Usually it would consume **many, many** lines until it finally hits another leading quote by luck. |
| 52 | + |
| 53 | + In our experience, asking data providers/partners to fix double quote escaping according to the CSV RFC is |
| 54 | + usually mission impossible, so we've added the `replace_double_quotes` flag to do a |
| 55 | + (frankly quite harsh) double quote to single quote replacement. Yes, the content is altered, but at |
| 56 | + least the parsing and transform will succeed, and a minor difference in the result is the lesser evil. |
| 57 | + **Use it only as a last resort**. |
| 58 | + |
| 59 | +- `header_row_index`: line number (1-based) where the column header declaration sits in the input. |
| 60 | + |
| 61 | + If specified, omniparser will read the column headers from the given line and compare them |
| 62 | + against the declared column names specified in the `columns` section. If the header line read from the |
| 63 | + input has fewer values than the declared columns, omniparser will fail; if it |
| 64 | + has more values than the declared columns, the excess actual columns are ignored; |
| 65 | + if any column value mismatches, omniparser will fail. All header verification failures are considered |
| 66 | + fatal, i.e. the entire ingestion and transform operation against the input will fail. |
| 67 | + |
| 68 | + If not specified, no header verification is done and omniparser will assume the column names and order |
| 69 | + declared in the `columns` section for the input data. |
| 70 | + |
| 71 | +- `data_row_index`: line number (1-based) where the first actual data line starts in the input. Required. |
| 72 | + |
| 73 | +- `columns.name`: the name of a column. |
| 74 | + - Note 1: it must match the corresponding column header value if `header_row_index` is specified. |
| 75 | + - Note 2: if the name contains whitespace, using `alias` is advised (to make XPath queries possible). |
| 76 | + |
| 77 | +- `columns.alias`: the alias of a column. Optional. |
| 78 | + |
| 79 | + A column name containing spaces is completely legitimate in CSV, but it would make XPath-based |
| 80 | + transforms hard or impossible later. In situations like this, we strongly advise the schema writer to |
| 81 | + use `alias` to assign the column an XPath-friendly alias, such as one containing no spaces. |
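
To make the `replace_double_quotes` discussion above concrete, here is a minimal Go sketch that feeds the "worst offender" sample to the standard `encoding/csv` reader, first as-is and then after a blanket double quote to single quote replacement. The helper names `parsePipeDelimited` and `replaceDoubleQuotes` are hypothetical and exist only for this illustration; omniparser's actual implementation is not shown here and may differ (for example, it need not operate on the whole input as one string).

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// parsePipeDelimited reads all records from a '|'-delimited input using Go's
// standard encoding/csv reader in its default (strict quote) mode.
func parsePipeDelimited(input string) ([][]string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = '|'
	return r.ReadAll()
}

// replaceDoubleQuotes is a hypothetical stand-in for the effect described for
// the `replace_double_quotes` flag: every `"` becomes `'`.
func replaceDoubleQuotes(input string) string {
	return strings.ReplaceAll(input, `"`, `'`)
}

func main() {
	input := "COLUMN_1|COLUMN_2|COLUMN_3\n" +
		"data 1|\"data 2 has a leading double quote\" then some|data 3\n" +
		"data 4|data 5|data 6\n"

	// COLUMN_2 starts with a quoted section followed by more text, which
	// the strict reader rejects as a quoting error.
	if _, err := parsePipeDelimited(input); err != nil {
		fmt.Println("raw input:", err)
	}

	// After the replacement there are no double quotes left, so each line
	// splits cleanly on the delimiter.
	records, err := parsePipeDelimited(replaceDoubleQuotes(input))
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	for _, rec := range records {
		fmt.Printf("%q\n", rec)
	}
}
```

The sketch shows the trade-off described above: the value in `COLUMN_2` comes back with single quotes instead of double quotes, but every line still yields three fields.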
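
The header verification rules described under `header_row_index` can be sketched in Go as follows. `verifyHeader` is a hypothetical helper written for this illustration, not omniparser's actual code; in particular, the exact string comparison (no trimming or case folding) is an assumption.

```go
package main

import "fmt"

// verifyHeader is a hypothetical helper that mirrors the documented rules:
//   - fewer actual header values than declared columns: failure
//   - more actual header values than declared columns: extras are ignored
//   - any mismatch between an actual value and the declared name: failure
//
// Assumption: values are compared with exact string equality.
func verifyHeader(actual, declared []string) error {
	if len(actual) < len(declared) {
		return fmt.Errorf("header has %d columns, fewer than the %d declared",
			len(actual), len(declared))
	}
	for i, name := range declared {
		if actual[i] != name {
			return fmt.Errorf("header column %d is %q, expected %q",
				i+1, actual[i], name)
		}
	}
	return nil
}

func main() {
	declared := []string{"COLUMN_1", "COLUMN_2"}

	// Extra trailing column in the input header: accepted, extra ignored.
	fmt.Println(verifyHeader([]string{"COLUMN_1", "COLUMN_2", "COLUMN_3"}, declared))

	// Too few columns in the input header: fatal.
	fmt.Println(verifyHeader([]string{"COLUMN_1"}, declared))

	// Name mismatch: fatal.
	fmt.Println(verifyHeader([]string{"COLUMN_1", "COLUMN_X"}, declared))
}
```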