|
| 1 | +# Fixed-Length Schema in Depth |
| 2 | + |
| 3 | +A fixed-length (sometimes called fixed-width) schema has 3 parts: `parser_settings`, `file_declaration`, |
| 4 | +and `transform_declarations`. We've covered `parser_settings` in [Getting Started](./gettingstarted.md); |
| 5 | +we've covered `transform_declarations` in depth in [All About Transforms](./transforms.md). Before we go |
| 6 | +into `file_declaration`, we need to talk about the concept of `envelope`. |
| 7 | + |
| 8 | +An `envelope` is the basic unit of data that omniparser ingests from a fixed-length input file, then |
| 9 | +processes and transforms. Here are a few examples: |
| 10 | + |
| 11 | +- Single line `envelope` (from [this sample](../extensions/omniv21/samples/fixedlength/1_single_row.input.txt)): |
| 12 | + ``` |
| 13 | + 2019/01/31T12:34:56-0800 10.5 30.2 N 33 37.7749 122.4194 |
| 14 | + 2020/07/31T01:23:45-0500 39 95 SE 8 32.7767 96.7970 |
| 15 | + ``` |
| 16 | + Here each line in the input is an `envelope`, individually ingested, processed and transformed. |
| 17 | +
|
| 18 | +- Multi-line `envelope` (from [this sample](../extensions/omniv21/samples/fixedlength/2_multi_rows.input.txt)): |
| 19 | + ``` |
| 20 | + H001TW 0689311345 030 DEL HMDEPOT 20190826124704 EST 1230004321 W841206858 James Bond 20190827 RHONDA W841206858 128 Bird Ave HAPPYVALLEY FL54321 USELGS 030 1 105.00 THDNR 1 1003164515 DLB000977979 Appt set between 00 00 AND 00 00 on 08 27 19 AG 21200 1600 7029083C 314 Algebra Blvd MEDIAN FL31415 US |
| 21 | + H0020689311345 BM 040670500067120 |
| 22 | + H0020689311345 CN 100000103732 |
| 23 | + H0020689311345 OA 2400 Highway 155 South |
| 24 | + H0020689311345 OC Locust Grove |
| 25 | + H0020689311345 ON THE HOME DEPOT 6705 |
| 26 | + H0020689311345 OO W841206858 |
| 27 | + H0020689311345 OP 30248 |
| 28 | + H0020689311345 OST GA |
| 29 | + H0020689311345 OT US |
| 30 | + H0020689311345 PO 7029083C |
| 31 | + H001TW 0689311348 030 DEL HMDEPOT 20190826124704 EST 1230001234 W938003272 Jason Bourne 20190827 RHONDA W938003272 123 S 45ST ST MAGIC BEACH FL12345 USELGS 030 1 1.00 THDNR 1 1003621621 Appt set between 00 00 AND 00 00 on 08 27 19 AG 21030 1430 6816137C 314 Algebra Blvd MEDIAN FL31415 US |
| 32 | + H0020689311348 BM 040677700277714 |
| 33 | + H0020689311348 CN |
| 34 | + H0020689311348 OA 2500 Highway 155 South |
| 35 | + H0020689311348 OC Locust Grove |
| 36 | + H0020689311348 ON THE HOME DEPOT 6777 |
| 37 | + H0020689311348 OO W938003272 |
| 38 | + H0020689311348 OP 30248 |
| 39 | + H0020689311348 OST GA |
| 40 | + H0020689311348 OT US |
| 41 | + H0020689311348 PO 6816137C |
| 42 | + ``` |
| 43 | + Each `envelope` consists of 11 consecutive lines, and these 11 consecutive lines are ingested, processed and |
| 44 | + transformed as a unit. |
| 45 | +
|
| 46 | + A single-line `envelope` is simply a special case of a multi-line `envelope`. |
| 47 | +
|
| 48 | +- Variable length `envelope` encapsulated by a `header` and a `footer` (from |
| 49 | +[this sample](../extensions/omniv21/samples/fixedlength/3_header_footer.input.txt)): |
| 50 | + ``` |
| 51 | + A010 20191105 |
| 52 | + ... |
| 53 | + ... |
| 54 | + A999 |
| 55 | + V010 V |
| 56 | + ... |
| 57 | + ... |
| 58 | + V999 |
| 59 | + V010 V |
| 60 | + ... |
| 61 | + ... |
| 62 | + V999 |
| 63 | + ... |
| 64 | + ... |
| 65 | + Z001 |
| 66 | + Z999 |
| 67 | + ``` |
| 68 | + Here we have 3 different `envelopes`: |
| 69 | + - first `envelope` is encapsulated by `A010` and `A999`. This envelope has one occurrence. |
| 70 | + - second `envelope` is encapsulated by `V010` and `V999`. This envelope has multiple occurrences. |
| 71 | + - last `envelope` is encapsulated by `Z001` and `Z999`. This envelope has one occurrence. |
| 72 | +
|
| 73 | +## `file_declaration` for Fixed Rows `envelope` |
| 74 | +
|
| 75 | +The `file_declaration` section looks as follows when envelopes have a fixed number of rows: |
| 76 | +``` |
| 77 | +"file_declaration": { |
| 78 | + "envelopes": [ <= required |
| 79 | + { |
| 80 | + "by_rows": <integer value>, <= optional |
| 81 | + "columns": [ <= optional |
| 82 | + { |
| 83 | + "name": "<unique column name>", <= required |
| 84 | + "start_pos": <integer>, <= required |
| 85 | + "length": <integer>, <= required |
| 86 | + "line_pattern": "<line regexp>" <= optional |
| 87 | + }, |
| 88 | + ... |
| 89 | + ] |
| 90 | + } |
| 91 | + ] |
| 92 | +} |
| 93 | +``` |
| 94 | +Note **a `by_rows` fixed-length schema must have one and only one envelope**. |
| 95 | +
|
| 96 | +- `by_rows`: defines how many consecutive lines form an `envelope`. If not specified, defaults to 1. |
| 97 | +
|
| 98 | +- `columns`: defines the columns/fields that hold the specific pieces of data extracted from |
| 99 | +those rows. |
| 100 | +
|
| 101 | +- `columns.name`: name of a column, used as the IDR node name and later in XPath queries in |
| 102 | +`transform_declarations`. Required and must be unique. |
| 103 | +
|
| 104 | +- `columns.start_pos`: specifies the starting character position (1-based) for the column's data. |
| 105 | +
|
| 106 | +- `columns.length`: specifies the length of the column's data. |
| 107 | +
|
| 108 | +- `columns.line_pattern`: used in a multi-line `envelope`; a regexp that identifies the line from which |
| 109 | +this column's data will be extracted. |
| 110 | +
|
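The `start_pos`/`length` semantics above can be illustrated with a small Python sketch. This is not omniparser's code: `extract_column` is a hypothetical helper, the input line is the first line of the single-row sample as rendered earlier on this page, and the handling of a line shorter than the column (plain slice truncation) is an assumption.

```python
def extract_column(line, start_pos, length):
    """Return `length` characters of `line`, starting at 1-based `start_pos`."""
    begin = start_pos - 1  # schema positions are 1-based, Python indexes are 0-based
    return line[begin:begin + length]  # assumption: a short line is simply truncated


# First line of the single-row sample, as rendered on this page.
line = "2019/01/31T12:34:56-0800 10.5 30.2 N 33 37.7749 122.4194"

date = extract_column(line, 1, 24)
high_temp_c = extract_column(line, 26, 4)
```

The two calls use the `DATE` and `HIGH_TEMP_C` positions from the single-line example in this section.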
| 111 | +An example of single-line envelope `file_declaration` might look like: |
| 112 | +``` |
| 113 | +"file_declaration": { |
| 114 | + "envelopes" : [ |
| 115 | + { |
| 116 | + "columns": [ |
| 117 | + { "name": "DATE", "start_pos": 1, "length": 24 }, |
| 118 | + { "name": "HIGH_TEMP_C", "start_pos": 26, "length": 4 }, |
| 119 | + ... |
| 120 | + ] |
| 121 | + } |
| 122 | + ] |
| 123 | +}, |
| 124 | +``` |
| 125 | +
|
| 126 | +An example of multi-line envelope `file_declaration` might look like: |
| 127 | +``` |
| 128 | +"file_declaration": { |
| 129 | + "envelopes" : [ |
| 130 | + { |
| 131 | + "by_rows": 11, |
| 132 | + "columns": [ |
| 133 | + { "name": "tracking_number_h001", "start_pos": 464, "length": 30, "line_pattern": "^H001" }, |
| 134 | + ... |
| 135 | + { "name": "tracking_number_h002_cn", "start_pos": 28, "length": 50, "line_pattern": "^H002.{19}CN" } |
| 136 | + ] |
| 137 | + } |
| 138 | + ] |
| 139 | +}, |
| 140 | +``` |
| 141 | +
|
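How `by_rows` and `line_pattern` work together can be sketched in Python as follows. This is an illustration only, not omniparser's implementation: `envelopes_by_rows` and `column_value` are hypothetical helpers, the input lines are made up, and both yielding a trailing incomplete group and taking the first matching line are assumptions.

```python
import re
from itertools import islice


def envelopes_by_rows(lines, by_rows=1):
    """Group consecutive lines into envelopes of `by_rows` lines each."""
    it = iter(lines)
    while True:
        envelope = list(islice(it, by_rows))
        if not envelope:
            return
        yield envelope  # assumption: a trailing incomplete group is still yielded


def column_value(envelope, start_pos, length, line_pattern=None):
    """Read a column from the first line of the envelope matching `line_pattern`."""
    for line in envelope:
        if line_pattern is None or re.search(line_pattern, line):
            begin = start_pos - 1  # 1-based position to 0-based index
            return line[begin:begin + length]
    return None  # no line in this envelope matched the pattern


# Made-up input: two envelopes of two lines each.
lines = ["H001AAA", "H002 CN 123", "H001BBB", "H002 CN 456"]
envelopes = list(envelopes_by_rows(lines, by_rows=2))
first_h001 = column_value(envelopes[0], 5, 3, "^H001")
second_cn = column_value(envelopes[1], 9, 3, "^H002.CN")
```

Each envelope is handled independently, so the same column declaration yields one value per envelope.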
| 142 | +## `file_declaration` for Header/Footer `envelope` |
| 143 | +
|
| 144 | +The `file_declaration` section looks as follows when envelopes are bounded by a header and a footer: |
| 145 | +``` |
| 146 | +"file_declaration": { |
| 147 | + "envelopes": [ <= required |
| 148 | + { |
| 149 | + "name": "<unique envelope name>", <= optional |
| 150 | + "by_header_footer": { <= required |
| 151 | + "header": "<header regexp>", <= required |
| 152 | + "footer": "<footer regexp>" <= required |
| 153 | + }, |
| 154 | + "not_target": <true/false>, <= optional |
| 155 | + "columns": [ <= optional |
| 156 | + { |
| 157 | + "name": "<unique column name>", <= required |
| 158 | + "start_pos": <integer>, <= required |
| 159 | + "length": <integer>, <= required |
| 160 | + "line_pattern": "<line regexp>" <= optional |
| 161 | + }, |
| 162 | + ... |
| 163 | + ] |
| 164 | + }, |
| 165 | + ... |
| 166 | + ] |
| 167 | +} |
| 168 | +``` |
| 169 | +Note there can be multiple envelopes for a fixed-length schema when dealing with envelopes bounded by |
| 170 | +headers/footers. |
| 171 | +
|
| 172 | +Unlike `by_rows` envelopes and schemas, which are simple and easy to understand, `by_header_footer` |
| 173 | +envelopes and schemas need a bit of explanation before we dive into each schema attribute. |
| 174 | +
|
| 175 | +**A `by_header_footer` fixed-length schema can have multiple envelopes**. The envelopes must be declared |
| 176 | +in the order in which they appear in the input, although each one is optional. (We could've |
| 177 | +made out-of-order envelope matching possible, but in our experience such scenarios are rare and |
| 178 | +not worth the added complexity.) Of these envelopes, **one and only one must be |
| 179 | +the target envelope**, instances of which omniparser will ingest, transform and return. All other, |
| 180 | +non-target envelopes are considered global envelopes. Each envelope can have 0 or more instances |
| 181 | +in the input. All global envelope instances are kept in the IDR tree permanently, while a target |
| 182 | +envelope instance is kept in the returned IDR tree only until the client's next `Read()` call, |
| 183 | +which makes it possible to stream-process large files without holding them entirely in memory. |
| 184 | +
|
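The split between global and target envelopes described above can be sketched as a Python generator. This is a simplified illustration, not omniparser's implementation: `read_target_envelopes` is a hypothetical function, envelopes are plain lists of lines rather than IDR nodes, declaration order is not enforced, every declaration is assumed to carry a `name`, and the `A060`/`V020` lines are made-up sample data.

```python
import re


def read_target_envelopes(lines, envelope_decls):
    """Yield (target_envelope_lines, global_instances) one target at a time."""
    global_instances = {}  # global envelope instances are kept for the whole run
    decl, buf = None, []
    for line in lines:
        if decl is None:
            # Look for a declared envelope whose header matches this line.
            decl = next((d for d in envelope_decls if re.search(d["header"], line)), None)
            if decl is None:
                continue  # line is outside any declared envelope
            buf = []
        buf.append(line)
        # The footer may match the header line itself (single-line envelope).
        if re.search(decl["footer"], line):
            finished, decl = decl, None
            if finished.get("not_target", False):
                global_instances.setdefault(finished["name"], []).append(buf)
            else:
                yield buf, global_instances  # only this target instance is handed out


decls = [
    {"name": "GLOBAL", "header": "^A010", "footer": "^A999", "not_target": True},
    {"name": "TARGET", "header": "^V010", "footer": "^V999"},
    {"name": "FOOTER", "header": "^Z001", "footer": "^Z999", "not_target": True},
]
lines = ["A010 20191105", "A060 UPS", "A999",
         "V010 V", "V020 111", "V999",
         "V010 V", "V020 222", "V999",
         "Z001", "Z999"]
results = list(read_target_envelopes(lines, decls))
```

Because the function is a generator, a caller that iterates lazily holds one target envelope at a time, which mirrors the role of successive `Read()` calls in the description above.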
| 185 | +Typically, "global" envelopes are the global header and footer of an input, and they usually have |
| 186 | +few instances (usually 1). The target envelope is for the repeating data blocks of |
| 187 | +an input. If you look at the example shown earlier under *"Variable length `envelope` encapsulated by |
| 188 | +a `header` and a `footer`"*, you can see the global header envelope is wrapped by `A010` and `A999`; |
| 189 | +the global footer envelope is wrapped by `Z001` and `Z999`; and the repeating data block envelope is |
| 190 | +wrapped by `V010` and `V999`. |
| 191 | +
|
| 192 | +Each `by_header_footer` envelope has a unique name: if the name is not given in the schema, |
| 193 | +a random, unique one is generated. When an instance of the target envelope is |
| 194 | +returned, the returned IDR tree node is anchored on the target envelope (see more details about the |
| 195 | +fixed-length IDR tree structure [here](./idr.md#fixed-length-mostly-txt)), so XPath queries for |
| 196 | +data inside the target envelope don't need the node name. As a result, there is usually no need to |
| 197 | +specify a name for the target envelope. If, however, a transform needs data from a global envelope, |
| 198 | +typically some data/info from the global header envelope, then that global envelope |
| 199 | +should be named, and the transform can refer to the data with XPath queries like |
| 200 | +`../<global_envelope_name>/<path_to_such-data>`. Understanding the IDR tree structure for the fixed-length |
| 201 | +format is the key to understanding how to extract data. |
| 202 | +
|
| 203 | +- `name`: a unique name of the envelope. Optional. If not specified, which is common, a unique id |
| 204 | +is generated by omniparser. |
| 205 | +
|
| 206 | +- `by_header_footer.header`: a regexp pattern that identifies the first line of the envelope. |
| 207 | +
|
| 208 | +- `by_header_footer.footer`: a regexp pattern that identifies the last line of the envelope. Note |
| 209 | +`footer` can be the same as `header` when the envelope is a single line. |
| 210 | +
|
| 211 | +- `not_target`: whether the envelope is a target envelope or not. Optional, and by default it's false: |
| 212 | +the envelope is a target envelope. Note there can be one and only one target envelope in a |
| 213 | +`by_header_footer` fixed-length schema. |
| 214 | +
|
| 215 | +- `columns`: identical to `columns` in a `by_rows` fixed-length schema. |
| 216 | +
|
| 217 | +A schema for the earlier example might look like this: |
| 218 | +``` |
| 219 | +"file_declaration": { |
| 220 | + "envelopes" : [ |
| 221 | + { |
| 222 | + "name": "GLOBAL", |
| 223 | + "by_header_footer": { "header": "^A010", "footer": "^A999" }, |
| 224 | + "not_target": true, |
| 225 | + "columns": [ { "name": "carrier", "start_pos": 6, "length": 6, "line_pattern": "^A060" } ] |
| 226 | + }, |
| 227 | + { |
| 228 | + "by_header_footer": { "header": "^V010", "footer": "^V999" }, |
| 229 | + "columns": [ |
| 230 | + { "name": "tracking_number", "start_pos": 6, "length": 15, "line_pattern": "^V020" }, |
| 231 | + { "name": "delivery_date", "start_pos": 6, "length": 8, "line_pattern": "^V045" }, |
| 232 | + ... |
| 233 | + ] |
| 234 | + }, |
| 235 | + { |
| 236 | + "by_header_footer": { "header": "^Z001", "footer": "^Z999" }, |
| 237 | + "not_target": true |
| 238 | + } |
| 239 | + ] |
| 240 | +}, |
| 241 | +``` |
| 242 | +
|
| 243 | +Note 1: we want a `carrier` field in the `FINAL_OUTPUT`, and the carrier information |
| 244 | +is in the global header envelope, so we name the global header envelope `GLOBAL` and |
| 245 | +refer to the carrier name data by XPath `../GLOBAL/carrier`. |
| 246 | +
|
| 247 | +Note 2: we need no data from the global footer envelope, so we can simply keep it "nameless" ( |
| 248 | +though technically that is not true, as omniparser will assign a random/unique name to it). |
| 249 | +
|
| 250 | +Note 3: the target envelope is "nameless", and it is the only envelope that does not have `"not_target": true`. |
| 251 | +
|
| 252 | +## Fixed-Length IDR Structure |
| 253 | +
|
| 254 | +See [here](./idr.md#fixed-length-mostly-txt) for more details. |