Commit 6fc8075

fixed-length doc (#128)
1 parent 3048746 commit 6fc8075

2 files changed

+258
-0
lines changed

doc/csv_in_depth.md

+4
@@ -79,3 +79,7 @@ quotes `'`.
If a column's name contains spaces, which is completely legitimate in CSV, it makes XPath-based
transforms hard or impossible later. In situations like this, we strongly advise the schema writer to
use `alias` to assign the column an XPath-friendly alias, such as one containing no spaces.

## CSV Specific IDR Structure

See [here](./idr.md#csv-aka-delimited) for more details.

doc/fixedlength_in_depth.md

+254
@@ -0,0 +1,254 @@
# Fixed-Length Schema in Depth

A fixed-length (sometimes also called fixed-width) schema has 3 parts: `parser_settings`, `file_declaration`,
and `transform_declarations`. We've covered `parser_settings` in [Getting Started](./gettingstarted.md)
and `transform_declarations` in depth in [All About Transforms](./transforms.md). Before we go
into `file_declaration`, we need to talk about the concept of an `envelope`.

An `envelope` is the basic unit of data that omniparser ingests from a fixed-length input file, then
processes and transforms. Here are a few examples:

- Single line `envelope` (from [this sample](../extensions/omniv21/samples/fixedlength/1_single_row.input.txt)):
```
2019/01/31T12:34:56-0800 10.5 30.2 N 33 37.7749 122.4194
2020/07/31T01:23:45-0500 39 95 SE 8 32.7767 96.7970
```
Here each line in the input is an `envelope`, individually ingested, processed and transformed.

- Multi-line `envelope` (from [this sample](../extensions/omniv21/samples/fixedlength/2_multi_rows.input.txt)):
```
H001TW 0689311345 030 DEL HMDEPOT 20190826124704 EST 1230004321 W841206858 James Bond 20190827 RHONDA W841206858 128 Bird Ave HAPPYVALLEY FL54321 USELGS 030 1 105.00 THDNR 1 1003164515 DLB000977979 Appt set between 00 00 AND 00 00 on 08 27 19 AG 21200 1600 7029083C 314 Algebra Blvd MEDIAN FL31415 US
H0020689311345 BM 040670500067120
H0020689311345 CN 100000103732
H0020689311345 OA 2400 Highway 155 South
H0020689311345 OC Locust Grove
H0020689311345 ON THE HOME DEPOT 6705
H0020689311345 OO W841206858
H0020689311345 OP 30248
H0020689311345 OST GA
H0020689311345 OT US
H0020689311345 PO 7029083C
H001TW 0689311348 030 DEL HMDEPOT 20190826124704 EST 1230001234 W938003272 Jason Bourne 20190827 RHONDA W938003272 123 S 45ST ST MAGIC BEACH FL12345 USELGS 030 1 1.00 THDNR 1 1003621621 Appt set between 00 00 AND 00 00 on 08 27 19 AG 21030 1430 6816137C 314 Algebra Blvd MEDIAN FL31415 US
H0020689311348 BM 040677700277714
H0020689311348 CN
H0020689311348 OA 2500 Highway 155 South
H0020689311348 OC Locust Grove
H0020689311348 ON THE HOME DEPOT 6777
H0020689311348 OO W938003272
H0020689311348 OP 30248
H0020689311348 OST GA
H0020689311348 OT US
H0020689311348 PO 6816137C
```
Each `envelope` consists of 11 consecutive lines, and these 11 consecutive lines are ingested, processed and
transformed as a unit.

A single-line `envelope` is simply a special case of a multi-line `envelope`.

- Variable length `envelope` encapsulated by a `header` and a `footer` (from
[this sample](../extensions/omniv21/samples/fixedlength/3_header_footer.input.txt)):
```
A010 20191105
...
...
A999
V010 V
...
...
V999
V010 V
...
...
V999
...
...
Z001
Z999
```
Here we have 3 different `envelopes`:
- The first `envelope` is encapsulated by `A010` and `A999`. This envelope has one occurrence.
- The second `envelope` is encapsulated by `V010` and `V999`. This envelope has multiple occurrences.
- The last `envelope` is encapsulated by `Z001` and `Z999`. This envelope has one occurrence.

## `file_declaration` for Fixed Rows `envelope`

The `file_declaration` section looks as follows when dealing with envelopes of a fixed number of rows:
```
"file_declaration": {
    "envelopes": [                                      <= required
        {
            "by_rows": <integer value>,                 <= optional
            "columns": [                                <= optional
                {
                    "name": "<unique column name>",     <= required
                    "start_pos": <integer>,             <= required
                    "length": <integer>,                <= required
                    "line_pattern": "<line regexp>"     <= optional
                },
                ...
            ]
        }
    ]
}
```
Note **for a `by_rows` fixed-length schema, there must be one and only one envelope**.

- `by_rows`: defines how many consecutive lines form an `envelope`. If not specified, defaults to 1.

- `columns`: defines a number of columns/fields that will hold specific pieces of data extracted from
those rows.

- `columns.name`: name of a column, used as the IDR node name and later in XPath queries in
`transform_declarations`. Required and must be unique.

- `columns.start_pos`: specifies the starting character position (1-based) for the column's data.

- `columns.length`: specifies the length of the column's data.

- `columns.line_pattern`: used in a multi-line `envelope`, where the pattern identifies the line from
which this column's data will be extracted.

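To make the 1-based `start_pos` convention concrete, here is a minimal Python sketch of how a column's value could be cut out of a line. It is an illustration only, not omniparser's implementation: the helper name `extract_column`, the shortened sample line, and the handling of lines that are too short are all assumptions.

```python
def extract_column(line: str, start_pos: int, length: int) -> str:
    """Return the part of `line` selected by a column declaration.

    `start_pos` is 1-based, as in the schema, while Python slices are
    0-based, hence the `- 1`.
    Assumption: a line shorter than the column yields a shorter (possibly
    empty) value instead of an error.
    """
    begin = start_pos - 1
    return line[begin:begin + length]


# Hypothetical, shortened input line: a 24-character timestamp, one space,
# then a 4-character temperature reading.
line = "2019/01/31T12:34:56-0800 10.5"

date = extract_column(line, start_pos=1, length=24)
high_temp_c = extract_column(line, start_pos=26, length=4)
```

With this line, `date` holds the timestamp and `high_temp_c` holds `10.5`, because position 26 is the first character after the separating space.
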
An example of a single-line envelope `file_declaration` might look like:
```
"file_declaration": {
    "envelopes" : [
        {
            "columns": [
                { "name": "DATE", "start_pos": 1, "length": 24 },
                { "name": "HIGH_TEMP_C", "start_pos": 26, "length": 4 },
                ...
            ]
        }
    ]
},
```

An example of a multi-line envelope `file_declaration` might look like:
```
"file_declaration": {
    "envelopes" : [
        {
            "by_rows": 11,
            "columns": [
                { "name": "tracking_number_h001", "start_pos": 464, "length": 30, "line_pattern": "^H001" },
                ...
                { "name": "tracking_number_h002_cn", "start_pos": 28, "length": 50, "line_pattern": "^H002.{19}CN" }
            ]
        }
    ]
},
```

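The interplay of `by_rows` and `line_pattern` can be sketched in a few lines of Python. This is a simplified model, not omniparser's actual code: the function names, the tiny two-row input, and two behaviors (an incomplete trailing group is dropped; the first line matching `line_pattern` wins, and a column without a pattern reads from the first line) are assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, Iterator, List


def envelopes_by_rows(lines: Iterable[str], by_rows: int = 1) -> Iterator[List[str]]:
    """Yield consecutive groups of `by_rows` lines; each group is one envelope."""
    batch: List[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == by_rows:
            yield batch
            batch = []
    # Assumption: an incomplete trailing group is silently dropped.


def read_columns(envelope: List[str], columns: List[dict]) -> Dict[str, str]:
    """Extract every declared column from one envelope."""
    record: Dict[str, str] = {}
    for col in columns:
        pattern = col.get("line_pattern")
        for line in envelope:
            # Assumption: the first matching line supplies the value.
            if pattern is None or re.search(pattern, line):
                begin = col["start_pos"] - 1  # schema positions are 1-based
                record[col["name"]] = line[begin:begin + col["length"]]
                break
    return record


# Hypothetical input: every envelope is 2 rows, an H001 row and an H002 row.
lines = ["H001 ORDER-1", "H002 CN 111", "H001 ORDER-2", "H002 CN 222"]
columns = [
    {"name": "order", "start_pos": 6, "length": 7, "line_pattern": "^H001"},
    {"name": "cn", "start_pos": 9, "length": 3, "line_pattern": "^H002 CN"},
]
records = [read_columns(env, columns) for env in envelopes_by_rows(lines, by_rows=2)]
```

Each entry of `records` corresponds to one envelope, with column values pulled from whichever row its `line_pattern` selected.
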
## `file_declaration` for Header/Footer `envelope`

The `file_declaration` section looks as follows when dealing with envelopes bounded by header/footer:
```
"file_declaration": {
    "envelopes": [                                      <= required
        {
            "name": "<unique envelope name>",           <= optional
            "by_header_footer": {                       <= required
                "header": "<header regexp>",            <= required
                "footer": "<footer regexp>"             <= required
            },
            "not_target": <true/false>,                 <= optional
            "columns": [                                <= optional
                {
                    "name": "<unique column name>",     <= required
                    "start_pos": <integer>,             <= required
                    "length": <integer>,                <= required
                    "line_pattern": "<line regexp>"     <= optional
                },
                ...
            ]
        },
        ...
    ]
}
```
Note there can be multiple envelopes in a fixed-length schema when dealing with envelopes bounded by
headers/footers.

Unlike `by_rows` envelopes and schemas, which are simple and easy to understand, `by_header_footer`
envelopes and schemas need a bit of explanation before we dive into each schema attribute.

**A `by_header_footer` fixed-length schema can have multiple envelopes**. The order of these envelopes
must match the order in which they appear in the input file, although each one is optional. (We could
have made out-of-order envelope matching possible, but in our experience such scenarios rarely occur,
so they are not worth the added complexity.) Of these envelopes, **one and only one must be
the target envelope**, instances of which omniparser will ingest, transform and return. All other,
non-target envelopes are considered global envelopes. Each envelope can have 0 or more instances
in the input. All global envelopes' instances are kept permanently in the IDR tree, while a target
envelope instance is kept in the returned IDR tree only until the client invokes the next `Read()` call,
which makes it possible to stream-process large files without running into memory constraints.

Typically, "global" envelopes are for the global header and footer of an input, and they usually have
a limited number of instances (usually 1). The target envelope is for the repeating data blocks of
an input. If you look at the example shown earlier under *"Variable length `envelope` encapsulated by
a `header` and a `footer`"*, you can see the global header envelope is wrapped by `A010` and `A999`;
the global footer envelope is wrapped by `Z001` and `Z999`; and the repeating data block envelope is
wrapped by `V010` and `V999`.

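As a rough mental model of header/footer matching, the Python sketch below walks the input line by line and emits one envelope instance each time a footer is seen. It is not omniparser's implementation: the function name `split_envelopes` and the sample lines are hypothetical, lines outside any envelope are simply skipped, and the ordering rule described above is not enforced.

```python
import re
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def split_envelopes(
    lines: Iterable[str], decls: List[Dict[str, str]]
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (declaration index, lines) for each envelope instance found."""
    open_idx: Optional[int] = None
    collected: List[str] = []
    for line in lines:
        if open_idx is None:
            # Look for a declaration whose header matches this line.
            open_idx = next(
                (i for i, d in enumerate(decls) if re.search(d["header"], line)),
                None,
            )
            if open_idx is None:
                continue  # assumption: lines outside any envelope are skipped
            collected = []
        collected.append(line)
        # The footer is checked on the header line too, so an envelope whose
        # header and footer are the same pattern spans exactly one line.
        if re.search(decls[open_idx]["footer"], line):
            yield open_idx, collected
            open_idx = None


decls = [
    {"header": "^A010", "footer": "^A999"},  # global header envelope
    {"header": "^V010", "footer": "^V999"},  # repeating data block
    {"header": "^Z001", "footer": "^Z999"},  # global footer envelope
]
# Hypothetical input lines in the shape of the earlier example.
lines = [
    "A010 20191105", "A999",
    "V010 V", "V020 first", "V999",
    "V010 V", "V020 second", "V999",
    "Z001", "Z999",
]
found = list(split_envelopes(lines, decls))
```

Because `split_envelopes` is a generator, a caller can handle one instance at a time, which loosely mirrors how a target envelope instance is handed out per `Read()` call rather than the whole file being loaded at once.
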
Each `by_header_footer` envelope has a unique name: if the name is not given directly in the schema,
a random, unique one is generated. When an instance of the target envelope is
returned, the returned IDR tree node is anchored on the target envelope (see more details about the
fixed-length IDR tree structure [here](./idr.md#fixed-length-mostly-txt)), so XPath queries for any
data inside the target envelope don't need the node name. As a result, there is usually no need to
specify a name for the target envelope. If, however, the transform needs data from the global
envelopes, typically some data/info from the global header envelope, then such global envelopes
should be named, and the transform can refer to such data with XPath queries like
`../<global_envelope_name>/<path_to_such-data>`. Understanding the IDR tree structure for the
fixed-length format is the key to understanding how you can extract data.

- `name`: a unique name of the envelope. Optional. If not specified, which is common, a unique id
is generated by omniparser.

- `by_header_footer.header`: a regexp pattern that identifies the line at which the envelope begins.

- `by_header_footer.footer`: a regexp pattern that identifies the line at which the envelope ends. Note
`footer` can be the same as `header`, in which case the envelope consists of a single line.

- `not_target`: marks the envelope as a non-target (global) envelope when true. Optional, and by default
it's false, meaning the envelope is a target envelope. Note there must be one and only one target
envelope in a `by_header_footer` fixed-length schema.

- `columns`: identical to `columns` in a `by_rows` fixed-length schema.

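The "one and only one target envelope" rule is easy to get wrong when several envelopes are declared, so here is a small Python sketch of a check a schema author could run over the `envelopes` list. The helper `target_index` is hypothetical and is not part of omniparser; it only restates the rule above in code.

```python
from typing import List


def target_index(envelopes: List[dict]) -> int:
    """Return the position of the single target envelope.

    An envelope counts as a target unless it sets "not_target" to true,
    matching the default described above.
    """
    targets = [i for i, env in enumerate(envelopes) if not env.get("not_target", False)]
    if len(targets) != 1:
        raise ValueError(f"expected exactly 1 target envelope, found {len(targets)}")
    return targets[0]


# Envelope declarations reduced to the fields this check looks at.
envelopes = [
    {"name": "GLOBAL", "not_target": True},
    {},  # no "not_target": this is the target envelope
    {"not_target": True},
]
idx = target_index(envelopes)
```

Here `idx` points at the second declaration, the only one that does not opt out with `"not_target": true`.
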
A schema for the earlier example might look like this:
```
"file_declaration": {
    "envelopes" : [
        {
            "name": "GLOBAL",
            "by_header_footer": { "header": "^A010", "footer": "^A999" },
            "not_target": true,
            "columns": [ { "name": "carrier", "start_pos": 6, "length": 6, "line_pattern": "^A060" } ]
        },
        {
            "by_header_footer": { "header": "^V010", "footer": "^V999" },
            "columns": [
                { "name": "tracking_number", "start_pos": 6, "length": 15, "line_pattern": "^V020" },
                { "name": "delivery_date", "start_pos": 6, "length": 8, "line_pattern": "^V045" },
                ...
            ]
        },
        {
            "by_header_footer": { "header": "^Z001", "footer": "^Z999" },
            "not_target": true
        }
    ]
},
```

Note 1: we want a `carrier` field in the `FINAL_OUTPUT`, and the carrier information
is in the global header envelope, so we name the global header envelope `GLOBAL`; we can then
refer to the carrier name data with the XPath `../GLOBAL/carrier`.

Note 2: we need no data from the global footer envelope, so we can simply leave it "nameless"
(though technically that is not true, as omniparser will assign a random/unique name to it).

Note 3: the target envelope is "nameless", and it is the only envelope that does not have `"not_target": true`.

## Fixed-Length IDR Structure

See [here](./idr.md#fixed-length-mostly-txt) for more details.
