Reading "wide" `t-route` flow, velocity, and depth `csv` files has a high performance penalty

https://github.com/NOAA-OWP/ngen-cal/blob/2823b2c7cf0a92ade311d4b89b31b643f378e70c/python/ngen_cal/src/ngen/cal/ngen_hooks/ngen_output.py#L181

`ngen.cal` supports reading `t-route` output in a variety of formats (see #153). One supported format is `csv_output`. This format contains simulated flow, velocity, and depth values for each waterbody for each `t-route` _timestep_. For example:

```csv
,"(0, 'q')","(0, 'v')","(0, 'd')","(1, 'q')","(1, 'v')","(1, 'd')"
2420800,0.0,0.0,0.0,0.0,0.0,0.0
```
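As a rough illustration of how this layout maps onto `pandas`, the sketch below parses the two-timestep sample above and turns the stringified tuple headers into a `(timestep, variable)` column `MultiIndex`. The helper name `read_wide` and the level names are hypothetical, not part of `ngen.cal`.

```python
import ast
import io

import pandas as pd

# Two-timestep sample matching the example above.
SAMPLE = (
    ''',"(0, 'q')","(0, 'v')","(0, 'd')","(1, 'q')","(1, 'v')","(1, 'd')"\n'''
    "2420800,0.0,0.0,0.0,0.0,0.0,0.0\n"
)


def read_wide(buffer) -> pd.DataFrame:
    """Hypothetical helper: read a wide csv_output-style file."""
    df = pd.read_csv(buffer, index_col=0)
    # Each header is the string form of a (timestep, variable) tuple.
    tuples = [ast.literal_eval(col) for col in df.columns]
    df.columns = pd.MultiIndex.from_tuples(tuples, names=["timestep", "variable"])
    return df


wide = read_wide(io.StringIO(SAMPLE))
# Select only the flow ("q") columns; one column per timestep remains.
flow = wide.xs("q", axis=1, level="variable")
print(flow)
```

Every additional timestep adds three more columns to `wide`, which is what makes the file grow sideways.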

<details><summary><code>t-route</code> <code>csv_output</code> configuration</summary>

```yaml
output_parameters:
  csv_output:
    csv_output_folder: output/
```

</details>

Crucially, this means the _longer_ the simulation time, the _wider_ each row will be.

`csv` parsers like the `pandas` `c` parser or `arrow`'s `csv` parser optimize for reading _long_ `csv` files rather than _wide_ ones. Both of these parsers use a "chunking" approach where they allocate a buffer, read _rows_ from the `csv` file into the buffer until it's full, and then process the data. However, when a row is sufficiently long, it cannot fit fully into the buffer. Because of this and other implementation-specific details, parsing and deserializing these _wide_ `csv` files into a `pandas.DataFrame` can take on the order of minutes. In a local test, I found that a `csv` file with 3 years of 5-minute timestep data (315360 timesteps) took roughly 3.5 minutes to deserialize into a `pandas.DataFrame` on an M2 Pro MacBook.
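For a sense of scale, the row width of such a file can be worked out directly. This assumes 365-day years and three values (`q`, `v`, `d`) per timestep, as in the example header above.

```python
# Assumption: 365-day years, one t-route timestep every 5 minutes.
MINUTES_PER_YEAR = 365 * 24 * 60

timesteps = 3 * MINUTES_PER_YEAR // 5
# Three values (q, v, d) are written per timestep.
columns = timesteps * 3

print(timesteps)  # 315360
print(columns)    # 946080
```

So each row in that test file carries close to a million fields, excluding the index column.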

One potential solution to this is to disable `pd.read_csv`'s `low_memory` flag:

```python
df = pd.read_csv(filepath, index_col=0, engine="c", low_memory=False)
```

In local testing, it took ~9 seconds to read and deserialize the same file.
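To check whether the flag makes a difference on another machine, a rough timing sketch like the following may help. The helper names `write_wide_csv` and `timed_read` are hypothetical, and the synthetic file here is deliberately tiny, so the timings it prints are not comparable to the numbers above; increase `n_timesteps` to approximate a real run.

```python
import csv
import tempfile
import time
from pathlib import Path

import pandas as pd


def write_wide_csv(path, n_timesteps, n_rows=2):
    """Write a synthetic wide file shaped like the csv_output example."""
    header = [""] + [str((t, v)) for t in range(n_timesteps) for v in ("q", "v", "d")]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row_id in range(n_rows):
            writer.writerow([row_id] + [0.0] * (len(header) - 1))


def timed_read(path, low_memory):
    """Read the file with the C engine and return (DataFrame, seconds)."""
    start = time.perf_counter()
    df = pd.read_csv(path, index_col=0, engine="c", low_memory=low_memory)
    return df, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "wide.csv"
    write_wide_csv(path, n_timesteps=1000)  # small on purpose
    df_low, t_low = timed_read(path, low_memory=True)
    df_high, t_high = timed_read(path, low_memory=False)

print(f"low_memory=True:  {t_low:.3f}s")
print(f"low_memory=False: {t_high:.3f}s")
```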

For now, my general recommendation is to use `t-route`'s `stream_output` instead of `csv_output` if possible. `stream_output` still supports `csv`, but it uses a long format rather than a wide one, so it does not suffer the same performance penalty. See the most up-to-date examples of this on the [`t-route` repo](https://github.com/NOAA-OWP/t-route) or in #153.
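To make the wide versus long distinction concrete, here is a minimal sketch that reshapes the two-timestep sample from earlier into a long layout with one row per waterbody per timestep. The column names `feature_id` and `timestep` are illustrative assumptions only; this is not `stream_output`'s actual schema.

```python
import ast
import io

import pandas as pd

# Same two-timestep wide sample as in the example above.
SAMPLE = (
    ''',"(0, 'q')","(0, 'v')","(0, 'd')","(1, 'q')","(1, 'v')","(1, 'd')"\n'''
    "2420800,0.0,0.0,0.0,0.0,0.0,0.0\n"
)

wide = pd.read_csv(io.StringIO(SAMPLE), index_col=0)
wide.columns = pd.MultiIndex.from_tuples(
    [ast.literal_eval(col) for col in wide.columns],
    names=["timestep", "variable"],
)
wide.index.name = "feature_id"  # hypothetical name for the waterbody id

# Move the timestep level from the columns into the rows.
long = wide.stack(level="timestep").reset_index()
print(long)
```

In the long layout the number of columns stays fixed and only the number of rows grows with simulation length, which is the shape row-oriented `csv` parsers handle well.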
