Reading "wide" t-route flow velocity depth csv's has high performance penalty #204

@aaraney

Description

df = pd.read_csv(filepath, index_col=0)

ngen.cal supports reading t-route output in a variety of formats (see #153). One supported format is csv_output. This format contains simulated flow, velocity, and depth values for each waterbody for each t-route timestep. For example:

,"(0, 'q')","(0, 'v')","(0, 'd')","(1, 'q')","(1, 'v')","(1, 'd')"
2420800,0.0,0.0,0.0,0.0,0.0,0.0
t-route csv_output configuration
output_parameters:
  csv_output:
    csv_output_folder: output/

Crucially, each timestep adds three columns (q, v, and d), so the longer the simulation, the wider each row will be.
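To make the layout concrete, here is a minimal sketch that reads the two-timestep sample above and turns the stringified tuple column labels into a (timestep, variable) MultiIndex. The helper name `parse_wide_columns` is hypothetical and is not part of ngen.cal or t-route; it only illustrates how the column count grows as 3 x the number of timesteps.

```python
import ast
import io

import pandas as pd

# The sample from above: one row per waterbody, three columns (q, v, d) per timestep.
WIDE_CSV = ""","(0, 'q')","(0, 'v')","(0, 'd')","(1, 'q')","(1, 'v')","(1, 'd')"
2420800,0.0,0.0,0.0,0.0,0.0,0.0
"""


def parse_wide_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: replace labels like "(0, 'q')" with a MultiIndex."""
    # Each label is the string form of a (timestep, variable) tuple.
    tuples = [ast.literal_eval(label) for label in df.columns]
    out = df.copy()
    out.columns = pd.MultiIndex.from_tuples(tuples, names=["timestep", "variable"])
    return out


wide = parse_wide_columns(pd.read_csv(io.StringIO(WIDE_CSV), index_col=0))

# Select only the flow ("q") columns; what remains is one column per timestep.
flow = wide.xs("q", axis=1, level="variable")

n_timesteps = flow.shape[1]
print(f"{n_timesteps} timesteps -> {wide.shape[1]} columns per row")
```

With 315360 timesteps the same layout yields 946080 columns per row, which is the regime where the parser struggles.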

csv parsers like pandas' C parser or Arrow's csv parser optimize for reading long csv files rather than wide csv files. Both of these parsers use a "chunking" approach where they allocate a buffer, read rows from the csv file into the buffer until it is full, and then process the data. However, when a row is sufficiently long, it cannot fit fully into the buffer. Because of this and other implementation-specific details, parsing and deserializing these wide csv files into a pandas.DataFrame can take on the order of minutes. In a local test I found that a csv file with 3 years of 5-minute timestep data (315360 timesteps) took roughly 3.5 minutes to deserialize into a pandas DataFrame on an M2 Pro MacBook.

One potential solution to this is to disable pd.read_csv's low_memory flag:

df = pd.read_csv(filepath, index_col=0, engine="c", low_memory=False)

In local testing it took ~9 seconds to read and deserialize the same file.
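For anyone who wants to try this on their own machine, the comparison can be sketched roughly as follows. This is an illustrative setup, not the benchmark used for the numbers above: `make_wide_csv` and `timed_read` are hypothetical helpers, the synthetic file is far smaller than the 3-year case, and timings will vary by machine and pandas version.

```python
import io
import time

import numpy as np
import pandas as pd


def make_wide_csv(n_rows: int, n_timesteps: int) -> str:
    """Hypothetical helper: build an in-memory csv in the wide q/v/d layout."""
    # str((0, "q")) gives "(0, 'q')", matching the csv_output column labels.
    columns = [str((t, var)) for t in range(n_timesteps) for var in ("q", "v", "d")]
    data = np.zeros((n_rows, len(columns)))
    return pd.DataFrame(data, columns=columns).to_csv()


def timed_read(text: str, **kwargs) -> tuple[pd.DataFrame, float]:
    """Read the csv text and return the frame plus elapsed wall-clock seconds."""
    start = time.perf_counter()
    df = pd.read_csv(io.StringIO(text), index_col=0, **kwargs)
    return df, time.perf_counter() - start


# Increase n_timesteps to approach the sizes discussed in this issue.
csv_text = make_wide_csv(n_rows=2, n_timesteps=2_000)

default_df, default_s = timed_read(csv_text)
fast_df, fast_s = timed_read(csv_text, engine="c", low_memory=False)

print(f"default: {default_s:.3f}s  low_memory=False: {fast_s:.3f}s")
```

Both calls should produce identical frames; only the parsing strategy differs.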

For now, my general recommendation is to use t-route's stream_output instead of csv_output if possible. stream_output still supports csv, but uses a long format rather than a wide one, so it does not suffer the same performance penalty. See the most up-to-date examples of this on the t-route repo or in #153.

Labels

ngen.cal (Related to ngen.cal package), performance (Something is slow)