df = pd.read_csv(filepath, index_col=0)
ngen.cal supports reading t-route output in a variety of formats (see #153). One supported format is csv_output. This format contains simulated flow, velocity, and depth values for each waterbody for each t-route timestep. For example:
,"(0, 'q')","(0, 'v')","(0, 'd')","(1, 'q')","(1, 'v')","(1, 'd')"
2420800,0.0,0.0,0.0,0.0,0.0,0.0
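Once such a file is loaded, the stringified tuple headers can be turned into a proper MultiIndex for convenient selection. A minimal sketch, using the example row above as in-memory data (the parsing approach here is mine, not necessarily what ngen.cal does internally):

```python
import ast
import io

import pandas as pd

# In-memory sample mirroring the csv_output layout shown above: one row
# per waterbody, and one "(timestep, variable)" column for each
# timestep/variable pair (q = flow, v = velocity, d = depth).
csv_text = (
    ",\"(0, 'q')\",\"(0, 'v')\",\"(0, 'd')\",\"(1, 'q')\",\"(1, 'v')\",\"(1, 'd')\"\n"
    "2420800,0.0,0.0,0.0,0.0,0.0,0.0\n"
)

df = pd.read_csv(io.StringIO(csv_text), index_col=0)
# Column labels arrive as stringified tuples; ast.literal_eval recovers
# the (timestep, variable) pairs so they can become a MultiIndex.
df.columns = pd.MultiIndex.from_tuples(ast.literal_eval(c) for c in df.columns)
flow = df.xs("q", axis=1, level=1)  # flow at every timestep, per waterbody
```

This gives one sub-frame per variable, indexed by waterbody and timestep, without string matching on the raw column names.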
t-route csv_output configuration
output_parameters:
csv_output:
csv_output_folder: output/
Crucially, this means that the longer the simulation time, the wider each row will be.
csv parsers like pandas' c parser or arrow's csv parser are optimized for reading long csv files rather than wide ones. Both parsers use a "chunking" approach: they allocate a buffer, read rows from the csv file into the buffer until it is full, then process the data. However, when a row is sufficiently long it cannot fit fully into the buffer. Because of this and other implementation-specific details, parsing and deserializing these wide csv files into a pandas.DataFrame can take on the order of minutes. In a local test, I found that a csv file with 3 years of 5-minute-timestep data (315360 timesteps) took roughly 3.5 minutes to deserialize into a pandas dataframe on an M2 Pro MacBook.
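The row width implied by those numbers is easy to verify: with three variables recorded per waterbody per timestep, the 3-year file has close to a million value columns in every row.

```python
# Sanity check on the row width implied above: 5-minute timesteps over
# 3 years, with 3 variables (q, v, d) recorded per timestep.
timesteps = 3 * 365 * 24 * 12   # 315,360 timesteps
value_columns = timesteps * 3   # 946,080 value columns in every row
print(timesteps, value_columns)
```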
One potential solution to this is to disable pd.read_csv's low_memory flag:
df = pd.read_csv(filepath, index_col=0, engine="c", low_memory=False)
In local testing it took ~9 seconds to read and deserialize the same file.
For now, my general recommendation is to use t-route's stream_output instead of csv_output if possible. stream_output still supports csv, but writes a long format rather than a wide one, which does not suffer the same performance penalty. See the most up-to-date examples of this on the t-route repo or in #153.
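To illustrate why the long format avoids the penalty, here is a sketch with made-up long-format rows, one per (waterbody, timestep). The column names are illustrative assumptions on my part, not stream_output's actual schema; check the t-route repo for current examples.

```python
import io

import pandas as pd

# Hypothetical long-format rows in the spirit of stream_output's csv
# writer: many short rows of fixed width instead of one very wide row
# per waterbody. Column names here are illustrative only.
long_csv = io.StringIO(
    "featureID,timestep,q,v,d\n"
    "2420800,0,0.0,0.0,0.0\n"
    "2420800,1,0.1,0.0,0.0\n"
    "2420802,0,0.2,0.0,0.0\n"
    "2420802,1,0.3,0.0,0.0\n"
)
df = pd.read_csv(long_csv)  # row width stays constant as simulations grow

# If a wide (waterbody x timestep) layout is still needed downstream,
# pivot after the fast long-format parse:
wide_q = df.pivot(index="featureID", columns="timestep", values="q")
```

The expensive wide layout is then produced in memory by pandas, rather than paid for at parse time on every read.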
(Code reference: ngen-cal/python/ngen_cal/src/ngen/cal/ngen_hooks/ngen_output.py, line 181 in 2823b2c)