Binary storage and serialization #227

Guillaume227 · 2023-02-07T08:50:59Z

I am only at the browse-the-doc stage. I see three file formats are supported, all text-based (csv, csv2, json). In addition there is also string-based serialization.
Has there been no need to write in a binary format directly? I would think that would save quite a bit of disk space and parsing time when handling large data sets (there seems to be a financial industry background to this, and I have heard it matters there). It also matters when the exact representation of floating-point values is a concern (float->string->float injects noise in the process).

Was there any thought given to this already? Any major obstacles?
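
To make the float->string->float concern concrete, here is a minimal sketch in plain standard C++; it does not use the DataFrame API, and the helper names are made up for illustration. It suggests that the noise comes from printing too few digits: with the default stream precision of 6 significant digits the value changes, while printing `max_digits10` digits (17 for a typical double) or copying the raw bytes restores it exactly.

```cpp
#include <cassert>
#include <cstring>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>

// Format a double as text with `digits` significant digits, then parse it back.
inline double roundtrip_text(double value, int digits) {
    std::ostringstream out;
    out << std::setprecision(digits) << value;
    return std::stod(out.str());
}

// Copy the raw bytes of a double out and back in, as a binary format would.
inline double roundtrip_binary(double value) {
    unsigned char bytes[sizeof(double)];
    std::memcpy(bytes, &value, sizeof(double));
    double restored = 0.0;
    std::memcpy(&restored, bytes, sizeof(double));
    return restored;
}

// Hypothetical self-check, not part of any library.
inline void check_roundtrips() {
    const double value = 0.1 + 0.2;  // not exactly 0.3 in binary floating point
    // Six significant digits (the default stream precision) lose information.
    assert(roundtrip_text(value, 6) != value);
    // max_digits10 significant digits are enough to restore the same double.
    assert(roundtrip_text(value, std::numeric_limits<double>::max_digits10) == value);
    // A byte-for-byte copy is exact by construction.
    assert(roundtrip_binary(value) == value);
}
```

So a text format can be lossless too, but only at the cost of up to 17 digits per value, which is where the size and parsing-time argument for binary comes in.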

hosseinmoein · 2023-02-07T14:33:57Z

Maintainer

That's one of the things on my todo list. Binary storage could definitely come in handy in some cases.

But in general, binary is not always more efficient than strings, at least in size. Consider options market data. Option premiums are small numbers like 2, 0.5, 3.25, ... and option sizes are small too. These values can be stored in a few bytes as strings, but in binary they always take 8 bytes.
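
A rough way to check this size argument, as a sketch under stated assumptions: each text field costs its characters plus one delimiter byte (a comma or newline), and each binary value is stored as a fixed-width double. The helper names are hypothetical and no real file format is modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Bytes needed to store the fields as delimited text:
// the characters of each field plus one delimiter byte per field.
inline std::size_t text_bytes(const std::vector<std::string>& fields) {
    std::size_t total = 0;
    for (const std::string& field : fields) {
        total += field.size() + 1;
    }
    return total;
}

// Bytes needed to store `count` values as fixed-width doubles.
inline std::size_t binary_bytes(std::size_t count) {
    return count * sizeof(double);
}

// Hypothetical self-check using the premiums quoted in the post.
inline void check_sizes() {
    const std::vector<std::string> premiums{"2", "0.5", "3.25"};
    // 1 + 3 + 4 characters plus 3 delimiters.
    assert(text_bytes(premiums) == 11);
    assert(text_bytes(premiums) < binary_bytes(premiums.size()));
}
```

The comparison flips once values need many digits: a full-precision field such as `0.30000000000000004` already takes 19 characters, more than twice the width of a double.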

On a separate note, I am looking for contributors. If you or somebody you know wants to contribute and, for example, implement the binary format, please let me know.

0 replies

Guillaume227 · 2023-02-07T21:52:13Z

Author

I might be up for contributing on that topic. I still need to play around with the library to understand what I am getting myself into.

By the way, have you shared your todo list or a roadmap?

0 replies

hosseinmoein · 2023-02-08T14:09:46Z

Maintainer

I don't have an official todo list. But these are the things I am thinking about, and I have no idea when or if I will find time to do them:

1. Add more algos, especially more ML and clustering algos
2. Add more I/O formats like binary, Parquet, Markdown, ...
3. Add more benchmarks against DataFrames in other languages, like Rust's Polars
4. Simplify the interfaces. I am sure there are ways to simplify how we specify types in DataFrame interfaces.
5. Better organize the source code files

1 reply

Guillaume227 Feb 11, 2023
Author

Regarding point 2, I/O formats: I see the io_format enum already mentions hdf5 (not implemented). What's your vision on hdf5 compatibility? Maybe not the priority it used to be when you listed it in that enum?

Regarding point 4, Simplify the interfaces: I notice that hello_world.cc is unnecessarily verbose in its calls to .load_column<T>: template type deduction is already possible given the T&& argument. It's a minor thing, but maybe worth updating to make it more appealing? get_column is another story.

Also, is maintaining C++17 backward compatibility a hard constraint? Or is reliance on C++20 features acceptable?

hosseinmoein · 2023-02-11T15:49:32Z

Maintainer

Re/ HDF5, you are correct. I put them there a few years ago as placeholders. I think at this time Parquet would be the highest priority, since it would make DataFrame compatible with popular packages like Arrow and Hadoop. One requirement that I have kept since the beginning of developing this package is that it should be self-contained. That means DataFrame should not have any dependencies on other libraries except the STL. So, if someone can write Parquet read/write routines from scratch, that would be great.

Re/ 4. yes there are places we can look to simplify the interface. load_column() is an interesting interface. You can already call it without specifying the type -- the compiler already figures it out. But you are correct that inside the code for load_column() you can look at the parameter type passed to it and not depend on T. Although, it still needs to be a template member function to generate separate code for different types.
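
The deduction point can be illustrated with a toy class. This is only a sketch: `ToyFrame` is hypothetical and is not DataFrame's implementation, and its `load_column` and `get_column` signatures are assumptions loosely modelled on what is described in this thread. It shows why the argument is enough for the compiler to work out `T` in `load_column`, while `get_column` has nothing to deduce from and needs the type spelled out.

```cpp
#include <any>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a DataFrame-like container (C++17 or later).
class ToyFrame {
public:
    // T is deduced from the vector argument, so callers can omit it.
    template <typename T>
    std::size_t load_column(const char* name, std::vector<T>&& data) {
        const std::size_t count = data.size();
        columns_[name] = std::move(data);
        return count;
    }

    // T appears only in the return type, so it cannot be deduced
    // and must be given explicitly at the call site.
    template <typename T>
    const std::vector<T>& get_column(const char* name) const {
        return std::any_cast<const std::vector<T>&>(columns_.at(name));
    }

private:
    std::map<std::string, std::any> columns_;
};

// Hypothetical self-check showing both call styles.
inline void check_deduction() {
    ToyFrame frame;
    // No explicit type: T = double is deduced from the argument.
    assert(frame.load_column("price", std::vector<double>{2.0, 0.5, 3.25}) == 3);
    // A braced list carries no type of its own, so here T must be written out.
    assert(frame.load_column<double>("size", {1.0, 2.0}) == 2);
    assert(frame.get_column<double>("price")[2] == 3.25);
}
```

Both members still have to be templates, as noted above, because separate code is generated for each column type.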

I have no restriction to stay compatible with C++17. C++20 is just fine. I just never had time to get around to incorporating C++20 upgrades.

0 replies

hosseinmoein · 2024-07-14T13:53:34Z

Maintainer

Reading/writing in binary format is now implemented in DataFrame.

3 replies

Zheka17 Jul 23, 2024

Parquet would significantly broaden DataFrame's usability as part of an integrated analytical pipeline.
The self-imposed 'STL-only' requirement seems too strict; header-only libs would not be such a hurdle for a user.

On a different note, Howard Hinnant's datetime library is now part of the STL. And it is very efficient and mighty.
Also, I personally find it somewhat inconvenient that the week starts from Sun=1, Mon=2 (which is an old Win libs convention), rather than Sun=0 and Mon=1, which seems more natural and easier to use.

hosseinmoein Jul 23, 2024
Maintainer

I know Parquet is very cool.
But STL-only has a lot of benefits:

- You are only responsible for your own bugs.
- It makes building, deploying, and versioning a lot simpler.
- It makes backward compatibility simpler.

But I am willing to make an exception for Parquet. I have to find the time or a contributor.

hosseinmoein Jul 23, 2024
Maintainer

Re/ datetime, I will look at Hinnant's library.

I believe Sunday is by all conventions the first day of the week, and making it 1 makes it consistent with months like Jan.
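
For anyone who needs to move between the numberings mentioned here, a small sketch follows. It assumes the Sun=1 ... Sat=7 scheme described in this thread; the function names are hypothetical. The two targets are the C convention (`tm_wday`, Sun=0 ... Sat=6) and the ISO 8601 convention (Mon=1 ... Sun=7). As far as I know, C++20's `std::chrono::weekday` exposes the same two numberings through `c_encoding()` and `iso_encoding()`.

```cpp
#include <cassert>

// Convert a Sun=1 ... Sat=7 weekday to the C convention (Sun=0 ... Sat=6).
constexpr int sunday_one_to_c(int day) {
    return day - 1;
}

// Convert a Sun=1 ... Sat=7 weekday to the ISO 8601 convention (Mon=1 ... Sun=7).
constexpr int sunday_one_to_iso(int day) {
    return day == 1 ? 7 : day - 1;
}

// Compile-time checks on a few known days.
static_assert(sunday_one_to_c(1) == 0, "Sunday is 0 in the C convention");
static_assert(sunday_one_to_c(2) == 1, "Monday is 1 in the C convention");
static_assert(sunday_one_to_iso(1) == 7, "Sunday is 7 in ISO 8601");
static_assert(sunday_one_to_iso(2) == 1, "Monday is 1 in ISO 8601");

// Hypothetical self-check covering the end of the week.
inline void check_weekdays() {
    assert(sunday_one_to_c(7) == 6);    // Saturday
    assert(sunday_one_to_iso(7) == 6);  // Saturday
}
```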

Zheka17 · 2024-07-25T21:22:08Z

HH's library became part of ..but he does have a very useful page with 'the best' low-level algorithms

https://en.cppreference.com/w/cpp/chrono/weekday...

0 replies
