Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add option to set header row #453

Merged
merged 15 commits into from
Oct 8, 2024

Conversation

PrettyWood
Copy link
Contributor

@PrettyWood PrettyWood commented Jul 20, 2024

We have an issue ToucanToco/fastexcel#209 that has been getting some popularity
People want to be able to set the "header row" knowing where it starts and the fact that calamine skips by default empty rows at the beginning is not always wanted.

I suggest we add an option to be able to change the default behavior directly in calamine. It will be a bit less performant for the ones who enable this option since the whole vec of cells might be shifted but has no impact on people who keep the default behavior.

Note: I decided to tackle only xlsx files for now because

  1. I don't know if this will be accepted
  2. I don't know if the behavior is the same for other files
  3. I don't know if the "XlsxOptions" is something we want to generalize as an associated type type Option in the Reader trait

related to #147 (comment)

@dimastbk
Copy link
Contributor

I don't know if the behavior is the same for other files

Yes, the same. See first four tests https://github.com/dimastbk/python-calamine/blob/master/tests/test_base.py#L11

src/xlsx/mod.rs Outdated
#[derive(Debug, Default)]
pub struct XlsxOptions {
/// By default, calamine skips empty rows until a nonempty row is found
pub keep_first_empty_rows: bool,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a more generic approach would be to have the option to set header row?
By default None means that it'll skip all empty rows (like today) and then we could have with_header_row(self, row: u32)?
I understand that >90% of use cases are the first row but it still feels better and not necessarily more complicated to give the option to the users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect will do that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -1786,3 +1787,39 @@ fn test_ref_xlsb() {
]
);
}

#[rstest]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't know about rstest. It seems neat.

@tafia
Copy link
Owner

tafia commented Sep 16, 2024

EDIT: SORRY for the late answer!!

I don't know if this will be accepted

I don't see anything wrong here. I'd like the comments to reflect where/how it applies if possible, in particular if there is no option for other files.

I don't know if the behavior is the same for other files

While I'd love all files to be supported equally, I understand people focusing on their own issues first. If you don't mind trying to check other files i'd be great, else it is ok. Thanks anyway for your time!

I don't know if the "XlsxOptions" is something we want to generalize as an associated type type Option in the Reader trait

I don't really know too. I suppose it could be something useful in more contexts but I haven't felt a strong need either.

@PrettyWood PrettyWood changed the title feat: add option to keep first empty rows for xlsx feat: add option to set header row Sep 17, 2024
@PrettyWood
Copy link
Contributor Author

I'll try to tackle other file formats by the end of the week

@PrettyWood
Copy link
Contributor Author

@tafia I added the support for ODS. I reckon having the option at workbook level is not appropriate. We could want to open a sheet with one header row and another sheet with another header row.
For me we should have

/// Options for reading data
pub struct ReadDataOptions {
    /// header row
    pub header_row: Option<u32>,
    ...
}

pub trait Reader<RS>: Sized
where
    RS: Read + Seek,
{
...
    fn worksheet_range(&mut self, name: &str, options: ReadDataOptions) -> Result<Range<Data>, Self::Error>;
    ...
 
    fn worksheet_range_at(&mut self, n: usize, options: ReadDataOptions) -> Option<Result<Range<Data>, Self::Error>> {
        let name = self.sheet_names().get(n)?.to_string();
        Some(self.worksheet_range(&name, options))
    }
...
}

pub trait ReaderRef<RS>: Reader<RS>
where
    RS: Read + Seek,
{
    fn worksheet_range_ref<'a>(&'a mut self, name: &str, options: ReadDataOptions)
        -> Result<Range<DataRef<'a>>, Self::Error>;

    fn worksheet_range_at_ref(&mut self, n: usize, options: ReadDataOptions) -> Option<Result<Range<DataRef>, Self::Error>> {
        let name = self.sheet_names().get(n)?.to_string();
        Some(self.worksheet_range_ref(&name, options))
    }
}

The issue is that it would break the API
I could add new method worksheet_range_with_option for example but it sounds quite heavy.
WDYT?

@ACronje
Copy link

ACronje commented Sep 20, 2024

@PrettyWood would your solution work for this use case?

image

@PrettyWood
Copy link
Contributor Author

@ACronje yup!

@ACronje
Copy link

ACronje commented Sep 25, 2024

Even though my Rust knowledge is pretty limited I would be willing to help get this over the line @PrettyWood @tafia .

We really need this!

@PrettyWood
Copy link
Contributor Author

The rust part is not really a problem and I can finish the job easily, it's more a problem of API ;)
I need @tafia opinion on #453 (comment)

@ACronje
Copy link

ACronje commented Sep 25, 2024

@tafia I added the support for ODS. I reckon having the option at workbook level is not appropriate. We could want to open a sheet with one header row and another sheet with another header row. For me we should have

/// Options for reading data
pub struct ReadDataOptions {
    /// header row
    pub header_row: Option<u32>,
    ...
}

pub trait Reader<RS>: Sized
where
    RS: Read + Seek,
{
...
    fn worksheet_range(&mut self, name: &str, options: ReadDataOptions) -> Result<Range<Data>, Self::Error>;
    ...
 
    fn worksheet_range_at(&mut self, n: usize, options: ReadDataOptions) -> Option<Result<Range<Data>, Self::Error>> {
        let name = self.sheet_names().get(n)?.to_string();
        Some(self.worksheet_range(&name, options))
    }
...
}

pub trait ReaderRef<RS>: Reader<RS>
where
    RS: Read + Seek,
{
    fn worksheet_range_ref<'a>(&'a mut self, name: &str, options: ReadDataOptions)
        -> Result<Range<DataRef<'a>>, Self::Error>;

    fn worksheet_range_at_ref(&mut self, n: usize, options: ReadDataOptions) -> Option<Result<Range<DataRef>, Self::Error>> {
        let name = self.sheet_names().get(n)?.to_string();
        Some(self.worksheet_range_ref(&name, options))
    }
}

The issue is that it would break the API I could add new method worksheet_range_with_option for example but it sounds quite heavy. WDYT?

@PrettyWood since adding options like you've suggested above would be a breaking change, and since adding a new method like worksheet_range_with_option would inevitably require you to add worksheet_formula_with_options etc. how about rather passing the sheet-specific parsing options to with_options?

I think this could be useful for the other file types that do not support lazy iteration (they are loading the sheets into memory at instantiation time I believe? see: ods) and therefore might decide to use this sheet-specific config to, for example, be more efficient with loading sheet data (e.g. only keeping in memory the sheets defined in the config)

@PrettyWood
Copy link
Contributor Author

I first went with the with_options approach (it's still the case in this PR) but I reckon it's wrong to have sheet options (like header_row) that are meant for a single sheet to be set at workbook level.

@ACronje
Copy link

ACronje commented Sep 25, 2024

I understand what you're saying and I saw your implementation. I explained why I think it's acceptable for the sheet options to be at the workbook level.

@PrettyWood
Copy link
Contributor Author

PrettyWood commented Sep 25, 2024

Thanks for sharing your opinion. I'll finish implementing all the extensions and let the with_options for now.
If it's good enough it will be mergeable directly. If not it won't cost too much to update

EDIT: it is now implemented for all extensions. I went for an associated type if we want to have that at reader level. I still reckon we should have options at workbook level and options at sheet level. And header_row should be at sheet level IMO.

@severinh
Copy link

severinh commented Oct 1, 2024

Huge thanks to @PrettyWood for this improvement! Our team depends on this PR to fully adopt calamine.

@tafia Even though you've already approved the PR, I believe @PrettyWood is still waiting for you to take another look at the post-approval changes before merging. Any chance you could take a look in the coming days? Or can he go ahead and merge? Thanks a lot!

src/auto.rs Outdated Show resolved Hide resolved
@tafia
Copy link
Owner

tafia commented Oct 3, 2024

Thanks a lot for the extra effort! LGTM, just one tiny refactor I believe makes the intent clearer.

@PrettyWood PrettyWood force-pushed the feat/keep-first-empty-rows branch 3 times, most recently from d481deb to 3a5543d Compare October 5, 2024 09:29
@PrettyWood PrettyWood force-pushed the feat/keep-first-empty-rows branch 2 times, most recently from a7e41ef to 625fb73 Compare October 6, 2024 20:06
@tafia
Copy link
Owner

tafia commented Oct 7, 2024

@PrettyWood let me know once you think you're done. I'm happy to merge it as it is already.
Thanks again!

@PrettyWood
Copy link
Contributor Author

PrettyWood commented Oct 7, 2024

Ok @tafia sorry for those last changes but I prefer switching to an enum. It's more explicit and I plan on adding skip_rows and n_rows to options after. So I guess it will make sense to also have None for the header row to say "I don't want any header at all". Right now there is no difference between "the first row of the range" and "the header row" but it will make sense after if some kind of pagination is added

If that's ok for you this PR is ready to merge on my end
I'll open new PRs soon

@tafia tafia merged commit 8efe95d into tafia:master Oct 8, 2024
4 checks passed
@tafia
Copy link
Owner

tafia commented Oct 8, 2024

Awesome thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants