
Trained model for febrl5M #1068 #1072

Open · wants to merge 7 commits into main

Conversation

Arjun-Zingg (Contributor)

No description provided.

SauronShepherd commented Mar 19, 2025

Looks like there is a mismatch between the configuration file (examples/febrl5M/config.json) and its corresponding input data:

  • The file febrl_data_5M_part1.csv.gz includes a header row, but the configuration file sets the header flag to false.
  • The dataset contains 14 fields, but the fieldDefinition section of the configuration file only defines 11 fields.

This mismatch causes the Spark CSV parser to throw 5M internal exceptions (one per row, i.e. the entire dataset), severely impacting load performance: around 8 minutes on my laptop, using only 3 of my 16 cores.
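
For anyone who wants to reproduce the effect outside Zingg, here is a minimal sketch, assuming Spark 3.x defaults; the path and column names are placeholders, not the actual febrl5M schema:

```python
# Hypothetical repro: an 11-column schema against a headered, 14-column
# file. In the default PERMISSIVE mode, every row raises a
# BadRecordException internally; FailureSafeParser catches it and emits
# a null-padded row, so nothing surfaces to the user, but 5M exception
# objects still get built.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField(f"c{i}", StringType()) for i in range(11)])

df = spark.read.csv(
    "febrl_data_5M_part1.csv.gz",  # placeholder path
    schema=schema,
    header=False,        # wrong: the file actually has a header row
    mode="PERMISSIVE",   # Spark's default: bad rows are silently padded
)
df.count()  # slow: the internal exception path dominates the load
```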

sonalgoyal (Member)

@SauronShepherd thank you for looking into this. This is a great find! How did you see the CSV errors - I never saw them on my runs :-(

One thing to add here is that it is absolutely ok for the field definitions to have fewer fields than the actual data: fieldDefs only contains the fields of interest for matching. For CSV, though, the schema is defined as part of the data attribute, and I see that it also has only 11 columns, not 14. Which is wrong.

Will update the PR and test again.
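
For reference, a sketch of what the corrected data attribute could look like. This is illustrative only: the first 11 column names echo the smaller febrl example, and the last three are placeholders, since the 5M file's actual header names are not shown in this thread.

```json
"data": [{
    "name": "febrl5M",
    "format": "csv",
    "props": {
        "location": "examples/febrl5M/febrl_data_5M_part1.csv.gz",
        "delimiter": ",",
        "header": true
    },
    "schema": "id string, fname string, lname string, stNo string, add1 string, add2 string, city string, areacode string, state string, dob string, ssn string, extra1 string, extra2 string, extra3 string"
}]
```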

SauronShepherd commented Mar 19, 2025

> I never saw them on my runs

That's why I called them "internal"!
Last summer, I noticed how costly throwing internal exceptions can be—exceptions that are not shown to the user and are simply ignored by Spark. I wrote an article about this. It took an unusually long time to load just 5 million rows, which caught my attention. When I checked what the executor threads were doing and saw them creating new exceptions as if there were no tomorrow, it became crystal clear to me that something was going wrong.

> One thing to add here is that it is absolutely ok for the field definitions to have fewer fields than the actual data.

Yep, I already thought of that. Then it's fine, but it comes with a cost you need to be aware of—that's all.

I've just started analyzing the project, but I've already noticed a few other things. It might be better to generate a report with everything I've found and send it to you, rather than opening new issues or adding to existing ones.

One other thing related to this: 4000 partitions to handle just 5 million rows is way too many, and is most likely having a negative impact (more partitions don't necessarily mean better performance in Spark). Additionally, when I analyzed the OOM issue in the GraphFrames library, I noticed that the more partitions there were, the more iterations the algorithm took to converge: the ConnectedComponents unit test that threw an OOM error with 10 partitions ran smoothly and faster with just 4. I haven't tested it yet in this case, but it would be worth trying, don't you think?
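
As a quick way to see the scheduling overhead, a toy sketch on synthetic data (not Zingg's pipeline; timings will vary by machine):

```python
# Toy benchmark: the same shuffle over 5M rows at 4000 vs 16 partitions.
# On a laptop, the per-task overhead of 4000 tiny tasks usually swamps
# any parallelism gain.
import time

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(5_000_000).withColumn("block", F.col("id") % 50_000)

for n in (4000, 16):
    t0 = time.time()
    df.repartition(n, "block").groupBy("block").count().count()
    print(f"{n} partitions: {time.time() - t0:.1f}s")
```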

sonalgoyal (Member)

Awesome @SauronShepherd, we should be able to get rid of the internal parsing exceptions if we define the schema correctly in the data attribute of our config, even if the field definition has fewer fields. Right?

I would love to see how your experiments with numPartitions go. This is the reason this hook is being exposed, and we use it in a couple of places. The blocking model hashes the input data into different blocks based on the labels, and we use the hash values to join within a block. It is not uncommon to have blocks of varied sizes; we try to achieve a block size of 100, but it doesn't always happen. The partition number is used in this join.
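
To make the shape of that join concrete, a simplified sketch (not Zingg's actual implementation; the hash of a single field stands in for the learned blocking function):

```python
# Rows are hashed into blocks, and candidate pairs are generated only
# within a block; numPartitions controls the shuffle of exactly this
# self-join.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
num_partitions = 16  # the knob under discussion

df = spark.createDataFrame(
    [(1, "thomas"), (2, "thomas"), (3, "maria")], ["id", "fname"]
)

# naive blocking key for illustration: hash of fname modulo 100
blocked = df.withColumn("z_hash", F.hash("fname") % 100)

pairs = (
    blocked.alias("a")
    .repartition(num_partitions, "z_hash")
    .join(blocked.alias("b"), "z_hash")
    .where(F.col("a.id") < F.col("b.id"))  # each unordered pair once
)
pairs.show()
```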

Hope this gives you enough ammunition to discover more 😊

Arjun-Zingg (Contributor, Author) commented Mar 20, 2025

Hi @SauronShepherd, the config and model have been updated in this PR in the 'config correction' and 'new model' commits respectively. You'll have to check out the older commit to get the previous (original) model. Thanks again.
