
Trained model for febrl5M #1068 #1072

Open · wants to merge 7 commits into main

Conversation

Arjun-Zingg (Contributor)

No description provided.

SauronShepherd commented Mar 19, 2025

Looks like there is a mismatch between the configuration file (examples/febrl5M/config.json) and its corresponding input data:

  • The file febrl_data_5M_part1.csv.gz includes a header row, but the configuration file sets the header flag to false.
  • The dataset contains 14 fields, but the fieldDefinition section of the configuration file only defines 11 fields.

This mismatch causes the Spark CSV parser to throw 5M internal exceptions (one per row, i.e. the entire dataset), severely impacting load performance: around 8 minutes on my laptop, using only 3 of my 16 cores.
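
For anyone who wants to reproduce the effect outside Zingg, here is a minimal sketch, assuming Spark 3.x defaults; the path and column names are placeholders, not the actual febrl5M schema:

```python
# Hypothetical repro: an 11-column schema against a headered, 14-column
# file. In the default PERMISSIVE mode, every row raises a
# BadRecordException internally; FailureSafeParser catches it and emits
# a null-padded row, so nothing surfaces to the user, but 5M exception
# objects still get built.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([StructField(f"c{i}", StringType()) for i in range(11)])

df = spark.read.csv(
    "febrl_data_5M_part1.csv.gz",  # placeholder path
    schema=schema,
    header=False,        # wrong: the file actually has a header row
    mode="PERMISSIVE",   # Spark's default: bad rows are silently padded
)
df.count()  # slow: the internal exception path dominates the load
```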

sonalgoyal (Member)

@SauronShepherd thank you for looking into this. This is a great find! How did you see the CSV errors - I never saw them on my runs :-(

One thing to add here is that it is absolutely ok for the field definitions to have fewer fields than the actual data: fieldDefs only contains the fields of interest for matching. For CSV, though, the schema is defined as part of the data attribute, and I see that it also has only 11 columns, not 14. Which is wrong.

Will update the PR and test again.
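
For reference, a sketch of what the corrected data attribute could look like. This is illustrative only: the first 11 column names echo the smaller febrl example, and the last three are placeholders, since the 5M file's actual header names are not shown in this thread.

```json
"data": [{
    "name": "febrl5M",
    "format": "csv",
    "props": {
        "location": "examples/febrl5M/febrl_data_5M_part1.csv.gz",
        "delimiter": ",",
        "header": true
    },
    "schema": "id string, fname string, lname string, stNo string, add1 string, add2 string, city string, areacode string, state string, dob string, ssn string, extra1 string, extra2 string, extra3 string"
}]
```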

SauronShepherd commented Mar 19, 2025

> I never saw them on my runs

That's why I called them "internal"!
Last summer, I noticed how costly throwing internal exceptions can be—exceptions that are not shown to the user and are simply ignored by Spark. I wrote an article about this. It took an unusually long time to load just 5 million rows, which caught my attention. When I checked what the executor threads were doing and saw them creating new exceptions as if there were no tomorrow, it became crystal clear to me that something was going wrong.

> One thing to add here is that it is absolutely ok for the field definitions to have fewer fields than the actual data.

Yep, I already thought of that. Then it's fine, but it comes with a cost you need to be aware of—that's all.

I've just started analyzing the project, but I've already noticed a few other things. It might be better to generate a report with everything I've found and send it to you, rather than opening new issues or adding to existing ones.

One other thing related to this: 4000 partitions to handle just 5 million rows is way too many, and is most likely having a negative impact (more partitions don't necessarily mean better performance in Spark). Additionally, when I analyzed the OOM issue in the GraphFrames library, I noticed that the more partitions there were, the more iterations the algorithm took to converge: the ConnectedComponents unit test that threw an OOM error with 10 partitions ran smoothly and faster with just 4. I haven't tested it yet in this case, but it would be worth trying, don't you think?
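
As a quick way to see the scheduling overhead, a toy sketch on synthetic data (not Zingg's pipeline; timings will vary by machine):

```python
# Toy benchmark: the same shuffle over 5M rows at 4000 vs 16 partitions.
# On a laptop, the per-task overhead of 4000 tiny tasks usually swamps
# any parallelism gain.
import time

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(5_000_000).withColumn("block", F.col("id") % 50_000)

for n in (4000, 16):
    t0 = time.time()
    df.repartition(n, "block").groupBy("block").count().count()
    print(f"{n} partitions: {time.time() - t0:.1f}s")
```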

sonalgoyal (Member)

Awesome @SauronShepherd, we should be able to get rid of the internal parsing exceptions if we define the schema correctly in the data attribute of our config, even if the field definition has fewer fields. Right?

I would love to see how your experiments with numPartitions go. This is the reason this hook is being exposed, and we use it in a couple of places. The blocking model hashes the input data into different blocks based on the labels, and we use the hash values to join within a block. It is not uncommon to have blocks of varied sizes; we try to achieve a block size of 100, but it doesn't always happen. The partition number is used in this join.
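
To make the shape of that join concrete, a simplified sketch (not Zingg's actual implementation; the hash of a single field stands in for the learned blocking function):

```python
# Rows are hashed into blocks, and candidate pairs are generated only
# within a block; numPartitions controls the shuffle of exactly this
# self-join.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
num_partitions = 16  # the knob under discussion

df = spark.createDataFrame(
    [(1, "thomas"), (2, "thomas"), (3, "maria")], ["id", "fname"]
)

# naive blocking key for illustration: hash of fname modulo 100
blocked = df.withColumn("z_hash", F.hash("fname") % 100)

pairs = (
    blocked.alias("a")
    .repartition(num_partitions, "z_hash")
    .join(blocked.alias("b"), "z_hash")
    .where(F.col("a.id") < F.col("b.id"))  # each unordered pair once
)
pairs.show()
```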

Hope this gives you enough ammunition to discover more 😊

Arjun-Zingg (Contributor, Author) commented Mar 20, 2025

Hi @SauronShepherd, the config and model have been updated in this PR in the 'config correction' and 'new model' commits respectively. You'll have to check out the older commit to get the previous (original) model. Thanks again.
