Trained model for febrl5M #1068 #1072
Looks like there is a mismatch between the configuration file (examples/febrl5M/config.json) and its corresponding input data:
This mismatch causes the Spark CSV parser to throw an internal exception for each of the 5M rows (the entire dataset), severely impacting load performance: around 8 minutes on my laptop, using only 3 of my 16 cores.
@SauronShepherd thank you for looking into this. This is a great find! How did you see the CSV errors? I never saw them on my runs :-( One thing to add here: it is absolutely OK for the field definitions to have fewer fields than the actual data. fieldDefs contains only the fields of interest for matching. For CSV, the schema is defined as part of the data attribute. I see that it also has only 11 columns, not 14, which is wrong. Will update the PR and test again.
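A quick way to spot this kind of mismatch before handing the file to Spark is to compare the number of columns declared in the schema against the width of the actual rows. The sketch below uses only the Python standard library. The function name `column_count_mismatches`, the placeholder column names, and the synthetic sample rows are assumptions made for illustration; they are not taken from the febrl5M config or data.

```python
import csv
import io


def column_count_mismatches(declared_columns, csv_text, delimiter=","):
    """Return (line_number, actual_width) for every row whose width
    differs from the number of declared columns."""
    expected = len(declared_columns)
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    return [
        (line_no, len(row))
        for line_no, row in enumerate(reader, start=1)
        if len(row) != expected
    ]


# Hypothetical schema with 11 declared columns (placeholder names).
declared = [f"col{i}" for i in range(11)]

# Synthetic sample: 3 rows of 14 values each, standing in for the real data.
sample = "\n".join(
    ",".join(f"v{r}_{c}" for c in range(14)) for r in range(3)
)

mismatches = column_count_mismatches(declared, sample)
print(mismatches)  # [(1, 14), (2, 14), (3, 14)]
```

Running a check like this on the first few lines of the input is usually enough, since a schema that is three columns short will flag every row.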
That's why I called them "internal"!
Yep, I already thought of that. Then it's fine, but it comes with a cost you need to be aware of, that's all. I've just started analyzing the project, but I've noticed a few other things. It might be better to generate a report and send it to you with everything I've found rather than opening new issues or adding to existing ones. One other thing related to this: 4000 partitions to handle just 5 million rows is far too many and is most likely having a negative impact (more partitions don't necessarily mean better performance in Spark). Additionally, when I analyzed the OOM issue in the GraphFrames library, I noticed that the more partitions there were, the more iterations the algorithm took to converge. The ConnectedComponents unit test that threw an OOM error with 10 partitions ran smoothly and faster with just 4. I haven't tested it yet in this case, but it would be worth trying, don't you think?
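To put the partition count in perspective, the arithmetic can be sketched as below. The helper names are hypothetical. The default of 3 tasks per core follows the general guidance in the Spark tuning guide (2 to 3 tasks per CPU core) and is only a starting point, not a value tested against this dataset. The 16 cores are the laptop mentioned earlier in the thread.

```python
def rows_per_partition(total_rows, num_partitions):
    """Average number of rows each partition would hold."""
    return total_rows / num_partitions


def suggested_partitions(cores, tasks_per_core=3):
    """Rule-of-thumb partition count: a few tasks per available core."""
    return cores * tasks_per_core


# 5 million rows spread over 4000 partitions leaves very little work per task.
print(rows_per_partition(5_000_000, 4000))  # 1250.0

# Starting point for a 16-core machine under the assumed 3 tasks per core.
print(suggested_partitions(16))  # 48
```

At roughly 1250 rows per partition, task scheduling overhead is likely to outweigh the actual work, which fits the observation above. Any concrete value would still need to be measured on the real job.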
Awesome @SauronShepherd, we should be able to get rid of the internal parsing exceptions if we define the schema correctly in the [...]. I would love to see how your experiments with [...]. Hope this gives you enough ammunition to discover more 😊
Hi @SauronShepherd, the config and model have been updated in this PR, in the commits 'config correction' and 'new model' respectively. You'll have to pick the older commit to get the previous (original) model. Thanks again.
No description provided.