Scalability issues #247
Conversation
Optimized by preparing an index of foreign values, plus an optimized way to check group relations.
Oh yeah, of course: since this PR is based on an update of the tableschema-py dependency, CI fails.
Thanks!
I've released [email protected]
with the code required by this PR.
Just a few minor change requests.
datapackage/group.py
Outdated
print('in %s: ' % resource.name)
if exception.multiple:
    for error in exception.errors:
        print(error)
This looks like a leftover debug print; should it raise instead?
Indeed, raising looks like the right thing to do.
datapackage/group.py
Outdated
    for error in exception.errors:
        print(error)
else:
    print(exception)
This looks like a leftover debug print; should it raise instead?
Indeed, raising looks like the right thing to do.
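The change the reviewers suggest amounts to re-raising instead of printing. A minimal sketch of the idea, with a stand-in `RelationError` class that only mimics the `multiple`/`errors` shape used in the snippet above (it is not the real tableschema exception):

```python
# Sketch of the suggested change: instead of printing validation errors
# for debugging, let them propagate to the caller by raising.
class RelationError(Exception):
    """Stand-in mimicking tableschema's RelationError shape."""
    def __init__(self, message, errors=None):
        super().__init__(message)
        self.multiple = errors is not None
        self.errors = errors or []

def check_resource(exception):
    # Before: print('in %s: ' % resource.name) and print each error.
    # After: simply re-raise so errors flow upstream and callers decide.
    raise exception

try:
    check_resource(RelationError('relation check failed',
                                 errors=[RelationError('bad foreign key')]))
except RelationError as exc:
    caught = exc
```

The caller can then inspect `caught.multiple` and `caught.errors` rather than losing the details to stdout.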
@@ -52,3 +53,24 @@ def read(self, limit=None, **options):
            if count == limit:
                break
        return rows

    def check_relations(self):
Could you please add a test for this method - https://coveralls.io/builds/26209815/source?filename=datapackage%2Fgroup.py#L62?
Our coverage dropped below its required threshold.
Yep almost done.
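The kind of test the reviewer asks for can be sketched against a stand-in `Group`; the real test would build a datapackage fixture with grouped resources and assert that `check_relations()` succeeds on valid data and raises on a broken foreign key. All names here (`Group`, `RelationError`, `country_id`) are assumed for illustration, not the actual library API:

```python
# Sketch of a unit test for a group-level check_relations() method.
class RelationError(Exception):
    pass

class Group:
    """Stand-in: holds resources (lists of rows) and the set of valid ids."""
    def __init__(self, resources, valid_ids):
        self._resources = resources
        self._valid_ids = valid_ids

    def check_relations(self):
        # Every foreign-key value in every member resource must exist.
        for rows in self._resources:
            for row in rows:
                if row['country_id'] not in self._valid_ids:
                    raise RelationError('unknown id %r' % row['country_id'])
        return True

def test_check_relations_valid():
    assert Group([[{'country_id': 1}]], {1, 2}).check_relations() is True

def test_check_relations_invalid():
    try:
        Group([[{'country_id': 9}]], {1, 2}).check_relations()
        raised = False
    except RelationError:
        raised = True
    assert raised

test_check_relations_valid()
test_check_relations_invalid()
```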
@roll I just removed my try/catch to let exceptions flow upstream. Maybe there is a policy to raise datapackage exceptions and not tableschema ones?
Thanks!
The exceptions are fine as they are, because they are actually the same objects:
datapackage.RelationError = tableschema.exceptions.RelationError
See the related PR in tableschema-py frictionlessdata/tableschema-py#254
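The aliasing described above is a simple re-export pattern: the downstream package binds the upstream exception class under its own name, so both names refer to the same object and either works in an `except` clause. A minimal sketch with stand-in classes (not the real modules):

```python
# Re-export pattern: one exception class, two names.
class RelationError(Exception):
    """Plays the role of tableschema.exceptions.RelationError."""
    pass

# The downstream package simply aliases the upstream class:
DatapackageRelationError = RelationError  # plays datapackage.exceptions.RelationError

try:
    raise RelationError('broken foreign key')
except DatapackageRelationError as exc:  # caught: it is the very same class
    caught = type(exc).__name__
```

So user code catching either name behaves identically, and no wrapping or translation layer is needed.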
I ran into scalability issues when trying to validate a datapackage containing one resource with 397201 lines that hold foreign keys.
I needed to split my 397k-line resource into many files to organize them by source.
I finally chose to use a shared schema and the group notion to split it into many resources.
Trying to validate all those lines surfaced several scalability issues in both the tableschema and datapackage libraries.
The first issue is about memory management.
Checking relations in a resource makes the object not only read the related resource but also hold its data in memory, since the data is kept as an object attribute.
As a consequence, memory usage grows when checking relations across a large number of resources.
I don't know why the relations data is kept on the object, so in this PR I proposed a new method, drop_relations, to clean the relations data.
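The memory issue and the proposed fix can be sketched as follows; all names here are assumed for illustration and the real validation logic is elided:

```python
# Sketch: checking relations caches the related rows on the object, so a
# drop_relations() method lets callers release that memory explicitly.
class Resource:
    def __init__(self, name, relation_rows):
        self.name = name
        self._relation_rows = relation_rows  # where relation data comes from
        self._relations_cache = None         # the attribute that grows memory

    def check_relations(self):
        # Reading the related resource keeps its rows as an attribute.
        if self._relations_cache is None:
            self._relations_cache = list(self._relation_rows)
        return True  # actual foreign-key validation elided

    def drop_relations(self):
        # Proposed method: free the cached relation data after checking.
        had_cache = self._relations_cache is not None
        self._relations_cache = None
        return had_cache

r = Resource('orders', [{'id': 1}, {'id': 2}])
r.check_relations()
freed = r.drop_relations()
```

Calling drop_relations() after validating each resource keeps the working set bounded instead of accumulating one cache per resource.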
Once the memory issue was solved, I hit a performance one: checking relations for my 1000+ resources holding 397k lines of data took 98m49.895s.
That's very long.
So I thought about two optimizations.
The first one is to avoid loading the relations for each resource that belongs to a group.
To enable this optimization I propose adding a check_relations method to the Group object.
Thus we load the relations data once and then use that in-memory data to validate all the resources belonging to that group.
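The group-level optimization can be sketched like this, with assumed names and a counter to show the relation data is loaded exactly once for the whole group:

```python
# Sketch: load relation data once per group, then validate every member
# resource against that shared in-memory copy.
class Group:
    def __init__(self, resources, relation_loader):
        self._resources = resources          # resources sharing one schema
        self._relation_loader = relation_loader

    def check_relations(self):
        relations = self._relation_loader()  # loaded exactly once per group
        return all(self._check_one(rows, relations)
                   for rows in self._resources)

    @staticmethod
    def _check_one(rows, relations):
        # Every foreign-key value in the rows must exist in the relation.
        return all(row['country_id'] in relations for row in rows)

load_count = 0
def load_relations():
    global load_count
    load_count += 1
    return {1, 2, 3}  # ids present in the referenced resource

group = Group([[{'country_id': 1}],
               [{'country_id': 2}, {'country_id': 3}]],
              load_relations)
ok = group.check_relations()
```

With per-resource checking the loader would run once per resource; here it runs once regardless of how many resources are in the group.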
Then a second optimization was proposed to tableschema-py in PR frictionlessdata/tableschema-py#254.
The idea is to pre-index the relations data by the values of the foreign keys. This index is called foreign_keys_values. It is then used to test whether a row references an existing value (a simple hash-map lookup).
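The pre-indexing idea reduces each per-row check from a scan of the relation rows to a constant-time lookup. A sketch under assumed names and data layout (the name foreign_keys_values comes from the description above; the real structure in tableschema-py may differ):

```python
# Build the index once: the referenced resource keyed by its FK field
# values, so each row check becomes a single hash lookup.
def build_foreign_keys_values(relation_rows, fields):
    return {tuple(row[f] for f in fields) for row in relation_rows}

def row_references_exist(row, fields, fk_values):
    # O(1) membership test instead of scanning every relation row.
    return tuple(row[f] for f in fields) in fk_values

countries = [{'id': 1, 'code': 'FR'}, {'id': 2, 'code': 'DE'}]
index = build_foreign_keys_values(countries, ['id'])

ok = row_references_exist({'country': 1}, ['country'], index)
bad = row_references_exist({'country': 9}, ['country'], index)
```

For R rows checked against a relation of N rows, this turns O(R·N) scanning into O(N) index construction plus O(R) lookups.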
To speed things up further, I also propose exposing the get_foreign_keys_values() method so that the Group object can pre-compute the index only once before using it to validate all the resources.
Using these optimizations brought the validation time down from 98m49.895s to 1m3.609s.
I've just realized that this PR is missing documentation updates.
Let's have reviewers confirm the approach before I update the docs.
@roll what do you think?