Scalability issues #247

Merged: 8 commits into frictionlessdata:master on Oct 21, 2019

Conversation

@paulgirard (Contributor)

See the related PR in tableschema-py: frictionlessdata/tableschema-py#254

I hit scalability issues when trying to validate a data package containing one resource with 397,201 rows that use foreign keys.
I needed to split my 397k-row resource into many files to organize them by source.
I finally chose to use the shared-schema and group notions to split it into many resources.
Trying to validate all those rows surfaced scalability issues in both the tableschema and datapackage libraries.

The first issue is about memory management.
Checking relations on a resource makes the object not only read the related resource but also hold its data in memory, since it is kept as an object attribute.
As a consequence, memory grows when checking relations across a large number of resources.
I don't know why the relations data is kept on the object, so in this PR I propose a new method, drop_relations, to clear it (see the sketch below).
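
For illustration, a minimal sketch of the pattern under discussion; the attribute and helper names are assumptions, not the actual datapackage-py internals:

class Resource:
    def check_relations(self):
        # Reading the related resources caches their full data on the
        # object, so it is never released while the Resource is alive.
        self._relations = self._read_relations()  # hypothetical helper
        return self._validate_against(self._relations)  # hypothetical helper

    def drop_relations(self):
        # Proposed fix: let callers free the cached relations data
        # once validation is done.
        self._relations = None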

Once the memory issue was solved, I had a performance one: checking relations on my 1000+ resources holding 397k rows of data took 98m49.895s.
That's very long.

So I thought about two optimizations.
The first one is to avoid reloading the relations for every resource that belongs to a group.
To enable this optimization I propose adding a check_relations method to the Group object.
Thus we load the relations data once and then use that in-memory data to validate all the resources belonging to the group, as sketched below.
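
Roughly, the idea looks like this; apart from the check_relations name, every identifier here is an assumption for illustration:

class Group:
    def __init__(self, resources):
        self.resources = resources

    def check_relations(self):
        # Load the related resources' data a single time...
        relations = self.resources[0].get_relations()  # hypothetical helper
        # ...then validate each resource of the group against that
        # in-memory data instead of reloading it every time.
        for resource in self.resources:
            resource.check_relations(relations=relations)  # assumed keyword
        return True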

Then a second optimization was proposed in tableschema-py by PR frictionlessdata/tableschema-py#254.
The idea is to pre-index the relations data by the values of the foreign keys. This index is called foreign_keys_values, and it is used to test whether a row references an existing value (a simple hash-map lookup, sketched below).
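
A minimal sketch of the indexing idea; the function names are illustrative, not the tableschema-py API:

def build_foreign_keys_values(related_rows, fields):
    # Index the related rows by the tuple of their foreign-key values,
    # so each row check becomes a single hash-based set lookup.
    return {tuple(row[field] for field in fields) for row in related_rows}

def row_references_existing_value(row, fields, foreign_keys_values):
    # O(1) membership test instead of scanning the related data.
    return tuple(row[field] for field in fields) in foreign_keys_values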

To speed things up further, I also propose exposing the get_foreign_keys_values() method so that the Group object can pre-compute the index only once before using it to validate all the resources.
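
In usage, that could look like the following sketch; the group variable and the keyword argument are assumptions based on this PR and the tableschema-py one:

# Pre-compute the index once for the whole group...
foreign_keys_values = group.resources[0].get_foreign_keys_values()
# ...then reuse it for every resource instead of rebuilding it each time.
for resource in group.resources:
    resource.check_relations(foreign_keys_values=foreign_keys_values)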

Using those optimizations, the validation process dropped from 98m49.895s to 1m3.609s.

I've just realized that this PR is missing documentation updates.
Let's see if reviewers approve the principles before I update the docs.

@roll what do you think?

Commits: "optimized by preparing index of foreign values", "+ optimized way to check group relations"
@paulgirard (Contributor, Author) commented Sep 27, 2019

Oh yeah, of course: since this PR is based on an update of the tableschema-py dependency, CI fails.
This PR can't be valid before the tableschema one is accepted and deployed...

@roll added the review label Sep 30, 2019
@roll self-requested a review Sep 30, 2019
@roll (Member) left a comment:


Thanks!

I've released [email protected] with the code required by this PR.

Just a few minor change requests

print('in %s: ' % resource.name)
if exception.multiple:
    for error in exception.errors:
        print(error)
@roll (Member): Probably a debug print instead of raise?

@paulgirard (Contributor, Author): Indeed, raising looks like the right thing to do.

    for error in exception.errors:
        print(error)
else:
    print(exception)
@roll (Member): Probably a debug print instead of raise?

@paulgirard (Contributor, Author): Indeed, raising looks like the right thing to do.

@@ -52,3 +53,24 @@ def read(self, limit=None, **options):
            if count == limit:
                break
        return rows

    def check_relations(self):
@roll (Member) commented Oct 9, 2019:

Could you please add a test for this method - https://coveralls.io/builds/26209815/source?filename=datapackage%2Fgroup.py#L62?

Our coverage dropped below its validity threshold.

@paulgirard (Contributor, Author): Yep, almost done.

@paulgirard (Contributor, Author): @roll I just removed my try/catch to let exceptions flow upstream. Maybe there is a policy to raise datapackage exceptions rather than tableschema ones?

@roll (Member) left a comment:

Thanks!

Exceptions are fine because they are actually the same objects:

datapackage.RelationError = tableschema.exceptions.RelationError

@roll merged commit 8c0852f into frictionlessdata:master Oct 21, 2019