Scalability issues #247

Merged: 8 commits into frictionlessdata:master on Oct 21, 2019

Conversation

@paulgirard (Contributor)

See the related PR in tableschema-py: frictionlessdata/tableschema-py#254

I hit scalability issues when trying to validate a data package containing one resource with 397,201 rows that use foreign keys.
I needed to split my 397k-row resource into many files to organize them by source.
I finally chose to use the shared-schema and group notions to split it into many resources.
Trying to validate all those rows surfaced scalability issues in both the tableschema and datapackage libraries.

The first issue is about memory management.
Checking relations on a resource makes the object not only read the related resource but also hold its data in memory, since it is kept as an object attribute.
As a consequence, memory grows when checking relations across a large number of resources.
I don't know why the relations data is kept on the object, so in this PR I propose a new method, drop_relations, to clear it (see the sketch below).
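
For illustration, a minimal sketch of the pattern under discussion; the attribute and helper names are assumptions, not the actual datapackage-py internals:

class Resource:
    def check_relations(self):
        # Reading the related resources caches their full data on the
        # object, so it is never released while the Resource is alive.
        self._relations = self._read_relations()  # hypothetical helper
        return self._validate_against(self._relations)  # hypothetical helper

    def drop_relations(self):
        # Proposed fix: let callers free the cached relations data
        # once validation is done.
        self._relations = None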

Once the memory issue was solved, I had a performance one: checking relations on my 1000+ resources holding 397k rows of data took 98m49.895s.
That's very long.

So I thought about two optimizations.
The first one is to avoid reloading the relations for every resource that belongs to a group.
To enable this optimization I propose adding a check_relations method to the Group object.
Thus we load the relations data once and then use that in-memory data to validate all the resources belonging to the group, as sketched below.
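
Roughly, the idea looks like this; apart from the check_relations name, every identifier here is an assumption for illustration:

class Group:
    def __init__(self, resources):
        self.resources = resources

    def check_relations(self):
        # Load the related resources' data a single time...
        relations = self.resources[0].get_relations()  # hypothetical helper
        # ...then validate each resource of the group against that
        # in-memory data instead of reloading it every time.
        for resource in self.resources:
            resource.check_relations(relations=relations)  # assumed keyword
        return True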

Then a second optimization was proposed in tableschema-py by PR frictionlessdata/tableschema-py#254.
The idea is to pre-index the relations data by the values of the foreign keys. This index is called foreign_keys_values, and it is used to test whether a row references an existing value (a simple hash-map lookup, sketched below).
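
A minimal sketch of the indexing idea; the function names are illustrative, not the tableschema-py API:

def build_foreign_keys_values(related_rows, fields):
    # Index the related rows by the tuple of their foreign-key values,
    # so each row check becomes a single hash-based set lookup.
    return {tuple(row[field] for field in fields) for row in related_rows}

def row_references_existing_value(row, fields, foreign_keys_values):
    # O(1) membership test instead of scanning the related data.
    return tuple(row[field] for field in fields) in foreign_keys_values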

To speed things up further, I also propose exposing the get_foreign_keys_values() method so that the Group object can pre-compute the index only once before using it to validate all the resources.
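
In usage, that could look like the following sketch; the group variable and the keyword argument are assumptions based on this PR and the tableschema-py one:

# Pre-compute the index once for the whole group...
foreign_keys_values = group.resources[0].get_foreign_keys_values()
# ...then reuse it for every resource instead of rebuilding it each time.
for resource in group.resources:
    resource.check_relations(foreign_keys_values=foreign_keys_values)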

Using those optimizations, the validation process dropped from 98m49.895s to 1m3.609s.

I've just realized that this PR is missing documentation updates.
Let's see if reviewers approve the principles before I update the docs.

@roll what do you think?

Commits: "optimized by preparing index of foreign values", "+ optimized way to check group relations"
@paulgirard (Contributor, Author) commented Sep 27, 2019

Oh yeah, of course: since this PR is based on an update of the tableschema-py dependency, CI fails.
This PR can't be valid before the tableschema one is accepted and deployed...

@roll added the review label Sep 30, 2019
@roll self-requested a review Sep 30, 2019
@roll (Member) left a comment:


Thanks!

I've released [email protected] with the code required by this PR.

Just a few minor change requests

print('in %s: ' % resource.name)
if exception.multiple:
    for error in exception.errors:
        print(error)
@roll (Member): Probably a debug print instead of raise?

@paulgirard (Contributor, Author): Indeed, raising looks like the right thing to do.

    for error in exception.errors:
        print(error)
else:
    print(exception)
@roll (Member): Probably a debug print instead of raise?

@paulgirard (Contributor, Author): Indeed, raising looks like the right thing to do.

@@ -52,3 +53,24 @@ def read(self, limit=None, **options):
            if count == limit:
                break
        return rows

    def check_relations(self):
@roll (Member) commented Oct 9, 2019:

Could you please add a test for this method - https://coveralls.io/builds/26209815/source?filename=datapackage%2Fgroup.py#L62?

Our coverage dropped below its validity threshold.

@paulgirard (Contributor, Author): Yep, almost done.

@paulgirard (Contributor, Author): @roll I just removed my try/catch to let exceptions flow upstream. Maybe there is a policy to raise datapackage exceptions rather than tableschema ones?

@roll (Member) left a comment:

Thanks!

Exceptions are fine because they are actually the same objects:

datapackage.RelationError = tableschema.exceptions.RelationError

@roll merged commit 8c0852f into frictionlessdata:master Oct 21, 2019