[IMP] snippets: move all work from parent to mp workers
In `convert_html_columns()`, we select 100MiB worth of DB tuples and pass them
to a ProcessPoolExecutor together with a converter callable. So far, the
converter returns all tuples, changed or unchanged, together with a flag
indicating whether anything was changed. All of this is returned through IPC to
the parent process. There, the caller only acts on the changed tuples; the rest
is ignored. In every scenario I've seen, only a small proportion of the input
tuples is actually changed, so a large proportion is returned through IPC
unnecessarily.
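For illustration, a minimal sketch of that pattern (function names, signatures
and the bare cursor/query parameters here are simplifications for this sketch,
not the actual openupgradelib code; the converter must be a picklable,
module-level callable):

```python
from concurrent.futures import ProcessPoolExecutor


def _convert_rows(rows, converter):
    # Worker: convert every row and return *all* of them, changed or not.
    # The full result list travels back to the parent through IPC, even
    # though the caller only acts on the changed rows.
    out = []
    for row_id, html in rows:
        new_html = converter(html)
        out.append((row_id, new_html, new_html != html))
    return out


def convert_in_parent(cr, select_query, update_query, converter, chunk_size=10000):
    # Parent: fetch all rows up front, farm out only the conversion, then
    # apply the updates itself while collecting every returned chunk.
    cr.execute(select_query)
    rows = cr.fetchall()
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ProcessPoolExecutor() as pool:
        for result in pool.map(_convert_rows, chunks, [converter] * len(chunks)):
            for row_id, new_html, changed in result:
                if changed:
                    cr.execute(update_query, (new_html, row_id))
```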
What makes it worse is that processing the converted results in the parent
process is often slower than the conversion itself, which has two effects:
1) The results of all workers pile up in the parent process's memory, possibly
leading to a MemoryError (upg-2021031)
2) The parallel processing is serialized on this feedback, defeating a
large part of the intended performance gain
To improve this, this commit
- moves all work into the workers: not just the conversion filter, but also
  the DB query and the DB updates (a simplified sketch follows this list)
- thereby reduces the data passed via IPC to just the query texts
- thereby distributes the data held in memory across the worker processes
- reduces the chunk size by one order of magnitude, which means
  - a lot less memory used at a time
  - a much better distribution of "to-be-changed" rows when these rows are
    clustered in the table
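
A minimal sketch of the reworked flow, assuming each worker opens its own DB
connection via psycopg2 and that the parent pre-builds one small SELECT text
per chunk (e.g. per id range); names and the DSN handling are illustrative,
not the actual implementation:

```python
from concurrent.futures import ProcessPoolExecutor

import psycopg2  # assumption: PostgreSQL accessed through psycopg2


def _convert_in_worker(dsn, select_query, update_query, converter):
    # Worker: open its own connection, then query, convert and update
    # entirely on its own. Only the query texts come in over IPC, and
    # only a row count goes back out.
    changed = 0
    with psycopg2.connect(dsn) as conn, conn.cursor() as cr:
        cr.execute(select_query)
        for row_id, html in cr.fetchall():
            new_html = converter(html)
            if new_html != html:
                cr.execute(update_query, (new_html, row_id))
                changed += 1
    return changed


def convert_in_workers(dsn, select_queries, update_query, converter):
    # Parent: hand each worker one small SELECT text and collect nothing
    # but the counts, so no row data ever lives in the parent.
    with ProcessPoolExecutor() as pool:
        futures = [
            pool.submit(_convert_in_worker, dsn, query, update_query, converter)
            for query in select_queries
        ]
        return sum(future.result() for future in futures)
```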
All in all, in my test case, this
- reduces the maximum in-memory process size to 300MiB across all processes,
  compared to formerly >2GiB (and a MemoryError) in the parent process
- reduces the runtime from 17 minutes to less than 2 minutes