Parallel Runner Trait and Implementations
This discussion aims to shape the future implementation of the parallel execution of Par. It would most likely be a part of v2. The goal is to make the parallel execution logic a trait so that it is customizable and, as a natural consequence, allows alternative runner implementations to be plugged in.
In the next section, the current parallel executor is briefly summarized; it could serve as the default implementation of the trait to be defined. Then, some ideas on the trait and its implementations are provided as a basis for the discussion.
Current Parallel Execution
Current parallel execution is governed by the Runner. The runner takes three bits of information: the number of threads, the chunk size, and the kind of the computation, such as an early-return computation like the find method.

Stopping execution is handled by immediately skipping to the end of the concurrent iterator. Therefore, each thread will know that it must return once it is idle again and sees that the iterator is consumed. Note that a very large chunk size would be the most performance degrading in this scenario.

The executor spawns threads sequentially, one after the other. It is capable of avoiding unnecessary threads by checking the concurrent iterator's status in between spawning two threads. However, there is little time to decide here. In the majority of computations, unless the computation is too easy, it quickly spawns the desired number of threads.
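To make the stopping mechanism concrete, below is a minimal stand-in sketch of how skipping to the end could work for an index-based concurrent iterator. ConIterStandIn and its methods are illustrative assumptions, not the crate's actual types.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative stand-in for the concurrent iterator: threads atomically
// claim chunks of indices from a shared cursor.
struct ConIterStandIn {
    len: usize,
    cursor: AtomicUsize,
}

impl ConIterStandIn {
    /// Claims the next chunk of indices; returns None once consumed.
    fn next_chunk(&self, chunk_size: usize) -> Option<std::ops::Range<usize>> {
        let begin = self.cursor.fetch_add(chunk_size, Ordering::Relaxed);
        (begin < self.len).then(|| begin..(begin + chunk_size).min(self.len))
    }

    /// Early return (e.g., find succeeded): jump the cursor to the end so
    /// that every idle thread's next pull returns None and it exits.
    fn skip_to_end(&self) {
        self.cursor.fetch_max(self.len, Ordering::Relaxed);
    }
}
```

Note that a thread in the middle of a very large chunk cannot observe the skip until it finishes that chunk, which is why large chunks are particularly harmful for early-return computations.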
The chunk size can be set exactly by the caller; in that case there is no work for the executor. When it is set to Auto, it is converted to a minimum chunk size by a heuristic depending on the input concurrent iterator.
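For illustration, such a heuristic might look like the following; the length-based rule and all thresholds are assumptions for the sketch, not the crate's actual numbers.

```rust
// Hypothetical Auto heuristic: derive a minimum chunk size from the
// concurrent iterator's known initial length. Thresholds are assumptions.
fn auto_min_chunk_size(initial_len: Option<usize>) -> usize {
    match initial_len {
        None => 1,                 // unknown length: stay conservative
        Some(n) if n < 1_024 => 1, // short input: chunking buys little
        Some(n) => (n / 1_024).next_power_of_two().min(64),
    }
}
```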
Then, in between spawning threads, the executor has the opportunity to observe the progress of the already spawned threads and to spawn the new thread with a larger chunk size. Although this seems very limited, it is already very effective in reducing the overhead of parallelization. However, it could provide much more capability if the executor had the opportunity to decide on the chunk size every time a thread requests to pull from the concurrent iterator.
Runner Trait - Step 1 - Dynamic Chunk Sizes
In the first attempt, we can leave the decision on the number of threads out and let the runner control the parallel execution by deciding on the chunk sizes. The problem it must solve is then: given what has been observed of the computation so far, how many elements should an idle thread pull next from the concurrent iterator?
Let's define some helper types and a draft for the trait.
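A minimal sketch of what such helper types and a trait draft could look like; all names here (ParallelRunner, ThreadObservation, ComputationKind, chunk_size) are illustrative assumptions rather than the crate's actual API.

```rust
use std::time::Duration;

/// Kind of the parallel computation being executed.
pub enum ComputationKind {
    /// Every element will be consumed, e.g., map, reduce or collect.
    Consuming,
    /// Computation may stop early, e.g., find.
    EarlyReturn,
}

/// What an idle thread can report when asking for its next chunk.
pub struct ThreadObservation {
    /// Size of the chunk this thread pulled last (0 on its first pull).
    pub last_chunk_size: usize,
    /// Time the thread spent computing its last chunk.
    pub last_chunk_duration: Duration,
    /// Total number of elements this thread has processed so far.
    pub num_processed: usize,
}

/// Draft trait governing parallel execution through chunk size decisions.
pub trait ParallelRunner {
    /// Called once before execution with the kind of the computation.
    fn begin(&mut self, kind: ComputationKind);

    /// Called by each idle thread before pulling from the concurrent
    /// iterator; returns the number of elements to pull next.
    fn chunk_size(&self, observation: &ThreadObservation) -> usize;
}
```

With such a shape, chunk_size must be callable concurrently from many threads, so a real implementation would rely on atomics or other interior mutability for any shared state.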
Then, each thread that becomes idle first calls the chunk_size method and then proceeds according to the runner's chunk size decision.

How is dynamic chunk size decision helpful?
As a rule of thumb, we want to work with a chunk size large enough to make the parallelization overhead negligible. Now assume a long iterator with many elements. We would tend to start with a not-so-small chunk size, say 64. But at the beginning, we don't know how long each individual computation takes. This approach would lead to very poor results if each computation is very lengthy and the computation class is early return (find). Parallel execution would most likely be orders of magnitude slower than sequential execution if the element we are looking for is the first one. This would then force us to be more conservative in the chunk size decision for early-return scenarios, but then we would likely suffer from parallelization overhead when the computations are much simpler.
But if the runner can observe the execution and decide accordingly, it could make good decisions.

Consider the same scenario, where we start with a chunk size of 1. In the heavy computation case, our runner's chunk_size method can simply keep returning 1, as it sees that each computation takes long enough to outweigh the parallelization overhead. Therefore, we would be correct and efficient.

If the computations are easy, the threads would compute the single elements very quickly and ask for the next chunk size. This would be visible to the runner, and the runner could decide to increase the chunk size, say to 4, then to 8 in the next call, and so on. Whenever it sees that the computation time outweighs the parallelization overhead, it could stop increasing. And we would still be correct and efficient.
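As a sketch of this doubling strategy, building on the illustrative trait above; the overhead threshold and the cap are assumptions.

```rust
use std::time::Duration;

// Hypothetical runner implementing the doubling strategy described above,
// against the sketched ParallelRunner trait.
pub struct DoublingRunner {
    /// Chunks finishing faster than this are considered too cheap
    /// relative to the parallelization overhead.
    overhead_threshold: Duration,
    max_chunk_size: usize,
}

impl ParallelRunner for DoublingRunner {
    fn begin(&mut self, _kind: ComputationKind) {}

    fn chunk_size(&self, observation: &ThreadObservation) -> usize {
        if observation.last_chunk_size == 0 {
            return 1; // first pull: start with a single element
        }
        if observation.last_chunk_duration < self.overhead_threshold {
            // The last chunk was computed too quickly: double the chunk
            // size to amortize the overhead, up to the cap.
            (observation.last_chunk_size * 2).min(self.max_chunk_size)
        } else {
            // Computation time already dominates the overhead: keep it.
            observation.last_chunk_size
        }
    }
}
```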
Runner Trait - Step 2 - Dynamic Number of Threads
We would always give the caller the opportunity to set an upper bound on the number of threads that can be used for the computation. If it is omitted, we can assume that all available threads can be used. Let's say this number is N.

Most parallel computations are not ideal; we get diminishing improvements as we add more threads. Nevertheless, using all N threads is almost always optimal (unless the computation is super trivial). Even currently in v1, the caller can benchmark to find the desired sweet spot and set a maximum number of threads accordingly. This could be good enough in the future as well.
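For illustration, such a cap might be set as follows; the method names and call chain here are assumptions, not the crate's confirmed v1 API.

```rust
// Hypothetical usage sketch: cap the computation at a benchmarked sweet
// spot of 8 threads. Method names (par, num_threads) are assumptions.
use orx_parallel::*;

fn sum_of_squares(inputs: &[u64]) -> u64 {
    inputs
        .par()           // create the parallel iterator
        .num_threads(8)  // upper bound found by benchmarking
        .map(|x| x * x)
        .sum()
}
```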
One alternative, on the other hand, would be to give more responsibility to the ParallelRunner trait; in particular, to let it decide whether or not, or when, to spawn a new thread.

An example use case could be as follows. Consider a computation with very unpredictable computation times of individual elements, running in a tight resource environment. A strategy can be implemented through the parallel runner trait as follows (see the sketch after this list):

- If T seconds have passed and the iterator has not reached 20% yet, spawn the second thread.
- If 2T seconds have passed and the iterator has not reached 50% yet, spawn two more threads.

This could give very detailed control to the implementor. Most likely, the complexity of implementing such a strategy would be worthwhile only in very critical use cases.
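A sketch of such a strategy, under the assumption that the executor periodically asks the runner how many additional threads to spawn; all names and the polling model are illustrative.

```rust
use std::time::{Duration, Instant};

// Hypothetical time-based spawning strategy for a tight resource
// environment, following the two rules listed above.
pub struct TimeBasedSpawner {
    started: Instant,
    t: Duration,
}

impl TimeBasedSpawner {
    pub fn new(t: Duration) -> Self {
        Self { started: Instant::now(), t }
    }

    /// Returns how many additional threads to spawn, given the fraction
    /// of the iterator consumed so far and the current thread count.
    pub fn additional_threads(&self, progress: f64, running: usize) -> usize {
        let elapsed = self.started.elapsed();
        if running == 1 && elapsed >= self.t && progress < 0.20 {
            1 // spawn the second thread
        } else if running == 2 && elapsed >= self.t * 2 && progress < 0.50 {
            2 // spawn two more threads
        } else {
            0 // keep the current number of threads
        }
    }
}
```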