Towards thread pools
This PR redefines the parallel runner as follows.
This crate defines parallel computation by combining two basic components.
Pulling inputs
Inputs of the parallel computation are pulled from a `ConcurrentIter`. Concurrent iterator implementations are lock-free, efficient and support the pull-by-chunks optimization to reduce parallelization overhead. A thread can pull any number of inputs from the concurrent iterator every time it becomes idle. This provides the means to decide on chunk sizes dynamically.

When the computation is executed by `n` threads, a closure holding a reference to the input concurrent iterator is defined to represent the computation. This same closure is passed to the `n` threads; i.e., `n` spawn calls are made. Each of these threads keeps pulling elements from the input until the computation is completed, without requiring another task to be defined.
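A minimal sketch of this pulling pattern with plain std threads, using a hypothetical `SharedIter` in place of the crate's `ConcurrentIter` (the type, the chunk size of 64 and the closure are all illustrative):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for a concurrent iterator over a slice; the real
/// crate uses `ConcurrentIter`, this only illustrates the pulling pattern.
struct SharedIter<'a, T> {
    items: &'a [T],
    next: AtomicUsize,
}

impl<'a, T> SharedIter<'a, T> {
    /// Atomically claim the next `chunk` elements; empty slice when exhausted.
    fn pull_chunk(&self, chunk: usize) -> &'a [T] {
        let items: &'a [T] = self.items;
        let begin = self.next.fetch_add(chunk, Ordering::Relaxed).min(items.len());
        let end = (begin + chunk).min(items.len());
        &items[begin..end]
    }
}

fn main() {
    let numbers: Vec<u64> = (0..10_000).collect();
    let iter = SharedIter { items: &numbers, next: AtomicUsize::new(0) };
    let n = 4; // number of spawn calls

    // The same closure is given to each of the n threads; every thread keeps
    // pulling chunks from the shared input until it is exhausted.
    let compute = |iter: &SharedIter<u64>| -> u64 {
        let mut sum = 0;
        loop {
            let chunk = iter.pull_chunk(64);
            if chunk.is_empty() {
                break sum;
            }
            sum += chunk.iter().sum::<u64>();
        }
    };

    let total: u64 = std::thread::scope(|s| {
        let handles: Vec<_> = (0..n).map(|_| s.spawn(|| compute(&iter))).collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    });

    assert_eq!(total, numbers.iter().sum::<u64>());
}
```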
Writing outputs

Outputs are written to concurrent collections, namely `ConcurrentBag` and `ConcurrentOrderedBag`, which aim for high performance collection of results.
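A rough sketch of collecting results from several threads into a `ConcurrentBag`, assuming the `new`/`push`/`into_inner` API of the orx-concurrent-bag crate (the calls are illustrative rather than taken from this PR):

```rust
use orx_concurrent_bag::ConcurrentBag;

fn main() {
    let bag: ConcurrentBag<u32> = ConcurrentBag::new();

    // Each thread pushes its results concurrently; the bag collects them
    // efficiently, but the order across threads is not defined.
    std::thread::scope(|s| {
        for t in 0..4u32 {
            let bag = &bag;
            s.spawn(move || {
                for i in 0..1_000 {
                    bag.push(t * 1_000 + i);
                }
            });
        }
    });

    // ConcurrentOrderedBag would instead be used when results must be written
    // to their correct positions so that the input order is preserved.
    let results = bag.into_inner();
    assert_eq!(results.len(), 4_000);
}
```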
There are two main decisions to be taken while executing these components: the number of threads to perform the computation, and the chunk size, i.e., the number of input elements to pull at once.

A `ParallelRunner` is a combination of a `ParThreadPool` and a `ParallelExecutor` that are responsible for these decisions, respectively.

ParThreadPool: number of threads
The `ParThreadPool` trait generalizes thread pools that can be used for parallel computations. This allows the parallel computation to be generic over thread pools.

When not explicitly set, the `DefaultPool` is used:

- `StdDefaultPool`: all available native threads can be used by the parallel computation. This number can be globally bounded by the "ORX_PARALLEL_MAX_NUM_THREADS" environment variable when set.
- `SequentialPool`: as the name suggests, this pool executes the parallel computation sequentially on the main thread. It can be considered a placeholder to be overwritten by the `with_pool` or `with_runner` methods to achieve parallelism.

Note that the thread pool defines the resource, i.e., the upper bound on the number of threads. This upper bound can further be bounded by the `num_threads` configuration. Finally, the parallel executor might choose not to use all available threads if it decides that the computation is small enough.
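As a usage sketch only, assuming the crate's `par()` and `num_threads` configuration methods (the exact call chain is illustrative):

```rust
use orx_parallel::*;

fn main() {
    let inputs: Vec<u64> = (0..1_000_000).collect();

    // Cap this computation at 4 threads regardless of the pool's size;
    // the parallel executor may still use fewer if the work is small.
    let sum: u64 = inputs.par().num_threads(4).map(|x| x + 1).sum();

    println!("{sum}");
}
```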
To overwrite the defaults and explicitly set the thread pool to be used for the computation, the `with_pool` or `with_runner` methods are used.
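For instance, a computation could be run on a rayon-core pool roughly as follows; this is a sketch that assumes `with_pool` accepts a reference to the pool, which may differ from the actual signature:

```rust
use orx_parallel::*;

fn main() {
    // A 4-thread rayon-core pool; ThreadPoolBuilder is rayon-core's own API.
    let pool = rayon_core::ThreadPoolBuilder::new()
        .num_threads(4)
        .build()
        .expect("failed to build the pool");

    let inputs: Vec<u64> = (0..1_000_000).collect();

    // Run the computation on the explicitly provided pool rather than the
    // default pool; the call chain is illustrative.
    let sum: u64 = inputs.par().with_pool(&pool).map(|x| x + 1).sum();

    println!("{sum}");
}
```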
`ParThreadPool` implementations of several thread pools are provided in this crate as optional features (see the Features section). Provided that the pool supports scoped computations, it is trivial to implement this trait in most cases (see the provided implementations for examples).

In most cases, rayon-core, scoped_threadpool and scoped_pool perform better than the others, and get close to the native-threads performance of `StdDefaultPool`.

Since parallel computations are generic over thread pools, performances can be conveniently compared for specific use cases. An example benchmark can be found in the collect_filter_map file. For quick tests, you may also use the example benchmark_pools.
ParallelExecutor: chunk size
Once the thread pool provides the computation resources, it is the `ParallelExecutor`'s task to distribute work to the available threads. As mentioned above, all threads receive exactly the same closure. This closure continues to pull elements from the input concurrent iterator and operate on them until all elements are processed.

The critical decision that the parallel executor makes is the chunk size. Depending on the state of the computation, it can dynamically decide on the number of elements to pull from the input iterator at each step. The tradeoff it tries to resolve is as follows: larger chunks reduce the parallelization overhead of repeatedly accessing the shared iterator, while smaller chunks keep the threads better load-balanced, especially towards the end of the computation.
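As a purely illustrative sketch of such a dynamic decision (this heuristic is invented for the example and is not the policy implemented in this PR), a chunk-size function might grow chunks as the computation proceeds while never pulling more than a fair share of the remaining input:

```rust
/// Purely illustrative heuristic, not the crate's actual policy:
/// start with small chunks for good load balancing, then grow them to
/// amortize the cost of accessing the shared concurrent iterator.
fn next_chunk_size(pulled_so_far: usize, remaining_hint: Option<usize>, num_threads: usize) -> usize {
    // grow geometrically from 1 up to a cap of 1024
    let grown = (pulled_so_far / 16 + 1).next_power_of_two().min(1024);
    match remaining_hint {
        // never pull more than a fair share of what is left
        Some(remaining) => grown.min((remaining / num_threads).max(1)),
        None => grown,
    }
}

fn main() {
    assert_eq!(next_chunk_size(0, Some(10_000), 8), 1);
    assert_eq!(next_chunk_size(640, Some(10_000), 8), 64);
    assert_eq!(next_chunk_size(9_990, Some(10), 8), 1);
}
```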
Features
With this PR, the crate is converted into a no-std crate.
Use the `--no-default-features` flag for no-std use cases. The std feature enables `StdDefaultPool` as the default thread provider, which uses native threads. Further optional features enable `ParThreadPool` implementations for the following pools:

- `rayon_core::ThreadPool` for parallel computations
- `scoped_threadpool::Pool`
- `scoped-pool::Pool`
- `yastl::Pool`
- `pond::Pool`
- `poolite::Pool`

Breaking Change
The changes on the `ParallelRunner` trait are breaking changes if you have been using the `with_runner` transformation. However, prior to thread pools, this transformation was used pretty much as an internal experimental and benchmarking tool. None of the tests, examples or benchmarks are broken.

Target Issues

This PR aims to address the pre-requisites of the following issue.
Fixes #82