@orxfun orxfun commented Sep 11, 2025

Towards thread pools

This PR redefines the parallel runner as follows.

This crate defines parallel computation by combining two basic components.

Pulling inputs

  • Pulling inputs in parallel is achieved through ConcurrentIter. Concurrent iterator implementations are lock-free, efficient and support pull-by-chunks optimization to reduce the parallelization overhead. A thread can pull any number of inputs from the concurrent iterator every time it becomes idle. This provides the means to dynamically decide on the chunk sizes.
  • Furthermore, this reduces the overhead of creating tasks. To illustrate, provided that the computation will be handled by n threads, a closure holding a reference to the input concurrent iterator is defined to represent the computation. This same closure is passed to n threads; i.e., n spawn calls are made. Each of these threads keeps pulling elements from the input until the computation is completed, without requiring any further task definitions.

Writing outputs

  • When we collect results, writing outputs is handled using lock-free containers such as ConcurrentBag and ConcurrentOrderedBag, which aim for high-performance collection of results.

There are two main decisions to be taken while executing these components:

  • how many threads do we use?
  • what is the chunk size; i.e., how many input items does a thread pull each time?

A ParallelRunner is a combination of a ParThreadPool and a ParallelExecutor that are responsible for these decisions, respectively.

ParThreadPool: number of threads

The ParThreadPool trait generalizes thread pools that can be used for parallel computations. This allows the parallel computation to be generic over thread pools.

When not explicitly set, DefaultPool is used:

  • When the std feature is enabled, the default pool is StdDefaultPool. In other words, all available native threads can be used by the parallel computation. This number can be globally bounded by the "ORX_PARALLEL_MAX_NUM_THREADS" environment variable when it is set.
  • When working in a no-std environment, the default pool is SequentialPool. As the name suggests, this pool executes the parallel computation sequentially on the main thread. It can be considered a placeholder to be overwritten by the with_pool or with_runner methods to achieve parallelism.

Note that the thread pool defines the resource, i.e., the upper bound on the number of threads. This upper bound can further be tightened by the num_threads configuration. Finally, the parallel executor might choose not to use all available threads if it decides that the computation is small enough.

To overwrite the defaults and explicitly set the thread pool to be used for the computation, use the with_pool or with_runner methods:

use orx_parallel::*;

let inputs: Vec<_> = (0..42).collect();

// uses the DefaultPool
// assuming "std" enabled, StdDefaultPool will be used; i.e., native threads
let sum = inputs.par().sum();

// equivalent to:
let sum2 = inputs.par().with_pool(StdDefaultPool::default()).sum();
assert_eq!(sum, sum2);

#[cfg(feature = "scoped_threadpool")]
{
    let mut pool = scoped_threadpool::Pool::new(8);
    // uses the scoped_threadpool::Pool created with 8 threads
    let sum2 = inputs.par().with_pool(&mut pool).sum();
    assert_eq!(sum, sum2);
}

#[cfg(feature = "rayon-core")]
{
    let pool = rayon_core::ThreadPoolBuilder::new()
        .num_threads(8)
        .build()
        .unwrap();
    // uses the rayon-core::ThreadPool created with 8 threads
    let sum2 = inputs.par().with_pool(&pool).sum();
    assert_eq!(sum, sum2);
}

#[cfg(feature = "yastl")]
{
    let pool = YastlPool::new(8);
    // uses the yastl::Pool created with 8 threads
    let sum2 = inputs.par().with_pool(&pool).sum();
    assert_eq!(sum, sum2);
}

ParThreadPool implementations for several thread pools are provided in this crate as optional features (see features). Provided that the pool supports scoped computations, it is trivial to implement this trait in most cases (see the implementations for examples).

In most cases, rayon-core, scoped_threadpool and scoped_pool perform better than the others and get close to the native-thread performance of StdDefaultPool.

Since parallel computations are generic over the thread pools, performance can be conveniently compared for specific use cases. An example benchmark can be found in the collect_filter_map file. For quick tests, you may also use the benchmark_pools example.

ParallelExecutor: chunk size

Once the thread pool provides the computation resources, it is the ParallelExecutor's task to distribute work to the available threads. As mentioned above, all threads receive exactly the same closure. This closure continues to pull elements from the input concurrent iterator and operate on them until all elements are processed.

The critical decision that the parallel executor makes is the chunk size. Depending on the state of the computation, it can dynamically decide on the number of elements to pull from the input iterator. The tradeoff it tries to balance is as follows:

  • the larger the chunk size,
    • the smaller the parallelization overhead; but also
    • the larger the risk of imbalance in cases of heterogeneity.

Features

With this PR, the crate is converted into a no-std crate.

  • std: Although this is a no-std crate, std is included as a default feature. Please use the --no-default-features flag for no-std use cases. The std feature enables StdDefaultPool, the default thread provider which uses native threads.
  • rayon-core: This feature enables using rayon_core::ThreadPool for parallel computations.
  • scoped_threadpool: This feature enables using scoped_threadpool::Pool.
  • scoped-pool: This feature enables using scoped-pool::Pool.
  • yastl: This feature enables using yastl::Pool.
  • pond: This feature enables using pond::Pool.
  • poolite: This feature enables using poolite::Pool.

Breaking Change

The changes to the ParallelRunner trait are breaking changes if you have been using the with_runner transformation. However, prior to thread pool support, this transformation was used pretty much as an internal experimentation and benchmarking tool. None of the tests, examples or benchmarks are broken.

Target Issues

This PR aims to address the prerequisites.

Fixes #82

@orxfun orxfun marked this pull request as ready for review September 18, 2025 21:36
@orxfun orxfun merged commit 1b0bf95 into main Sep 25, 2025
2 checks passed
@orxfun orxfun deleted the towards-thread-pool branch September 25, 2025 07:52