@orxfun orxfun commented Sep 11, 2025

Towards thread pools

This PR redefines the parallel runner as follows.

This crate defines parallel computation by combining two basic components.

Pulling inputs

  • Pulling inputs in parallel is achieved through ConcurrentIter. Concurrent iterator implementations are lock-free, efficient and support pull-by-chunks optimization to reduce the parallelization overhead. A thread can pull any number of inputs from the concurrent iterator every time it becomes idle. This provides the means to dynamically decide on the chunk sizes.
  • Furthermore, this reduces the overhead of creating tasks. To illustrate, provided that the computation will be handled by n threads, a closure holding a reference to the input concurrent iterator is defined to represent the computation. This same closure is passed to n threads; i.e., n spawn calls are made. Each of these threads keeps pulling elements from the input until the computation is completed, without requiring any further task definitions.

Writing outputs

  • When we collect results, writing outputs is handled using lock-free containers such as ConcurrentBag and ConcurrentOrderedBag, which aim for high-performance collection of results.

There are two main decisions to be taken while executing these components:

  • how many threads do we use?
  • what is the chunk size; i.e., how many input items does a thread pull each time?

A ParallelRunner is a combination of a ParThreadPool and a ParallelExecutor that are responsible for these decisions, respectively.

ParThreadPool: number of threads

The ParThreadPool trait generalizes thread pools that can be used for parallel computations. This allows the parallel computation to be generic over thread pools.

When not explicitly set, DefaultPool is used:

  • When the std feature is enabled, the default pool is StdDefaultPool. In other words, all available native threads can be used by the parallel computation. This number can be globally bounded by the "ORX_PARALLEL_MAX_NUM_THREADS" environment variable when it is set.
  • When working in a no-std environment, the default pool is SequentialPool. As the name suggests, this pool executes the parallel computation sequentially on the main thread. It can be considered a placeholder to be overwritten by the with_pool or with_runner methods to achieve parallelism.

Note that the thread pool defines the resource, i.e., the upper bound on the number of threads. This upper bound can further be tightened by the num_threads configuration. Finally, the parallel executor might choose not to use all available threads if it decides that the computation is small enough.

To overwrite the defaults and explicitly set the thread pool to be used for the computation, use the with_pool or with_runner methods:

use orx_parallel::*;

let inputs: Vec<_> = (0..42).collect();

// uses the DefaultPool
// assuming "std" enabled, StdDefaultPool will be used; i.e., native threads
let sum = inputs.par().sum();

// equivalent to:
let sum2 = inputs.par().with_pool(StdDefaultPool::default()).sum();
assert_eq!(sum, sum2);

#[cfg(feature = "scoped_threadpool")]
{
    let mut pool = scoped_threadpool::Pool::new(8);
    // uses the scoped_threadpool::Pool created with 8 threads
    let sum2 = inputs.par().with_pool(&mut pool).sum();
    assert_eq!(sum, sum2);
}

#[cfg(feature = "rayon-core")]
{
    let pool = rayon_core::ThreadPoolBuilder::new()
        .num_threads(8)
        .build()
        .unwrap();
    // uses the rayon-core::ThreadPool created with 8 threads
    let sum2 = inputs.par().with_pool(&pool).sum();
    assert_eq!(sum, sum2);
}

#[cfg(feature = "yastl")]
{
    let pool = YastlPool::new(8);
    // uses the yastl::Pool created with 8 threads
    let sum2 = inputs.par().with_pool(&pool).sum();
    assert_eq!(sum, sum2);
}

ParThreadPool implementations for several thread pools are provided in this crate as optional features (see features). Provided that the pool supports scoped computations, it is trivial to implement this trait in most cases (see the implementations for examples).

In most cases, rayon-core, scoped_threadpool and scoped_pool perform better than the others and get close to the native-thread performance of StdDefaultPool.

Since parallel computations are generic over the thread pools, performance can be conveniently compared for specific use cases. An example benchmark can be found in the collect_filter_map file. For quick tests, you may also use the benchmark_pools example.

ParallelExecutor: chunk size

Once the thread pool provides the computation resources, it is the ParallelExecutor's task to distribute work to the available threads. As mentioned above, all threads receive exactly the same closure. This closure continues to pull elements from the input concurrent iterator and operate on them until all elements are processed.

The critical decision that the parallel executor makes is the chunk size. Depending on the state of the computation, it can dynamically decide on the number of elements to pull from the input iterator. The tradeoff it tries to balance is as follows:

  • the larger the chunk size,
    • the smaller the parallelization overhead; but also
    • the larger the risk of imbalance in cases of heterogeneity.

Features

With this PR, the crate is converted into a no-std crate.

  • std: Although this is a no-std crate, std is included as a default feature. Please use the --no-default-features flag for no-std use cases. The std feature enables StdDefaultPool, the default thread provider which uses native threads.
  • rayon-core: This feature enables using rayon_core::ThreadPool for parallel computations.
  • scoped_threadpool: This feature enables using scoped_threadpool::Pool.
  • scoped-pool: This feature enables using scoped-pool::Pool.
  • yastl: This feature enables using yastl::Pool.
  • pond: This feature enables using pond::Pool.
  • poolite: This feature enables using poolite::Pool.

Breaking Change

The changes to the ParallelRunner trait are breaking changes if you have been using the with_runner transformation. However, prior to thread pool support, this transformation was used pretty much as an internal experimentation and benchmarking tool. None of the tests, examples or benchmarks are broken.

Target Issues

This PR aims to address the prerequisites.

Fixes #82

@orxfun orxfun marked this pull request as ready for review September 18, 2025 21:36
@orxfun orxfun merged commit 1b0bf95 into main Sep 25, 2025
2 checks passed
@orxfun orxfun deleted the towards-thread-pool branch September 25, 2025 07:52