Parallel Runner Trait and Implementations
This discussion aims to shape the future implementation of the parallel execution of Par. It would most likely be a part of v2. The goal is to make the parallel execution logic a trait so that it is customizable and, as a natural consequence, allows alternative runner implementations to be plugged in.
In the next section, the current parallel executor is briefly summarized; it could serve as the default implementation of the trait to be defined. Then, some ideas on the trait and its implementations are provided as a basis for the discussion.
Current Parallel Execution
Current parallel execution is governed by the Runner. The runner takes three bits of information: the number of threads, the chunk size, and the kind of the computation, such as an early-return computation like the find method.

Stopping execution is handled by immediately skipping to the end of the concurrent iterator. Therefore, each thread will know that it must return once it is idle again and sees that the iterator is consumed. Note that a very large chunk size would be the most performance degrading in this scenario.

The executor spawns threads sequentially, one after the other. It is capable of avoiding unnecessary threads by checking the concurrent iterator's status in between spawning two threads. However, there is little time to decide here. In the majority of computations, unless the computation is too easy, it quickly spawns the desired number of threads.
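To make the stopping mechanism concrete, below is a minimal stand-in sketch of how skipping to the end could work for an index-based concurrent iterator. ConIterStandIn and its methods are illustrative assumptions, not the crate's actual types.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative stand-in for the concurrent iterator: threads atomically
// claim chunks of indices from a shared cursor.
struct ConIterStandIn {
    len: usize,
    cursor: AtomicUsize,
}

impl ConIterStandIn {
    /// Claims the next chunk of indices; returns None once consumed.
    fn next_chunk(&self, chunk_size: usize) -> Option<std::ops::Range<usize>> {
        let begin = self.cursor.fetch_add(chunk_size, Ordering::Relaxed);
        (begin < self.len).then(|| begin..(begin + chunk_size).min(self.len))
    }

    /// Early return (e.g., find succeeded): jump the cursor to the end so
    /// that every idle thread's next pull returns None and it exits.
    fn skip_to_end(&self) {
        self.cursor.fetch_max(self.len, Ordering::Relaxed);
    }
}
```

Note that a thread in the middle of a very large chunk cannot observe the skip until it finishes that chunk, which is why large chunks are particularly harmful for early-return computations.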
The chunk size can be set exactly by the caller; in that case there is no work for the executor. When it is set to Auto, it is converted to a minimum chunk size by a heuristic depending on the input concurrent iterator.
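For illustration, such a heuristic might look like the following; the length-based rule and all thresholds are assumptions for the sketch, not the crate's actual numbers.

```rust
// Hypothetical Auto heuristic: derive a minimum chunk size from the
// concurrent iterator's known initial length. Thresholds are assumptions.
fn auto_min_chunk_size(initial_len: Option<usize>) -> usize {
    match initial_len {
        None => 1,                 // unknown length: stay conservative
        Some(n) if n < 1_024 => 1, // short input: chunking buys little
        Some(n) => (n / 1_024).next_power_of_two().min(64),
    }
}
```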
Then, in between spawning threads, the executor has the opportunity to observe the progress of the already spawned threads and to spawn the new thread with a larger chunk size. Although this seems very limited, it is already very effective in reducing the overhead of parallelization. However, it could provide much more capability if the executor had the opportunity to decide on the chunk size every time a thread requests to pull from the concurrent iterator.
Runner Trait - Step 1 - Dynamic Chunk Sizes
In the first attempt, we can leave the decision on the number of threads out and let the runner control the parallel execution by deciding on the chunk sizes. The problem it must solve is then: given what has been observed of the computation so far, how many elements should an idle thread pull next from the concurrent iterator?
Let's define some helper types and a draft for the trait.
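A minimal sketch of what such helper types and a trait draft could look like; all names here (ParallelRunner, ThreadObservation, ComputationKind, chunk_size) are illustrative assumptions rather than the crate's actual API.

```rust
use std::time::Duration;

/// Kind of the parallel computation being executed.
pub enum ComputationKind {
    /// Every element will be consumed, e.g., map, reduce or collect.
    Consuming,
    /// Computation may stop early, e.g., find.
    EarlyReturn,
}

/// What an idle thread can report when asking for its next chunk.
pub struct ThreadObservation {
    /// Size of the chunk this thread pulled last (0 on its first pull).
    pub last_chunk_size: usize,
    /// Time the thread spent computing its last chunk.
    pub last_chunk_duration: Duration,
    /// Total number of elements this thread has processed so far.
    pub num_processed: usize,
}

/// Draft trait governing parallel execution through chunk size decisions.
pub trait ParallelRunner {
    /// Called once before execution with the kind of the computation.
    fn begin(&mut self, kind: ComputationKind);

    /// Called by each idle thread before pulling from the concurrent
    /// iterator; returns the number of elements to pull next.
    fn chunk_size(&self, observation: &ThreadObservation) -> usize;
}
```

With such a shape, chunk_size must be callable concurrently from many threads, so a real implementation would rely on atomics or other interior mutability for any shared state.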
Then, each thread that becomes idle first calls the chunk_size method and then proceeds according to the runner's chunk size decision.

How is dynamic chunk size decision helpful?
As a rule of thumb, we want to work with a chunk size large enough to make the parallelization overhead negligible. Now assume a long iterator with many elements. We would tend to start with a not-so-small chunk size, say 64. But at the beginning, we don't know how long each individual computation takes. This approach would lead to very poor results if each computation is very lengthy and the computation class is early return (find). Parallel execution would most likely be orders of magnitude slower than sequential execution if the element we are looking for is the first one. This would then force us to be more conservative in the chunk size decision for early-return scenarios, but then we would likely suffer from parallelization overhead when the computations are much simpler.
But if the runner can observe the execution and decide accordingly, it could make good decisions.

Consider the same scenario, where we start with a chunk size of 1. In the heavy computation case, our runner's chunk_size method can simply keep returning 1, as it sees that each computation takes long enough to outweigh the parallelization overhead. Therefore, we would be correct and efficient.

If the computations are easy, the threads would compute the single elements very quickly and ask for the next chunk size. This would be visible to the runner, and the runner could decide to increase the chunk size, say to 4, then to 8 in the next call, and so on. Whenever it sees that the computation time outweighs the parallelization overhead, it could stop increasing. And we would still be correct and efficient.
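As a sketch of this doubling strategy, building on the illustrative trait above; the overhead threshold and the cap are assumptions.

```rust
use std::time::Duration;

// Hypothetical runner implementing the doubling strategy described above,
// against the sketched ParallelRunner trait.
pub struct DoublingRunner {
    /// Chunks finishing faster than this are considered too cheap
    /// relative to the parallelization overhead.
    overhead_threshold: Duration,
    max_chunk_size: usize,
}

impl ParallelRunner for DoublingRunner {
    fn begin(&mut self, _kind: ComputationKind) {}

    fn chunk_size(&self, observation: &ThreadObservation) -> usize {
        if observation.last_chunk_size == 0 {
            return 1; // first pull: start with a single element
        }
        if observation.last_chunk_duration < self.overhead_threshold {
            // The last chunk was computed too quickly: double the chunk
            // size to amortize the overhead, up to the cap.
            (observation.last_chunk_size * 2).min(self.max_chunk_size)
        } else {
            // Computation time already dominates the overhead: keep it.
            observation.last_chunk_size
        }
    }
}
```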
Runner Trait - Step 2 - Dynamic Number of Threads
We would always give the caller the opportunity to set an upper bound on the number of threads that can be used for the computation. If it is omitted, we can assume that all available threads can be used. Let's say this number is N.

Most parallel computations are not ideal; we get diminishing improvements as we add more threads. Nevertheless, using all N threads is almost always optimal (unless the computation is super trivial). Even currently in v1, the caller can benchmark to find the desired sweet spot and set a maximum number of threads accordingly. This could be good enough in the future as well.
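For illustration, such a cap might be set as follows; the method names and call chain here are assumptions, not the crate's confirmed v1 API.

```rust
// Hypothetical usage sketch: cap the computation at a benchmarked sweet
// spot of 8 threads. Method names (par, num_threads) are assumptions.
use orx_parallel::*;

fn sum_of_squares(inputs: &[u64]) -> u64 {
    inputs
        .par()           // create the parallel iterator
        .num_threads(8)  // upper bound found by benchmarking
        .map(|x| x * x)
        .sum()
}
```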
One alternative, on the other hand, would be to give more responsibility to the ParallelRunner trait; in particular, to let it decide whether or not, or when, to spawn a new thread.

An example use case could be as follows. Consider a computation with very unpredictable computation times of individual elements, running in a tight resource environment. A strategy can be implemented through the parallel runner trait as follows (see the sketch after this list):

- If T seconds have passed and the iterator has not reached 20% yet, spawn the second thread.
- If 2T seconds have passed and the iterator has not reached 50% yet, spawn two more threads.

This could give very detailed control to the implementor. Most likely, the complexity of implementing such a strategy would be worthwhile only in very critical use cases.
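A sketch of such a strategy, under the assumption that the executor periodically asks the runner how many additional threads to spawn; all names and the polling model are illustrative.

```rust
use std::time::{Duration, Instant};

// Hypothetical time-based spawning strategy for a tight resource
// environment, following the two rules listed above.
pub struct TimeBasedSpawner {
    started: Instant,
    t: Duration,
}

impl TimeBasedSpawner {
    pub fn new(t: Duration) -> Self {
        Self { started: Instant::now(), t }
    }

    /// Returns how many additional threads to spawn, given the fraction
    /// of the iterator consumed so far and the current thread count.
    pub fn additional_threads(&self, progress: f64, running: usize) -> usize {
        let elapsed = self.started.elapsed();
        if running == 1 && elapsed >= self.t && progress < 0.20 {
            1 // spawn the second thread
        } else if running == 2 && elapsed >= self.t * 2 && progress < 0.50 {
            2 // spawn two more threads
        } else {
            0 // keep the current number of threads
        }
    }
}
```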