RFC for the task arena waiting functionality #1617

File: `rfcs/proposed/task_arena_waiting/readme.md`
# Wait for work completion in a task_arena

## Motivation

Task arenas in oneTBB are the places for threads to share and execute tasks.
A `task_arena` instance represents a configurable execution context for parallel work.

There are two primary ways to submit work to an arena: the `execute` and `enqueue` functions.
Both take a callable object and run it in the context of the arena. The callable object
might start more parallel work in the arena by invoking a oneTBB algorithm, running a flow graph,
or submitting work into a task group.
`execute` is a blocking call: the calling thread does not return until the callable object
completes. `enqueue` is a “fire-and-forget” call: the calling thread submits the callable
object as a task and returns immediately, providing no way to synchronize with the completion
of the task.

In [the oneDPL case study of asynchronous algorithms](https://github.com/uxlfoundation/oneDPL/tree/main/rfcs/archived/asynchronous_api_general#use-case-study)
three main usage patterns were observed, each assuming subsequent synchronization either with
the asynchronous job or with all jobs previously submitted into a work queue or device.
A task arena can be considered analogous to a device or a work queue in these patterns,
but it lacks synchronization capabilities and so cannot support these use cases on its own;
for that, it has to be paired with something "waitable".

### Current solution: combining with a task group

In oneTBB, asynchronous execution is supported by `task_group` and the flow graph API; both allow
submitting a job and waiting for its completion later. The flow graph is more suitable for building
and executing graphs of dependent computations, while the `task_group`, as of now, allows starting
and waiting for independent tasks. Notably, both require calling `wait`/`wait_for_all` to ensure that
the work gets done. `task_arena::enqueue`, on the other hand, being "fire-and-forget", guarantees
the eventual availability of a thread in the arena to execute the task.

So, a reasonable solution for the described use cases seems to be combining a `task_arena` with
a `task_group`. However, it is notoriously non-trivial to do this right. For example, the following
"naive" attempt is subtly incorrect:
```cpp
tbb::task_arena ta{/*args*/};
tbb::task_group tg;
ta.enqueue([&tg]{ tg.run([]{ foo(); }); });
bar();
ta.execute([&tg]{ tg.wait(); });
```
The problem is that `enqueue` submits a task that calls `tg.run` to add `[]{ foo(); }` to the task group,
but it is unknown whether that task has actually executed prior to `tg.wait`. Simply put,
the task group might still be empty, in which case `tg.wait` returns prematurely.
To avoid that, `execute` can be used instead of `enqueue`, but then the mentioned
thread availability guarantee is lost. The approach with `execute` is shown in the
[oneTBB Developer Guide](https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Guiding_Task_Scheduler_Execution.html)
as an example of splitting work across several NUMA domains. The example utilizes the fork-join
synchronization pattern described in the mentioned oneDPL document to ensure that the work is complete
in all the arenas. It also illustrates that the problem stated in this proposal is relevant.
A better way of using these classes together, however, is the following:
```cpp
tbb::task_arena ta{/*args*/};
tbb::task_group tg;
ta.enqueue(tg.defer([]{ foo(); }));
bar();
ta.execute([&tg]{ tg.wait(); });
```
In this case, the task group "registers" a deferred task to run `foo()`, which is then enqueued
to the task arena. The task is added by the calling thread, so we can be sure that `tg.wait` will not
return until the task completes.

While combining `task_arena` with `task_group` is a viable solution, it would be good to avoid
the extra complexity and verbosity it currently requires.

### Waiting for all tasks in the arena

The idea of waiting for completion of all tasks and parallel jobs executed in a task arena was suggested
and considered in the past for TBB. At that time, it was rejected. There was no good way to wait
for completion of all tasks in the arena: even if all the task queues are empty, for as long as any thread
still executes a task, new tasks might be produced. Essentially, waiting for the arena to drain
would mean waiting for all threads to leave the arena. Also, such an API was considered unsafe,
as it would result in a deadlock if accidentally called from a thread that holds a slot in the arena.

Still, supporting the "synchronize with the queue/device" pattern would in general be useful.

## Proposal

The proposal consists of two ideas that are not mutually exclusive and can be implemented independently.

### 1. Simplify the use of a task group

To address the existing shortcomings of the `task_arena` and `task_group` combination, we could add
a new overload of the `enqueue` method that takes a `task_group` as an argument, and a
`task_arena::wait_for` method that also takes a task group.

In the simplest possible implementation, it would be just header-based "syntax sugar" for the code
described above:
```cpp
ta.enqueue( []{ foo(); }, tg ); // => ta.enqueue(tg.defer([]{ foo(); }));
ta.wait_for( tg ); // => ta.execute([&tg]{ tg.wait(); });
```
If justified performance-wise, a more elaborate implementation could perhaps shave off some overhead.
It would likely require new library entry points, though.
The example code to split work across NUMA-bound task arenas could then look like this (assuming also
a special function that creates and initializes a vector of arenas):
```cpp
std::vector<tbb::task_arena> numa_arenas =
    initialize_constrained_arenas(/*some arguments*/);
std::vector<tbb::task_group> task_groups(numa_arenas.size());
for (unsigned j = 0; j < numa_arenas.size(); ++j) {
    numa_arenas[j].enqueue( []{ /*some parallel stuff*/ }, task_groups[j] );
}
for (unsigned j = 0; j < numa_arenas.size(); ++j) {
    numa_arenas[j].wait_for( task_groups[j] );
}
```

Additionally, a task group parameter can be added to `execute`, as well as to `this_task_arena::isolate`,
which is similar to `execute` but provides additional work isolation guarantees. That would make
the experimental `isolated_task_group` class fully obsolete.

### 2. Reconsider waiting for all tasks

With the redesign of the task scheduler in oneTBB, we might evaluate if there is now a reliable way
to ensure completion of all jobs in an arena. The safety concern can potentially be mitigated
by throwing an exception or returning `false`, similarly to the `tbb::finalize` function.

Implementation-wise, a waiting context (a reference counter) could be added to the internal arena class.
It would be incremented on each call to `execute`, `enqueue`, and `isolate`, and decremented after
the corresponding task completes. The trickier part is tracking parallel jobs started in
implicit thread-associated arenas; perhaps it can be done when registering and deregistering
the corresponding task group contexts. The implementation can likely be backward compatible,
as no layout or function changes in `task_arena` seem necessary. A new library entry point
would be added for the waiting function.

## Open Questions

- The API details need to be elaborated.