diff --git a/rfcs/proposed/task_arena_waiting/readme.md b/rfcs/proposed/task_arena_waiting/readme.md
new file mode 100644
index 0000000000..c546e153f9
--- /dev/null
+++ b/rfcs/proposed/task_arena_waiting/readme.md
@@ -0,0 +1,135 @@
# Wait for work completion in a task_arena

## Motivation

Task arenas in oneTBB are the places for threads to share and execute tasks.
A `task_arena` instance represents a configurable execution context for parallel work.

There are two primary ways to submit work to an arena: the `execute` and `enqueue` functions.
Both take a callable object and run it in the context of the arena. The callable object
might start more parallel work in the arena by invoking a oneTBB algorithm, running a flow graph,
or submitting work into a task group.
`execute` is a blocking call: the calling thread does not return until the callable object
completes. `enqueue` is a "fire-and-forget" call: the calling thread submits the callable
object as a task and returns immediately, providing no way to synchronize with the completion
of that task.

In [the oneDPL case study of asynchronous algorithms](https://github.com/uxlfoundation/oneDPL/tree/main/rfcs/archived/asynchronous_api_general#use-case-study),
three main usage patterns were observed, each assuming subsequent synchronization either with
the asynchronous job or with all jobs previously submitted into a work queue or device.
A task arena can be considered analogous to a device or a work queue in these patterns,
but it lacks synchronization capabilities and so cannot alone support these use cases;
for that, it has to be paired with something "waitable".

### Current solution: combining with a task group

In oneTBB, asynchronous execution is supported by `task_group` and the flow graph API; both allow
submitting a job and waiting for its completion later. The flow graph is more suitable for building
and executing graphs of dependent computations, while the `task_group`, as of now, allows starting
and waiting for independent tasks. Notably, both require calling `wait`/`wait_for_all` to ensure that
the work will be done. `task_arena::enqueue`, on the other hand, being "fire-and-forget",
guarantees that a thread will eventually become available in the arena to execute the task.

So, a reasonable solution for the described use cases seems to be combining a `task_arena` with
a `task_group`. However, it is notoriously non-trivial to do right. For example, the following
"naive" attempt is subtly incorrect:
```cpp
tbb::task_arena ta{/*args*/};
tbb::task_group tg;
ta.enqueue([&tg]{ tg.run([]{ foo(); }); }); // tg.run happens asynchronously, at some later point
bar();
ta.execute([&tg]{ tg.wait(); }); // might return before foo() even starts
```
The problem is that `enqueue` submits a task that calls `tg.run` to add `[]{ foo(); }` to the task group,
but it is unknown whether that task has actually executed prior to `tg.wait`. Simply put,
the task group might still be empty, in which case `tg.wait` exits prematurely.

To avoid that, `execute` can be used instead of `enqueue`, but then the mentioned
thread availability guarantee is lost. The approach with `execute` is shown in the
[oneTBB Developer Guide](https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Guiding_Task_Scheduler_Execution.html)
as an example of splitting work across several NUMA domains. The example utilizes the fork-join
synchronization pattern described in the mentioned oneDPL document to ensure that the work is complete
in all the arenas. It also illustrates that the problem stated in this proposal is relevant.
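
For reference, the pattern from the Developer Guide boils down to roughly the following shape
(a condensed sketch rather than the guide's verbatim code; `do_work()` stands in for the actual
parallel computation):
```cpp
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/task_arena.h>
#include <oneapi/tbb/task_group.h>
#include <vector>

void do_work();  // placeholder for the per-domain parallel work

void run_on_numa_domains() {
    std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes();
    std::vector<tbb::task_arena> arenas(numa_nodes.size());
    std::vector<tbb::task_group> task_groups(numa_nodes.size());

    // Bind one arena to each NUMA domain.
    for (std::size_t i = 0; i < numa_nodes.size(); ++i)
        arenas[i].initialize(tbb::task_arena::constraints(numa_nodes[i]));

    // Fork: execute (not enqueue) guarantees each task group is populated
    // before the join phase starts.
    for (std::size_t i = 0; i < numa_nodes.size(); ++i)
        arenas[i].execute([&task_groups, i]{ task_groups[i].run([]{ do_work(); }); });

    // Join: wait for each task group inside its arena.
    for (std::size_t i = 0; i < numa_nodes.size(); ++i)
        arenas[i].execute([&task_groups, i]{ task_groups[i].wait(); });
}
```
Because `execute` is used in the fork loop, each task group is populated synchronously before
the join loop runs; this is exactly where the thread availability guarantee of `enqueue` is
traded away.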

A better way of using these classes together, however, is the following:
```cpp
tbb::task_arena ta{/*args*/};
tbb::task_group tg;
ta.enqueue(tg.defer([]{ foo(); })); // the deferred task is registered immediately
bar();
ta.execute([&tg]{ tg.wait(); });
```
In this case, the task group "registers" a deferred task to run `foo()`, which is then enqueued
to the task arena. The task is added by the calling thread, so we can be sure that `tg.wait` will not
return until the task completes.

While combining `task_arena` with `task_group` is a viable solution, it would be good to avoid
the extra complexity and verbosity it currently requires.

### Waiting for all tasks in the arena

The idea to wait for completion of all tasks and parallel jobs executed in a task arena was suggested
and considered in the past for TBB. At that time, it was rejected. There was no good way to wait
for completion of all tasks in the arena: even if all the task queues are empty, for as long as any thread
still executes a task, new tasks might be produced. Essentially, waiting for the arena to drain
would mean waiting for all threads to leave the arena. Such an API was also considered unsafe,
as it would result in a deadlock if accidentally called from a thread that holds a slot in the arena.

Still, supporting the "synchronize with the queue/device" pattern would in general be useful.

## Proposal

The proposal consists of two ideas that are not mutually exclusive and can be implemented independently.

### 1. Simplify the use of a task group

To address the existing shortcomings of the `task_arena` and `task_group` combination, we could add
a new overload of the `enqueue` method that takes a `task_group` as an argument, and add
a `task_arena::wait_for` method that also takes a task group.

In the simplest possible implementation, it would be just header-based "syntax sugar" for the code
described above:
```cpp
ta.enqueue( []{ foo(); }, tg ); // => ta.enqueue(tg.defer([]{ foo(); }));
ta.wait_for( tg );              // => ta.execute([&tg]{ tg.wait(); });
```
If justified performance-wise, a more elaborate implementation could perhaps shave off some overheads.
It would likely require new library entry points, though.

The example code to split work across NUMA-bound task arenas could then look like this (assuming also
a special function that creates and initializes a vector of arenas):
```cpp
std::vector<tbb::task_arena> numa_arenas =
    initialize_constrained_arenas(/*some arguments*/);
std::vector<tbb::task_group> task_groups(numa_arenas.size());

for(unsigned j = 0; j < numa_arenas.size(); j++) {
    numa_arenas[j].enqueue( []{/*some parallel stuff*/}, task_groups[j] );
}

for(unsigned j = 0; j < numa_arenas.size(); j++) {
    numa_arenas[j].wait_for( task_groups[j] );
}
```

Additionally, a task group parameter can be added to `execute` as well as to `this_task_arena::isolate`,
which is similar to `execute` but provides additional work isolation guarantees. That would make
the experimental `isolated_task_group` class fully obsolete.

### 2. Reconsider waiting for all tasks

With the redesign of the task scheduler in oneTBB, we might evaluate whether there is now a reliable way
to ensure completion of all jobs in an arena. The safety concern can potentially be mitigated
by throwing an exception or returning `false`, similarly to the `tbb::finalize` function.

Implementation-wise, a waiting context (reference counter) could be added to the internal arena class.
It would be incremented on each call to `execute`, `enqueue`, and `isolate`, and decremented after
the corresponding task is complete. The trickier part is to track parallel jobs started in
implicit thread-associated arenas; perhaps it can be done when registering and deregistering
the corresponding task group contexts. The implementation can likely be backward compatible,
as no layout or function changes in `task_arena` seem necessary. A new library entry point
would be added for the waiting function.
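
To make the counting idea concrete, below is a minimal standalone sketch of such a waiting context
(an illustration only, not oneTBB internals; the class name, its hooks, and the boolean in-arena
check are all invented for this example):
```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Illustration only: a standalone stand-in for the proposed waiting context.
// A real implementation would live inside oneTBB's internal arena class and
// reuse the scheduler's wait machinery instead of a condition variable.
class waiting_context {
    std::atomic<long> m_refs{0};
    std::mutex m_mutex;
    std::condition_variable m_cond;
public:
    // Called when execute/enqueue/isolate submits work to the arena.
    void on_submit() { m_refs.fetch_add(1, std::memory_order_relaxed); }

    // Called after the corresponding task completes.
    void on_complete() {
        if (m_refs.fetch_sub(1, std::memory_order_acq_rel) == 1) {
            std::lock_guard<std::mutex> lock(m_mutex); // avoid a lost wakeup
            m_cond.notify_all();
        }
    }

    // Returns false instead of deadlocking when the caller occupies a slot
    // in the arena; how that is detected is left out of this sketch.
    bool wait_all(bool caller_holds_arena_slot) {
        if (caller_holds_arena_slot)
            return false;
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cond.wait(lock, [this]{ return m_refs.load() == 0; });
        return true;
    }
};
```
The invariant is the same whatever the actual blocking mechanism: the waiter may only return once
the counter drops to zero, i.e., once all tracked work submitted to the arena has completed.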

## Open Questions

- API details need to be elaborated; an illustrative sketch of possible signatures is given below
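
For illustration only, the additions discussed above might be declared along the following lines
(a non-normative class fragment with hypothetical signatures, shown purely to anchor the discussion):
```cpp
// Hypothetical fragment; none of these declarations exist in oneTBB today.
class task_arena {
public:
    // Idea 1: enqueue a callable as a task tracked by the given task group,
    // equivalent to enqueue(tg.defer(std::forward<F>(f))).
    template <typename F>
    void enqueue(F&& f, task_group& tg);

    // Idea 1: block until the task group completes, joining the arena to
    // help with its work; equivalent to execute([&tg]{ tg.wait(); }).
    void wait_for(task_group& tg);

    // Idea 2 (one possible shape): block until all work submitted to this
    // arena completes; returns false (or throws) when called from a thread
    // that holds a slot in the arena, to avoid self-deadlock.
    bool wait_for_all();
};
```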