RFC for the task arena waiting functionality #1617

File: `rfcs/proposed/task_arena_waiting/readme.md`
# Wait for work completion in a task_arena

## Motivation

Task arenas in oneTBB are the places for threads to share and execute tasks.
A `task_arena` instance represents a configurable execution context for parallel work.

There are two primary ways to submit work to an arena: the `execute` and `enqueue` functions.
Both take a callable object and run it in the context of the arena. The callable object
might start more parallel work in the arena by invoking a oneTBB algorithm, running a flow graph,
or submitting work into a task group.
`execute` is a blocking call: the calling thread does not return until the callable object
completes. `enqueue` is a “fire-and-forget” call: the calling thread submits the callable
object as a task and returns immediately, providing no way to synchronize with the completion
of the task.

In [the oneDPL case study of asynchronous algorithms](https://github.com/uxlfoundation/oneDPL/tree/main/rfcs/archived/asynchronous_api_general#use-case-study)
three main usage patterns were observed, each assuming subsequent synchronization either with
the asynchronous job or with all jobs previously submitted into a work queue or device.
A task arena can be considered analogous to a device or a work queue in these patterns,
but it lacks synchronization capabilities and so cannot support these use cases on its own;
for that, it has to be paired with something "waitable".

### Current solution: combining with a task group

In oneTBB, asynchronous execution is supported by `task_group` and the flow graph API; both allow
submitting a job and waiting for its completion later. The flow graph is more suitable for building
and executing graphs of dependent computations, while the `task_group`, as of now, allows starting
and waiting for independent tasks. Notably, both require calling `wait`/`wait_for_all` to ensure that
the work gets done. `task_arena::enqueue`, on the other hand, being "fire-and-forget", guarantees
the eventual availability of a thread in the arena to execute the task.

So, a reasonable solution for the described use cases seems to be combining a `task_arena` with
a `task_group`. However, it is notoriously non-trivial to do this right. For example, the following
"naive" attempt is subtly incorrect:
```cpp
tbb::task_arena ta{/*args*/};
tbb::task_group tg;
ta.enqueue([&tg]{ tg.run([]{ foo(); }); });
bar();
ta.execute([&tg]{ tg.wait(); });
```
The problem is that `enqueue` submits a task that calls `tg.run` to add `[]{ foo(); }` to the task group,
but it is unknown whether that task has actually executed prior to `tg.wait`. Simply put,
the task group might still be empty, in which case `tg.wait` returns prematurely.
To avoid that, `execute` can be used instead of `enqueue`, but then the mentioned
thread availability guarantee is lost. The approach with `execute` is shown in the
[oneTBB Developer Guide](https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Guiding_Task_Scheduler_Execution.html)
as an example of splitting work across several NUMA domains. The example utilizes the fork-join
synchronization pattern described in the mentioned oneDPL document to ensure that the work is complete
in all the arenas. It also illustrates that the problem stated in this proposal is relevant.
A better way of using these classes together, however, is the following:
```cpp
tbb::task_arena ta{/*args*/};
tbb::task_group tg;
ta.enqueue(tg.defer([]{ foo(); }));
bar();
ta.execute([&tg]{ tg.wait(); });
```
In this case, the task group "registers" a deferred task to run `foo()`, which is then enqueued
to the task arena. The task is added by the calling thread, so we can be sure that `tg.wait` will not
return until the task completes.

While combining `task_arena` with `task_group` is a viable solution, it would be good to avoid
the extra complexity and verbosity it currently requires.

### Waiting for all tasks in the arena

The idea of waiting for completion of all tasks and parallel jobs executed in a task arena was suggested
and considered in the past for TBB. At that time, it was rejected. There was no good way to wait
for completion of all tasks in the arena: even if all the task queues are empty, for as long as any thread
still executes a task, new tasks might be produced. Essentially, waiting for the arena to drain
would mean waiting for all threads to leave the arena. Also, such an API was considered unsafe,
as it would result in a deadlock if accidentally called from a thread that holds a slot in the arena.

Still, supporting the "synchronize with the queue/device" pattern would in general be useful.

## Proposal

The proposal consists of two ideas that are not mutually exclusive and can be implemented independently.

### 1. Simplify the use of a task group

To address the existing shortcomings of the `task_arena` and `task_group` combination, we could add
a new overload of the `enqueue` method that takes a `task_group` as an argument, and a
`task_arena::wait_for` method that also takes a task group.

In the simplest possible implementation, it would be just header-based "syntax sugar" for the code
described above:
```cpp
ta.enqueue( []{ foo(); }, tg ); // => ta.enqueue(tg.defer([]{ foo(); }));
ta.wait_for( tg ); // => ta.execute([&tg]{ tg.wait(); });
```
If justified performance-wise, a more elaborate implementation could perhaps shave off some overhead.
It would likely require new library entry points, though.
The example code to split work across NUMA-bound task arenas could then look like this (assuming also
a special function that creates and initializes a vector of arenas):
```cpp
std::vector<tbb::task_arena> numa_arenas =
    initialize_constrained_arenas(/*some arguments*/);
std::vector<tbb::task_group> task_groups(numa_arenas.size());
for (unsigned j = 0; j < numa_arenas.size(); ++j) {
    numa_arenas[j].enqueue( []{ /*some parallel stuff*/ }, task_groups[j] );
}
for (unsigned j = 0; j < numa_arenas.size(); ++j) {
    numa_arenas[j].wait_for( task_groups[j] );
}
```

Additionally, a task group parameter can be added to `execute`, as well as to `this_task_arena::isolate`,
which is similar to `execute` but provides additional work isolation guarantees. That would make
the experimental `isolated_task_group` class fully obsolete.

### 2. Reconsider waiting for all tasks

With the redesign of the task scheduler in oneTBB, we might evaluate if there is now a reliable way
to ensure completion of all jobs in an arena. The safety concern can potentially be mitigated
by throwing an exception or returning `false`, similarly to the `tbb::finalize` function.

Implementation-wise, a waiting context (a reference counter) could be added to the internal arena class.
It would be incremented on each call to `execute`, `enqueue`, and `isolate`, and decremented after
the corresponding task completes. The trickier part is tracking parallel jobs started in
implicit thread-associated arenas; perhaps it can be done when registering and deregistering
the corresponding task group contexts. The implementation can likely be backward compatible,
as no layout or function changes in `task_arena` seem necessary. A new library entry point
would be added for the waiting function.

## Open Questions

- The API details need to be elaborated.