Skip to content

Fix l0 teardown and sycl windows teardown #17869

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 27 commits into
base: sycl
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
27 commits
Select commit Hold shift + click to select a range
cba2b47
fix for windows shutdown, minus some reviewer feedback. See sycl/doc…
cperkinsintel Feb 21, 2025
9130945
resolve cherry-pick
cperkinsintel Feb 21, 2025
97a3ab5
Merge branch 'sycl' into cperkins-windows-shutdown-fix-plus-latest-SYCL
cperkinsintel Mar 3, 2025
fc42629
fix for windows. Should probabaly be removed from Linux too, more tes…
cperkinsintel Mar 3, 2025
11795f3
Merge branch 'sycl' into cperkins-windows-shutdown-fix
cperkinsintel Mar 5, 2025
3541e75
Merge branch 'sycl' into cperkins-windows-shutdown-fix
cperkinsintel Mar 27, 2025
6da51cd
[UR][L0] Fix L0 teardown checks for stability
nrspruit Apr 2, 2025
cad11ab
[UR][L0] Verify the Loader is stable before cleanup of all handles
nrspruit Apr 2, 2025
aec3fd5
Fixed formatting
nrspruit Apr 2, 2025
10d2b78
Remove uneeded print
nrspruit Apr 2, 2025
c33beab
Use updated L0 static loader
nrspruit Apr 3, 2025
6880bf5
Fix formatting issues
nrspruit Apr 3, 2025
39d914e
Ensure bindless image is removed even if destroy fails
nrspruit Apr 3, 2025
0ab2423
Use the offical commit of the Loader
nrspruit Apr 3, 2025
c414b96
Remove interop handle tracking support
nrspruit Apr 3, 2025
78cf7a9
Fix formatting
nrspruit Apr 3, 2025
ddae783
Remove remaining use of isInteropHandle
nrspruit Apr 3, 2025
7af1f67
Merge remote-tracking branch 'chris/cperkins-windows-shutdown-fix' in…
nrspruit Apr 4, 2025
40d6a35
test fix and expand usage of checkL0LoaderTeardown to event status re…
cperkinsintel Apr 9, 2025
a4eb867
resolve merge conflicts by accepting new incoming changes, but keepin…
cperkinsintel Apr 9, 2025
dc644e5
preserving Neil's UR commit to resolve merge conflict. Hopefully that…
cperkinsintel Apr 9, 2025
55c980c
clang-format always has the last say
cperkinsintel Apr 9, 2025
c78aaa6
urQueueRelease has too many unqualified calls to ZE_CALL, both direct…
cperkinsintel Apr 10, 2025
9607f19
bumping L0 tag as instructed by Neil
cperkinsintel Apr 10, 2025
2784062
fix for last error (hopefully)
cperkinsintel Apr 14, 2025
29610fd
resolve merge conflict
cperkinsintel Apr 15, 2025
ad92047
simplify
cperkinsintel Apr 15, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
73 changes: 66 additions & 7 deletions sycl/doc/design/GlobalObjectsInRuntime.md
Original file line number Diff line number Diff line change
Expand Up @@ -55,9 +55,65 @@ Deinitialization is platform-specific. Upon application shutdown, the DPC++
runtime frees memory pointed by `GlobalHandler` global pointer, which triggers
destruction of nested `std::unique_ptr`s.

### Shutdown Tasks and Challenges

As the user's app ends, SYCL's primary goal is to release any UR adapters that
have been gotten, and teardown the plugins/adapters themselves. Additionally,
we need to stop deferring any new buffer releases and clean up any memory
whose release was deferred.

To this end, the shutdown occurs in two phases: early and late. The purpose
for early shutdown is primarily to stop any further deferring of memory release.
This is because the deferred memory release is based on threads and on Windows
the threads will be abandoned. So as soon as possible we want to stop deferring
memory and try to let go any that has been deferred. The purpose for late
shutdown is to hold onto the handles and adapters longer than the user's
application. We don't want to initiate late shutdown until after all the users
static and thread local vars have been destroyed, in case those destructors are
calling SYCL.

In the early shutdown we stop deferring, tell the scheduler to prepare for release, and
try releasing the memory that has been deferred so far. Following this, if
the user has any global or static handles to sycl objects, they'll be destroyed.
Finally, the late shutdown routine is called the last of the UR handles and
adapters are let go, as is the GlobalHandler itself.


#### Threads
The deferred memory marshalling is built on a thread pool, but there is a
challenge here in that on Windows, once the end of the users main() is reached
and their app is shutting down, the Windows OS will abandon all remaining
in-flight threads. These threads can be .join() but they simply return instantly,
the threads are not completed. Further any thread specific variables
(or thread_local static vars) will NOT have their destructors called. Note
that the standard while-loop-over-condition-var pattern will cause a hang -
we cannot "wait" on abandoned threads.
On Windows, short of adding some user called API to signal this, there is
no way to detect or avoid this. None of the "end-of-library" lifecycle events
occurs before the threads are abandoned. ( not std::atexit(), not globals or
static, or static thread_local var destruction, not DllMain(DLL_PROCESS_DETACH) )
This means that on Windows, once we arrive at shutdown_early we cannot wait on
host events or the thread pool.

For the deferred memory itself, there is no issue here. The Windows OS will
reclaim the memory for us. The issue of which we must be wary is placing UR
handles (and similar) in host threads. The RAII mechanism of unique and
shared pointers will not work in any thread that is abandoned on Windows.

One last note about threads. It is entirely the OS's discretion when to
start or schedule a thread. If the main process is very busy then it is
possible that threads the SYCL library creates (host_tasks/thread_pool)
won't even be started until AFTER the host application main() function is done.
This is not a normal occurrence, but it can happen if there is no call to queue.wait()


### Linux

On Linux DPC++ runtime uses `__attribute__((destructor))` property with low
On Linux, the "early_shutdown()" is begun by the destruction of a static
StaticVarShutdownHandler object, which is initialized by
platform::get_platforms().

late_shutdown() timing uses `__attribute__((destructor))` property with low
priority value 110. This approach does not guarantee, that `GlobalHandler`
destructor is the last thing to run, as user code may contain a similar function
with the same priority value. At the same time, users may specify priorities
Expand All @@ -72,10 +128,14 @@ times, the memory leak may impact code performance.

### Windows

To identify shutdown moment on Windows, DPC++ runtime uses default `DllMain`
function with `DLL_PROCESS_DETACH` reason. This guarantees, that global objects
deinitialization happens right before `sycl.dll` is unloaded from process
address space.
Differing from Linux, on Windows the "early_shutdown()" is begun by
DllMain(PROCESS_DETACH), unless statically linked.

The "late_shutdown()" is begun by the destruction of a
static StaticVarShutdownHandler object, which is initialized by
platform::get_platforms(). ( On linux, this is when we do "early_shutdown()".
Go figure.) This is as late as we can manage, but it is later than any user
application global, static, or thread_local variable destruction.

### Recommendations for DPC++ runtime developers

Expand Down Expand Up @@ -109,8 +169,7 @@ for (adapter in initializedAdapters) {
urLoaderTearDown();
```

Which in turn is called by either `shutdown_late()` or `shutdown_win()`
depending on platform.
Which in turn is called by `shutdown_late()`.

![](images/adapter-lifetime.jpg)

Expand Down
100 changes: 64 additions & 36 deletions sycl/source/detail/global_handler.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -37,9 +37,11 @@ using LockGuard = std::lock_guard<SpinLock>;
SpinLock GlobalHandler::MSyclGlobalHandlerProtector{};

// forward decl
void shutdown_win(); // TODO: win variant will go away soon
void shutdown_early();
void shutdown_late();
#ifdef _WIN32
BOOL isLinkedStatically();
#endif

// Utility class to track references on object.
// Used for GlobalHandler now and created as thread_local object on the first
Expand All @@ -60,10 +62,6 @@ class ObjectUsageCounter {

LockGuard Guard(GlobalHandler::MSyclGlobalHandlerProtector);
MCounter--;
GlobalHandler *RTGlobalObjHandler = GlobalHandler::getInstancePtr();
if (RTGlobalObjHandler) {
RTGlobalObjHandler->prepareSchedulerToRelease(!MCounter);
}
} catch (std::exception &e) {
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in ~ObjectUsageCounter", e);
}
Expand Down Expand Up @@ -254,24 +252,37 @@ void GlobalHandler::releaseDefaultContexts() {
MPlatformToDefaultContextCache.Inst.reset(nullptr);
}

struct EarlyShutdownHandler {
~EarlyShutdownHandler() {
// Shutdown is split into two parts. shutdown_early() stops any more
// objects from being deferred and takes an initial pass at freeing them.
// shutdown_late() finishes and releases the adapters and the GlobalHandler.
// For Windows, early shutdown is typically called from DllMain,
// and late shutdown is here.
// For Linux, early shutdown is here, and late shutdown is called from
// a low priority destructor.
struct StaticVarShutdownHandler {

~StaticVarShutdownHandler() {
try {
#ifdef _WIN32
// on Windows we keep to the existing shutdown procedure
GlobalHandler::instance().releaseDefaultContexts();
// If statically linked, DllMain will not be called. So we do its work
// here.
if (isLinkedStatically()) {
shutdown_early();
}

shutdown_late();
#else
shutdown_early();
#endif
} catch (std::exception &e) {
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in ~EarlyShutdownHandler",
e);
__SYCL_REPORT_EXCEPTION_TO_STREAM(
"exception in ~StaticVarShutdownHandler", e);
}
}
};

void GlobalHandler::registerEarlyShutdownHandler() {
static EarlyShutdownHandler handler{};
void GlobalHandler::registerStaticVarShutdownHandler() {
static StaticVarShutdownHandler handler{};
}

bool GlobalHandler::isOkToDefer() const { return OkToDefer; }
Expand Down Expand Up @@ -304,35 +315,29 @@ void GlobalHandler::prepareSchedulerToRelease(bool Blocking) {
#ifndef _WIN32
if (Blocking)
drainThreadPool();
#endif
if (MScheduler.Inst)
MScheduler.Inst->releaseResources(Blocking ? BlockingT::BLOCKING
: BlockingT::NON_BLOCKING);
#endif
}

void GlobalHandler::drainThreadPool() {
if (MHostTaskThreadPool.Inst)
MHostTaskThreadPool.Inst->drain();
}

#ifdef _WIN32
// because of something not-yet-understood on Windows
// threads may be shutdown once the end of main() is reached
// making an orderly shutdown difficult. Fortunately, Windows
// itself is very aggressive about reclaiming memory. Thus,
// we focus solely on unloading the adapters, so as to not
// accidentally retain device handles. etc
void shutdown_win() {
GlobalHandler *&Handler = GlobalHandler::getInstancePtr();
Handler->unloadAdapters();
}
#else
void shutdown_early() {
const LockGuard Lock{GlobalHandler::MSyclGlobalHandlerProtector};
GlobalHandler *&Handler = GlobalHandler::getInstancePtr();
if (!Handler)
return;

#if defined(XPTI_ENABLE_INSTRUMENTATION) && defined(_WIN32)
if (xptiTraceEnabled())
return; // When doing xpti tracing, we can't safely shutdown on Win.
// TODO: figure out why XPTI prevents release.
#endif

// Now that we are shutting down, we will no longer defer MemObj releases.
Handler->endDeferredRelease();

Expand All @@ -354,6 +359,12 @@ void shutdown_late() {
if (!Handler)
return;

#if defined(XPTI_ENABLE_INSTRUMENTATION) && defined(_WIN32)
if (xptiTraceEnabled())
return; // When doing xpti tracing, we can't safely shutdown on Win.
// TODO: figure out why XPTI prevents release.
#endif

// First, release resources, that may access adapters.
Handler->MPlatformCache.Inst.reset(nullptr);
Handler->MScheduler.Inst.reset(nullptr);
Expand All @@ -370,7 +381,6 @@ void shutdown_late() {
delete Handler;
Handler = nullptr;
}
#endif

#ifdef _WIN32
extern "C" __SYCL_EXPORT BOOL WINAPI DllMain(HINSTANCE hinstDLL,
Expand All @@ -391,23 +401,18 @@ extern "C" __SYCL_EXPORT BOOL WINAPI DllMain(HINSTANCE hinstDLL,
if (PrintUrTrace)
std::cout << "---> DLL_PROCESS_DETACH syclx.dll\n" << std::endl;

#ifdef XPTI_ENABLE_INSTRUMENTATION
if (xptiTraceEnabled())
return TRUE; // When doing xpti tracing, we can't safely call shutdown.
// TODO: figure out what XPTI is doing that prevents
// release.
#endif

try {
shutdown_win();
shutdown_early();
} catch (std::exception &e) {
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in shutdown_win", e);
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in DLL_PROCESS_DETACH", e);
return FALSE;
}

break;
case DLL_PROCESS_ATTACH:
if (PrintUrTrace)
std::cout << "---> DLL_PROCESS_ATTACH syclx.dll\n" << std::endl;

break;
case DLL_THREAD_ATTACH:
break;
Expand All @@ -416,6 +421,29 @@ extern "C" __SYCL_EXPORT BOOL WINAPI DllMain(HINSTANCE hinstDLL,
}
return TRUE; // Successful DLL_PROCESS_ATTACH.
}
BOOL isLinkedStatically() {
// If the exePath is the same as the dllPath,
// or if the module handle for DllMain is not retrievable,
// then we are linked statically
// Otherwise we are dynamically linked or loaded.
HMODULE hModule = nullptr;
auto LpModuleAddr = reinterpret_cast<LPCSTR>(&DllMain);
if (!GetModuleHandleExA(GET_MODULE_HANDLE_EX_FLAG_FROM_ADDRESS, LpModuleAddr,
&hModule)) {
return true; // not retrievable, therefore statically linked
} else {
char dllPath[MAX_PATH];
if (GetModuleFileNameA(hModule, dllPath, MAX_PATH)) {
char exePath[MAX_PATH];
if (GetModuleFileNameA(NULL, exePath, MAX_PATH)) {
if (std::string(dllPath) == std::string(exePath)) {
return true; // paths identical, therefore statically linked
}
}
}
}
return false; // Otherwise dynamically linked or loaded
}
#else
// Setting low priority on destructor ensures it runs after all other global
// destructors. Priorities 0-100 are reserved by the compiler. The priority
Expand Down
3 changes: 1 addition & 2 deletions sycl/source/detail/global_handler.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -73,7 +73,7 @@ class GlobalHandler {
XPTIRegistry &getXPTIRegistry();
ThreadPool &getHostTaskThreadPool();

static void registerEarlyShutdownHandler();
static void registerStaticVarShutdownHandler();

bool isOkToDefer() const;
void endDeferredRelease();
Expand All @@ -95,7 +95,6 @@ class GlobalHandler {

bool OkToDefer = true;

friend void shutdown_win();
friend void shutdown_early();
friend void shutdown_late();
friend class ObjectUsageCounter;
Expand Down
9 changes: 9 additions & 0 deletions sycl/source/detail/host_task.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
#pragma once

#include <detail/cg.hpp>
#include <detail/global_handler.hpp>
#include <sycl/detail/cg_types.hpp>
#include <sycl/handler.hpp>
#include <sycl/interop_handle.hpp>
Expand All @@ -33,6 +34,10 @@ class HostTask {
bool isInteropTask() const { return !!MInteropTask; }

void call(HostProfilingInfo *HPI) {
if (!GlobalHandler::instance().isOkToDefer()) {
return;
}

if (HPI)
HPI->start();
MHostTask();
Expand All @@ -41,6 +46,10 @@ class HostTask {
}

void call(HostProfilingInfo *HPI, interop_handle handle) {
if (!GlobalHandler::instance().isOkToDefer()) {
return;
}

if (HPI)
HPI->start();
MInteropTask(handle);
Expand Down
2 changes: 1 addition & 1 deletion sycl/source/detail/platform_impl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -166,7 +166,7 @@ std::vector<platform> platform_impl::get_platforms() {

// This initializes a function-local variable whose destructor is invoked as
// the SYCL shared library is first being unloaded.
GlobalHandler::registerEarlyShutdownHandler();
GlobalHandler::registerStaticVarShutdownHandler();

return Platforms;
}
Expand Down
2 changes: 2 additions & 0 deletions sycl/source/detail/program_manager/program_manager.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3846,7 +3846,9 @@ extern "C" void __sycl_register_lib(sycl_device_binaries desc) {
// Executed as a part of current module's (.exe, .dll) static initialization
extern "C" void __sycl_unregister_lib(sycl_device_binaries desc) {
// Partial cleanup is not necessary at shutdown
#ifndef _WIN32
if (!sycl::detail::GlobalHandler::instance().isOkToDefer())
return;
sycl::detail::ProgramManager::getInstance().removeImages(desc);
#endif
}
12 changes: 11 additions & 1 deletion sycl/source/detail/queue_impl.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -264,7 +264,17 @@ class queue_impl {
destructorNotification();
#endif
throw_asynchronous();
getAdapter()->call<UrApiKind::urQueueRelease>(MQueue);
auto status =
getAdapter()->call_nocheck<UrApiKind::urQueueRelease>(MQueue);
// If loader is already closed, it'll return a not-initialized status
// which the UR should convert to SUCCESS code. But that isn't always
// working on Windows. This is a temporary workaround until that is fixed.
// TODO: Remove this workaround when UR is fixed, and restore
// ->call<>() instead of ->call_nocheck<>() above.
if (status != UR_RESULT_SUCCESS &&
status != UR_RESULT_ERROR_UNINITIALIZED) {
__SYCL_CHECK_UR_CODE_NO_EXC(status);
}
} catch (std::exception &e) {
__SYCL_REPORT_EXCEPTION_TO_STREAM("exception in ~queue_impl", e);
}
Expand Down
Loading