[pathfinder] `RTLD_DI_LINKMAP`-based new implementation of `abs_path_for_dynamic_library()` #834
…eed for EXPECTED_LIB_SYMBOLS
…m check_if_already_loaded_from_elsewhere(), for consistency with load_with_system_search()
…s just an oversight)
/ok to test
/ok to test
LIBDL.dladdr.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
LIBDL.dladdr.restype = ctypes.c_int
LIBDL.dlinfo.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p]
LIBDL.dlinfo.restype = ctypes.c_int
I am a bit nervous about us using either `dladdr` or `dlinfo` since none of them is POSIX-compliant. Since `dladdr` has been used, switching to `dlinfo` does not make it more "non-compliant", so it's still a step forward. We just have to see if it runs everywhere...
since none of them is POSIX-compliant
Do you think the ChatGPT conclusion below is reasonable?
My ChatGPT prompt:
This makes me want to ask:
Does CUDA run on any systems that are not Linux or Windows?
(Windows dynamic loading is handled separately. I.e. basically I'm only asking: does CUDA run on any Unix system that is not Linux?)
Do you know?
ChatGPT said:
Thought for 19s
Short answer: No. Today CUDA is officially supported on Linux (various distros/arches, including aarch64/Jetson) and Windows (including WSL2 Linux environments on Windows). There’s no current macOS runtime support, and no official support for other Unix-like OSes (e.g., *BSD).
So if your scope is “Linux only,” it’s reasonable to lean on glibc-specific behavior (e.g., dlinfo)—CUDA isn’t targeting musl/Alpine or non-Linux Unix platforms.
# First appeared in 2004-era glibc. Universally correct on Linux for all practical purposes.
RTLD_DI_LINKMAP = 2
Q: Can't we just use `RTLD_DI_ORIGIN` and get the path back without populating the full struct (since we just need the path)?
I reduced the struct to just what we need (commit be20782).
I was unaware that ctypes allows for incomplete struct definitions. It's FFI-based, so I assume it needs to know the full struct layout/size for ABI-based operations. This is nerve-racking...
I think because we aren't ever allocating a struct we never need to know the full size of it and only need to know the needed member offsets.
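To illustrate the point about offsets, here is a minimal sketch. The class names `LinkMapPrefix` and `LinkMapPublic` are hypothetical and not taken from the PR. It only shows that, in ctypes, the offset of a member depends on the members declared before it, so leaving out trailing members does not move `l_name`.

```python
import ctypes


class LinkMapPrefix(ctypes.Structure):
    # Hypothetical pared-down view: only the members up to l_name.
    _fields_ = (
        ("l_addr", ctypes.c_void_p),  # ElfW(Addr), a pointer-sized integer
        ("l_name", ctypes.c_char_p),  # char *
    )


class LinkMapPublic(ctypes.Structure):
    # Hypothetical view of all five members shown in today's elf/link.h.
    _fields_ = (
        ("l_addr", ctypes.c_void_p),
        ("l_name", ctypes.c_char_p),
        ("l_ld", ctypes.c_void_p),
        ("l_next", ctypes.c_void_p),
        ("l_prev", ctypes.c_void_p),
    )


PTR_SIZE = ctypes.sizeof(ctypes.c_void_p)

# The offset of l_name is identical in both views.
prefix_offset = LinkMapPrefix.l_name.offset
public_offset = LinkMapPublic.l_name.offset

# sizeof() differs, but that only matters if we allocate or copy the struct,
# which a read through a pointer returned by dlinfo() never does.
prefix_size = ctypes.sizeof(LinkMapPrefix)
public_size = ctypes.sizeof(LinkMapPublic)
```

The caveat is that the pared-down class must never be instantiated and handed to code that expects a full `struct link_map`.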
Thanks @kkraus14 — yes, exactly.
That’s what I had in mind when I pared down the table.
The use of ctypes here is in line with what we do elsewhere (e.g., `cuda.cccl`). The big difference is that we don’t own `struct link_map`.
So last night (TH TZ) I started drilling down to be sure we’re on solid ground. Two questions I wanted to answer:
Q1: How stable has `struct link_map` been over the years?
Q2: Is paring down the `ctypes` definition for `struct link_map` valid for our purposes?
Side remark: ChatGPT 5 is vastly more thorough than ChatGPT 4, at the cost of longer waits (I’ve seen >2 min of "thinking" for one question; with ChatGPT 4 most responses came back almost instantly).
Re Q1: I started with (AI-assisted) legwork, looking at the `struct link_map` source code. This is what I found:
git clone https://sourceware.org/git/glibc.git
commit d66e34cd423425c348bcc83df127dd19711b0b9a
Date: Tue May 2 06:35:55 1995 +0000
git show d66e34cd423425c348bcc83df127dd19711b0b9a:elf/link.h
struct link_map
{
/* These first few members are part of the protocol with the debugger.
This is the same format used in SVR4. */
Elf32_Addr l_addr; /* Base address shared object is loaded at. */
char *l_name; /* Absolute file name object was found in. */
Elf32_Dyn *l_ld; /* Dynamic section of the shared object. */
struct link_map *l_next, *l_prev; /* Chain of loaded objects. */
/* All following members are internal to the dynamic linker.
They may change without notice. */
const char *l_libname; /* Name requested (before search). */
Elf32_Dyn *l_info[DT_NUM]; /* Indexed pointers to dynamic section. */
const Elf32_Phdr *l_phdr; /* Pointer to program header table in core. */
Elf32_Word l_phnum; /* Number of program header entries. */
/* Symbol hash table. */
Elf32_Word l_nbuckets;
const Elf32_Word *l_buckets, *l_chain;
unsigned int l_opencount; /* Reference count for dlopen/dlclose. */
enum /* Where this object came from. */
{
lt_executable, /* The main executable program. */
lt_interpreter, /* The interpreter: the dynamic linker. */
lt_library, /* Library needed by main executable. */
lt_loaded, /* Extra run-time loaded shared object. */
} l_type:2;
unsigned int l_deps_loaded:1; /* Nonzero if DT_NEEDED items loaded. */
unsigned int l_relocated:1; /* Nonzero if object's relocations done. */
unsigned int l_init_called:1; /* Nonzero if DT_INIT function called. */
unsigned int l_init_running:1; /* Nonzero while DT_INIT function runs. */
};
commit 2642002380aafb71a1d3b569b6d7ebeab3284816
Date: Wed Jan 1 10:14:45 2025 -0800
git show 2642002380aafb71a1d3b569b6d7ebeab3284816:elf/link.h
struct link_map
{
/* These first few members are part of the protocol with the debugger.
This is the same format used in SVR4. */
ElfW(Addr) l_addr; /* Difference between the address in the ELF
file and the addresses in memory. */
char *l_name; /* Absolute file name object was found in. */
ElfW(Dyn) *l_ld; /* Dynamic section of the shared object. */
struct link_map *l_next, *l_prev; /* Chain of loaded objects. */
};
To make this more concrete, on my Ubuntu 24.04 workstation I get:
cc -E /usr/include/link.h
typedef uint64_t Elf64_Addr;
struct link_map
{
Elf64_Addr l_addr;
char *l_name;
Elf64_Dyn *l_ld;
struct link_map *l_next, *l_prev;
};
Conclusion: The "first few members" of `struct link_map` have been stable for ~30 years.
Re Q2: Because we want to use `ctypes` (and not compile a small C helper), we need to rely on pointer arithmetic:
- `dlinfo()` gives us a `struct link_map` pointer.
- We add `offsetof(struct link_map, l_name)` and dereference the resulting pointer.
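The two steps above can be sketched without calling `dlinfo()` at all, by running the same pointer arithmetic against a stand-in object. `FakeLinkMap` and `read_l_name` are hypothetical names used only for this illustration. In real use the address would come from `dlinfo()` and the memory would be owned by the dynamic linker.

```python
import ctypes


class FakeLinkMap(ctypes.Structure):
    # Stand-in with the same leading layout as struct link_map (illustration only).
    _fields_ = (
        ("l_addr", ctypes.c_void_p),
        ("l_name", ctypes.c_char_p),
    )


def read_l_name(lm_addr: int) -> bytes:
    """Read l_name given the address of a struct link_map-like object."""
    # offsetof(struct link_map, l_name) == sizeof(void *), because l_addr
    # comes first and is a pointer-sized integer.
    l_name_field_addr = lm_addr + ctypes.sizeof(ctypes.c_void_p)
    # Dereference the field to obtain the char * value stored in it.
    l_name_addr = ctypes.c_void_p.from_address(l_name_field_addr).value
    if not l_name_addr:
        return b""  # NULL pointer: no name recorded
    # Read the NUL-terminated byte string the char * points at.
    return ctypes.string_at(l_name_addr)


fake = FakeLinkMap(l_addr=0x1000, l_name=b"/usr/lib/libexample.so.1")
via_arithmetic = read_l_name(ctypes.addressof(fake))
via_struct = fake.l_name  # what lm_ptr.contents.l_name would yield
```

Both routes read the same bytes. The struct-based access simply lets ctypes compute the offset.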
That's exactly what the previous `ctypes`-based `lm_ptr.contents.l_name` did. However, ChatGPT suggested an alternative that makes the pointer arithmetic explicit. That is now implemented in commit ee5d05e. With that, `LinkMap` is gone completely.
While I was at it, I also ensured the `.decode()`s in `load_dl_linux.py` will not raise `UnicodeDecodeError` (same commit, plus commit 97f1c36), and I added a test to ensure the `abs_path` is actually absolute, and does in fact exist (commit 5b2012a). That should make it obvious if our assumptions about `struct link_map` ever get violated.
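For reference, a rough end-to-end sketch of the idea on Linux with glibc follows. It is not the code from this PR. The names `LinkMapPrefix` and `abs_path_via_dlinfo`, and the choice of loading `libdl.so.2` to reach `dlinfo`, are assumptions made for the example. The final check mirrors the intent of the test described above: the result must be absolute and must exist.

```python
import ctypes
import os

RTLD_DI_LINKMAP = 2  # request code from glibc's <dlfcn.h>


class LinkMapPrefix(ctypes.Structure):
    # Only the leading members we read; never instantiate this class.
    _fields_ = (
        ("l_addr", ctypes.c_void_p),
        ("l_name", ctypes.c_char_p),
    )


def abs_path_via_dlinfo(lib: ctypes.CDLL) -> str:
    """Return the absolute path of an already-loaded library (glibc only)."""
    libdl = ctypes.CDLL("libdl.so.2")  # assumption: dlinfo is reachable here
    libdl.dlinfo.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p]
    libdl.dlinfo.restype = ctypes.c_int

    lm_ptr = ctypes.POINTER(LinkMapPrefix)()  # receives the struct link_map *
    rc = libdl.dlinfo(lib._handle, RTLD_DI_LINKMAP, ctypes.byref(lm_ptr))
    if rc != 0 or not lm_ptr:
        raise OSError("dlinfo(RTLD_DI_LINKMAP) failed")

    raw = lm_ptr.contents.l_name  # bytes, or None for a NULL char *
    if not raw:
        raise OSError("link_map.l_name is empty")

    path = os.fsdecode(raw)  # tolerant of non-UTF-8 bytes on POSIX
    # Sanity check: would expose a violated layout assumption early.
    if not os.path.isabs(path) or not os.path.isfile(path):
        raise OSError(f"unexpected l_name: {path!r}")
    return path
```

A call such as `abs_path_via_dlinfo(ctypes.CDLL("libm.so.6"))` would be expected to return the resolved location of the math library on a glibc system.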
Returns:
The absolute path to the library file, or None if no expected symbol is found
raise OSError(f"abs_path_for_dynamic_library failed for {libname=!r}")
One potential improvement for error handling is to subsequently call `dlerror` to get the error message if -1 is returned by `dladdr`/`dlinfo`.
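A minimal sketch of that suggestion follows. The helper names `dl_error_message` and `check_dlinfo_result` are hypothetical. The sketch assumes the glibc conventions as described in the Linux man pages: `dlinfo()` returns 0 on success and -1 on failure, while `dladdr()` signals failure with 0 and does not make a message available through `dlerror()`. For that reason the sketch covers only the `dlinfo()` case.

```python
import ctypes
import os


def dl_error_message(libdl: ctypes.CDLL) -> str:
    """Return the pending dlerror() text, or a fallback if there is none."""
    libdl.dlerror.argtypes = []
    libdl.dlerror.restype = ctypes.c_char_p
    msg = libdl.dlerror()  # reading the message also clears it
    return os.fsdecode(msg) if msg else "no dlerror() message available"


def check_dlinfo_result(libdl: ctypes.CDLL, rc: int, libname: str) -> None:
    """Raise OSError carrying the dlerror() text if dlinfo() reported failure."""
    if rc != 0:
        raise OSError(f"dlinfo failed for {libname=!r}: {dl_error_message(libdl)}")
```

Calling `dlerror()` immediately after the failing call matters, because any later `dl*` call may overwrite or clear the message.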
Done, also commit be20782.
/ok to test |
LGTM
…r arithmetic instead. Use `os.fsdecode()` instead of `l_name.decode()` to avoid `UnicodeDecodeError`
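To show why `os.fsdecode()` is the safer choice for paths, here is a small sketch. The byte string is made up for illustration, and the behaviour described assumes a POSIX system, where Python decodes file-system bytes with the `surrogateescape` error handler.

```python
import os

# A made-up path containing a Latin-1 byte (0xE9) that is not valid UTF-8.
RAW_PATH = b"/opt/caf\xe9/libexample.so"


def strict_decode_fails(raw: bytes) -> bool:
    """Return True if plain bytes.decode() (strict UTF-8) rejects the input."""
    try:
        raw.decode()
    except UnicodeDecodeError:
        return True
    return False


def tolerant_decode(raw: bytes) -> str:
    """Decode a path the way Python decodes file names from the OS."""
    return os.fsdecode(raw)
```

On POSIX the string from `tolerant_decode()` can be passed to `os.fsencode()` to get the original bytes back, so it still names the same file.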
/ok to test |
# l_name is the second field, right after l_addr (both pointer-sized)
l_name_field_addr = lm_ptr.value + ctypes.sizeof(ctypes.c_void_p)
l_name_addr = ctypes.c_void_p.from_address(l_name_field_addr).value
I personally find this less intuitive than using the minimal struct definition
I'm very ambivalent myself, but thought it would be useful to have the alternative fully worked out here, for reference.
I'll change back to the more intuitive `LinkMap` approach, because I believe it'll be easier to maintain in the future. I'll work in comments to explain.
Done: commit 49a54c3
…ame. Explain safety constraints in depth.
/ok to test |
Closes #833
Also a step towards resolving #776 — please see comments there.
This PR eliminates `supported_nvidia_libs.EXPECTED_LIB_SYMBOLS` entirely, which is a major simplification in its own right.
Piggy-backed: Minor fix (commit d13ad8e) and cleanup (commit e716f1c).