[pathfinder] RTLD_DI_LINKMAP-based new implementation of abs_path_for_dynamic_library() #834

Open
wants to merge 9 commits into
base: main

Collaborator

@rwgk rwgk commented Aug 13, 2025

Closes #833

Also a step towards resolving #776 — please see comments there.

This PR eliminates supported_nvidia_libs.EXPECTED_LIB_SYMBOLS entirely, which is a major simplification in its own right.

Piggy-backed: Minor fix (commit d13ad8e) and cleanup (commit e716f1c).

rwgk added 3 commits August 12, 2025 22:43
…m check_if_already_loaded_from_elsewhere(), for consistency with load_with_system_search()
@rwgk rwgk self-assigned this Aug 13, 2025
@rwgk rwgk added the cuda.pathfinder Everything related to the cuda.pathfinder module label Aug 13, 2025
@github-project-automation github-project-automation bot moved this to Todo in CCCL Aug 13, 2025
Contributor

copy-pr-bot bot commented Aug 13, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk
Collaborator Author

rwgk commented Aug 13, 2025

/ok to test

@rwgk rwgk marked this pull request as ready for review August 13, 2025 08:47
Contributor

copy-pr-bot bot commented Aug 13, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@rwgk
Collaborator Author

rwgk commented Aug 13, 2025

/ok to test

@rwgk rwgk requested a review from leofang August 13, 2025 08:48
Comment on lines 17 to 18
LIBDL.dladdr.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
LIBDL.dladdr.restype = ctypes.c_int
LIBDL.dlinfo.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p]
LIBDL.dlinfo.restype = ctypes.c_int
Member

I am a bit nervous about us using either dladdr or dlinfo since none of them is POSIX-compliant. Since dladdr has been used, switching to dlinfo does not make it more "non-compliant", so it's still a step forward. We just have to see if it runs everywhere...

Collaborator Author

since none of them is POSIX-compliant

Do you think the ChatGPT conclusion below is reasonable?

My ChatGPT prompt:


This makes me want to ask:

Does CUDA run on any systems that are not Linux or Windows?

(Windows dynamic loading is handled separately. I.e. basically I'm only asking: does CUDA run on any Unix system that is not Linux?)

Do you know?


ChatGPT said:

Thought for 19s

Short answer: No. Today CUDA is officially supported on Linux (various distros/arches, including aarch64/Jetson) and Windows (including WSL2 Linux environments on Windows). There’s no current macOS runtime support, and no official support for other Unix-like OSes (e.g., *BSD).

So if your scope is “Linux only,” it’s reasonable to lean on glibc-specific behavior (e.g., dlinfo)—CUDA isn’t targeting musl/Alpine or non-Linux Unix platforms.


Comment on lines +21 to +22
# First appeared in 2004-era glibc. Universally correct on Linux for all practical purposes.
RTLD_DI_LINKMAP = 2
Member

Q: Can't we just use RTLD_DI_ORIGIN and get the path back without populating the full struct (since we just need the path)?

Collaborator Author

I reduced the struct to just what we need (commit be20782).

Member

@leofang leofang Aug 14, 2025

I was not aware that ctypes allows incomplete struct definitions. It's FFI-based, so I assume it needs to know the full struct layout/size for ABI-based operations. This is nerve-racking...

Collaborator

I think that because we never allocate the struct ourselves, we never need to know its full size; we only need the offsets of the members we actually read.
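
A minimal sketch of that idea, assuming glibc on Linux. The names `LinkMapPrefix` and `l_name_via_struct` are hypothetical, `libm.so.6` is just a convenient library to load, and `ctypes.CDLL(None)` is assumed to expose dlinfo from the already-loaded libc. Only the first two members of struct link_map are declared; ctypes never allocates the struct, it only reads l_name at its offset through the pointer that dlinfo() fills in.

```python
import ctypes
import os

RTLD_DI_LINKMAP = 2  # same value as in the diff under review

# Assumption: dlinfo is visible among the symbols already loaded into the process.
LIBDL = ctypes.CDLL(None)
LIBDL.dlinfo.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p]
LIBDL.dlinfo.restype = ctypes.c_int


class LinkMapPrefix(ctypes.Structure):
    """Hypothetical partial view: only the leading members up to l_name."""

    _fields_ = (
        ("l_addr", ctypes.c_void_p),  # ElfW(Addr), assumed pointer-sized
        ("l_name", ctypes.c_char_p),  # char *
    )


def l_name_via_struct(lib: ctypes.CDLL) -> str:
    lm_ptr = ctypes.POINTER(LinkMapPrefix)()  # NULL until dlinfo fills it in
    rc = LIBDL.dlinfo(lib._handle, RTLD_DI_LINKMAP, ctypes.byref(lm_ptr))
    if rc != 0 or not lm_ptr:
        raise OSError("dlinfo(RTLD_DI_LINKMAP) failed")
    raw = lm_ptr.contents.l_name  # bytes, or None for a NULL char *
    if not raw:
        raise OSError("l_name is empty")
    return os.fsdecode(raw)


path = l_name_via_struct(ctypes.CDLL("libm.so.6"))
```

The struct is only ever accessed through a pointer owned by the dynamic linker, so declaring a prefix of the real layout is enough as long as the declared members keep their positions.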

Collaborator Author

Thanks @kkraus14 — yes, exactly.

That’s what I had in mind when I pared down the table.

The use of ctypes here is in line with what we do elsewhere (e.g., cuda.cccl). The big difference is that we don’t own struct link_map.

So last night (TH TZ) I started drilling down to be sure we’re on solid ground. Two questions I wanted to answer:

Q1: How stable has struct link_map been over the years?

Q2: Is paring down the ctypes definition for struct link_map valid for our purposes?

Side remark: ChatGPT 5 is vastly more thorough than ChatGPT 4, at the cost of longer waits (I’ve seen >2 min of "thinking" for one question; with ChatGPT 4 most responses came back almost instantly).

Re Q1: I started with (AI-assisted) legwork, looking at the struct link_map source code. This is what I found:

git clone https://sourceware.org/git/glibc.git
commit d66e34cd423425c348bcc83df127dd19711b0b9a
Date:   Tue May 2 06:35:55 1995 +0000
git show d66e34cd423425c348bcc83df127dd19711b0b9a:elf/link.h

struct link_map
  {
    /* These first few members are part of the protocol with the debugger.
       This is the same format used in SVR4.  */

    Elf32_Addr l_addr;		/* Base address shared object is loaded at.  */
    char *l_name;		/* Absolute file name object was found in.  */
    Elf32_Dyn *l_ld;		/* Dynamic section of the shared object.  */
    struct link_map *l_next, *l_prev; /* Chain of loaded objects.  */

    /* All following members are internal to the dynamic linker.
       They may change without notice.  */

    const char *l_libname;	/* Name requested (before search).  */
    Elf32_Dyn *l_info[DT_NUM];	/* Indexed pointers to dynamic section.  */
    const Elf32_Phdr *l_phdr;	/* Pointer to program header table in core.  */
    Elf32_Word l_phnum;		/* Number of program header entries.  */

    /* Symbol hash table.  */
    Elf32_Word l_nbuckets;
    const Elf32_Word *l_buckets, *l_chain;

    unsigned int l_opencount;	/* Reference count for dlopen/dlclose.  */
    enum			/* Where this object came from.  */
      {
	lt_executable,		/* The main executable program.  */
	lt_interpreter,		/* The interpreter: the dynamic linker.  */
	lt_library,		/* Library needed by main executable.  */
	lt_loaded,		/* Extra run-time loaded shared object.  */
      } l_type:2;
    unsigned int l_deps_loaded:1; /* Nonzero if DT_NEEDED items loaded.  */
    unsigned int l_relocated:1;	/* Nonzero if object's relocations done.  */
    unsigned int l_init_called:1; /* Nonzero if DT_INIT function called.  */
    unsigned int l_init_running:1; /* Nonzero while DT_INIT function runs.  */
  };

commit 2642002380aafb71a1d3b569b6d7ebeab3284816
Date:   Wed Jan 1 10:14:45 2025 -0800
git show 2642002380aafb71a1d3b569b6d7ebeab3284816:elf/link.h
struct link_map
  {
    /* These first few members are part of the protocol with the debugger.
       This is the same format used in SVR4.  */

    ElfW(Addr) l_addr;		/* Difference between the address in the ELF
				   file and the addresses in memory.  */
    char *l_name;		/* Absolute file name object was found in.  */
    ElfW(Dyn) *l_ld;		/* Dynamic section of the shared object.  */
    struct link_map *l_next, *l_prev; /* Chain of loaded objects.  */
  };

To make this more concrete, on my Ubuntu 24.04 workstation I get:

cc -E /usr/include/link.h
typedef uint64_t Elf64_Addr;

struct link_map
  {
    Elf64_Addr l_addr;
    char *l_name;
    Elf64_Dyn *l_ld;
    struct link_map *l_next, *l_prev;
  };

Conclusion: The "first few members" of struct link_map have been stable for ~30 years.


Re Q2: Because we want to use ctypes (and not compile a small C helper), we need to rely on pointer arithmetic:

  • dlinfo() gives us a struct link_map pointer.

  • We add offsetof(struct link_map, l_name) and dereference the resulting pointer.

That's exactly what the previous ctypes-based lm_ptr.contents.l_name did. However, ChatGPT suggested an alternative that makes the pointer arithmetic explicit. That is now implemented in

commit ee5d05e

With that, LinkMap is gone completely.
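
For reference, the struct-based read and the explicit pointer arithmetic touch the same memory. A small check, assuming a two-member partial layout (the name `LinkMapPrefix` and the path are made up for illustration) and using a locally built stand-in object so it does not depend on dlinfo():

```python
import ctypes


class LinkMapPrefix(ctypes.Structure):
    # Hypothetical partial layout: l_addr followed by l_name.
    _fields_ = (
        ("l_addr", ctypes.c_void_p),
        ("l_name", ctypes.c_char_p),
    )


# Stand-in for the object the dynamic linker would own (made-up path).
fake = LinkMapPrefix(0, b"/usr/lib/libexample.so.1")
base = ctypes.addressof(fake)

# Struct-based read: ctypes applies the field offset for us.
via_struct = ctypes.pointer(fake).contents.l_name

# Explicit arithmetic: skip one pointer-sized member, load the char *,
# then read the NUL-terminated string it points to.
l_name_field_addr = base + ctypes.sizeof(ctypes.c_void_p)
l_name_addr = ctypes.c_void_p.from_address(l_name_field_addr).value
via_arithmetic = ctypes.string_at(l_name_addr)
```

Both variants rely on the same assumption, namely that l_addr is pointer-sized and l_name directly follows it; the struct form just lets ctypes compute the offset.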

While I was at it, I also ensured the .decode()s in load_dl_linux.py will not raise UnicodeDecodeError (same commit, plus commit 97f1c36), and I added a test to ensure the abs_path is actually absolute, and does in fact exist (commit 5b2012a). That should make it obvious if our assumptions about struct link_map ever get violated.
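
One way to get both properties, sketched under assumptions: the helper names are hypothetical and this is not necessarily what the commits do. On POSIX, os.fsdecode() uses the "surrogateescape" error handler, so undecodable bytes round-trip instead of raising, and a small check rejects paths that are relative or missing.

```python
import os
import sys


def decode_path(raw: bytes) -> str:
    # os.fsdecode never raises UnicodeDecodeError on POSIX: bytes that are
    # invalid in the filesystem encoding become surrogate escapes.
    return os.fsdecode(raw)


def check_abs_path(abs_path: str) -> str:
    # Hypothetical sanity check mirroring the test described above.
    if not os.path.isabs(abs_path):
        raise ValueError(f"not an absolute path: {abs_path!r}")
    if not os.path.isfile(abs_path):
        raise ValueError(f"no such file: {abs_path!r}")
    return abs_path


# Made-up path containing a byte that is not valid UTF-8.
weird = decode_path(b"/opt/lib\xff/libfoo.so")

# The running interpreter serves as a file that is known to exist.
checked = check_abs_path(decode_path(os.fsencode(sys.executable)))
```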


Returns:
The absolute path to the library file, or None if no expected symbol is found
raise OSError(f"abs_path_for_dynamic_library failed for {libname=!r}")
Member

One potential improvement for error handling: when dladdr/dlinfo reports failure, we subsequently call dlerror to get the error message.
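
A rough sketch of that pattern for dlinfo, assuming glibc on Linux. The wrapper name `dlinfo_checked` is hypothetical, and `ctypes.CDLL(None)` is assumed to expose dlinfo and dlerror from the already-loaded libc:

```python
import ctypes

LIBDL = ctypes.CDLL(None)  # assumption: dlinfo/dlerror are visible in the process
LIBDL.dlinfo.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p]
LIBDL.dlinfo.restype = ctypes.c_int
LIBDL.dlerror.argtypes = []
LIBDL.dlerror.restype = ctypes.c_char_p


def dlinfo_checked(handle: int, request: int, out) -> None:
    """Hypothetical wrapper: raise OSError with the dlerror() text on failure."""
    LIBDL.dlerror()  # clear any stale error message first
    if LIBDL.dlinfo(handle, request, out) != 0:
        msg = LIBDL.dlerror()  # bytes, or None if no message is pending
        detail = msg.decode(errors="replace") if msg else "unknown error"
        raise OSError(f"dlinfo failed: {detail}")


lib = ctypes.CDLL("libm.so.6")
out = ctypes.c_void_p()
dlinfo_checked(lib._handle, 2, ctypes.byref(out))  # 2 == RTLD_DI_LINKMAP
```

Calling dlerror() once before the operation avoids reporting a message left over from an earlier, unrelated failure.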

Collaborator Author

Done, also commit be20782.

@rwgk
Collaborator Author

rwgk commented Aug 14, 2025

/ok to test

kkraus14
kkraus14 previously approved these changes Aug 14, 2025
Collaborator

@kkraus14 kkraus14 left a comment

LGTM

@github-project-automation github-project-automation bot moved this from Todo to In Review in CCCL Aug 14, 2025
@rwgk
Collaborator Author

rwgk commented Aug 15, 2025

/ok to test

kkraus14
kkraus14 previously approved these changes Aug 15, 2025
Comment on lines 56 to 58
# l_name is the second field, right after l_addr (both pointer-sized)
l_name_field_addr = lm_ptr.value + ctypes.sizeof(ctypes.c_void_p)
l_name_addr = ctypes.c_void_p.from_address(l_name_field_addr).value
Collaborator

I personally find this less intuitive than using the minimal struct definition

Collaborator Author

I'm very ambivalent myself, but thought it would be useful to have the alternative fully worked out here, for reference.

I'll change back to the more intuitive LinkMap approach, because I believe it'll be easier to maintain in the future. I'll work in comments to explain.

Collaborator Author

Done: commit 49a54c3

@rwgk
Collaborator Author

rwgk commented Aug 15, 2025

/ok to test

Labels
cuda.pathfinder Everything related to the cuda.pathfinder module
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

[BUG]: KeyError in load_dl_linux.py:47 EXPECTED_LIB_SYMBOLS[libname]
3 participants