
SIGSEGV when using DuckDB and RocksDB #13092

Open
loicmathieu opened this issue Oct 25, 2024 · 13 comments

Comments

@loicmathieu

Expected behavior

Using both RocksDB (via Kafka Streams) and the DuckDB JDBC driver works.

Actual behavior

When both RocksDB (via Kafka Streams) and the DuckDB JDBC driver are on the classpath, the JVM crashes with a SIGSEGV as soon as we try to use the DuckDB driver.

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f523603fd60, pid=37746, tid=39346
#
# JRE version: OpenJDK Runtime Environment Temurin-17.0.5+8 (17.0.5+8) (build 17.0.5+8)
# Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.5+8 (17.0.5+8, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
# Problematic frame:
# C  0x00007f523603fd60
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to <redacted>)
#
# An error report file with more information is saved as:
# <redacted>
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

The exact same code on the same JVM and the same OS will not SIGSEGV if the RocksDB native library is not loaded.

This has been raised on the DuckDB side, but they asked us to raise it on the RocksDB side as well; see duckdb/duckdb-java#14

Forcing libstdc++ to be pre-loaded fixes the issue, but since we deliver a product that users install in environments we do not manage, this is not a viable solution for us.

LD_PRELOAD="/lib/x86_64-linux-gnu/libstdc++.so.6"
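For anyone who needs the workaround in the meantime, a minimal launcher sketch follows. It only picks the first libstdc++ copy that exists on the machine. The helper name `find_first_existing` and the second and third candidate paths are assumptions for illustration; only the first path comes from this report.

```shell
#!/usr/bin/env bash
# Print the first existing file among the given candidate paths.
# Returns non-zero when none of them exists.
find_first_existing() {
  local candidate
  for candidate in "$@"; do
    if [ -e "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Hypothetical usage: preload libstdc++ only when a copy is found.
# The second and third paths are guesses for other distributions.
# lib=$(find_first_existing \
#   /lib/x86_64-linux-gnu/libstdc++.so.6 \
#   /usr/lib64/libstdc++.so.6 \
#   /usr/lib/libstdc++.so.6) && export LD_PRELOAD="$lib"
# exec java -jar app.jar
```

This does not remove the underlying symbol clash; it only makes sure the system libstdc++ is loaded first.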

Steps to reproduce the behavior

This is enough to trigger the crash:
https://github.com/Mause/duckdb_rocksdb_crash/blob/main/src/test/java/com/mycompany/app/AppTest.java

@rhubner
Contributor

rhubner commented Oct 25, 2024

Hello @loicmathieu,

Thanks for reporting this bug. Unfortunately, I'm afraid I won't be able to help much. In the hs_err_pid log you posted on the DuckDB issue, the stack trace clearly shows the crash is in DuckDB native code, and it looks like an illegal memory access. As a first step, I suggest running the JVM with the -ea -Xcheck:jni parameters. -Xcheck:jni in particular once helped me figure out that I had forgotten to call NewGlobalRef for one of my variables.

The next step could be to compile the DuckDB driver with AddressSanitizer.

If you think the problem is in RocksDB, can you please provide more information: the environment where you are running your code, the name and version of the Linux distribution, the RocksDB version, the hs_err_pid log, ...

Radek

@loicmathieu
Author

@rhubner The DuckDB team asked us to open an issue here, as they are not sure whether the issue is in DuckDB or RocksDB.

@shoffmeister seems to have debugged this more deeply; maybe you'll find this comment more explanatory:
duckdb/duckdb-java#14 (comment)

@shoffmeister

I have lost most of my C++ knowledge, I guess, but I wonder why the RocksDB dynamic shared object for JNI, bundled inside the distributed JAR, exposes so many things in the std:: namespace.

Specifically,

unzip -o rocksdbjni-9.6.1.jar librocksdbjni-linux64.so -d . &&  nm --demangle ./librocksdbjni-linux64.so | cut -c 18- | grep 'T std::' | sort

on x64 Linux yields a long list of items exported from that DSO. Because of the way ELF linking works, all these symbols leak into the global namespace (IIRC), and that then results in symbol resolution either into libstdc++ or into RocksDB, with mixed results.

Example:

❯ unzip -o rocksdbjni-9.6.1.jar librocksdbjni-linux64.so -d . &&  nm --demangle ./librocksdbjni-linux64.so | cut -c 18- | grep 'T std::' | sort
Archive:  rocksdbjni-9.6.1.jar
  inflating: ./librocksdbjni-linux64.so  
...
T std::bad_array_new_length::~bad_array_new_length()
...
T std::random_device::_M_fini()
T std::random_device::_M_getval()
T std::random_device::_M_getval_pretr1()
T std::random_device::_M_init_pretr1(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
T std::random_device::_M_init_pretr1(std::string const&)
T std::random_device::_M_init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
T std::random_device::_M_init(std::string const&)
...

I do not see why std::random functions should be exported by RocksDB - and it is that which seems to be giving DuckDB quite a bit of a headache.
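The listing above can be turned into a small reusable filter. The sketch below reads `nm --demangle` output on stdin and keeps only the global text symbols (type `T`) whose demangled name starts with `std::`. The function name `exported_std_symbols` is made up for illustration.

```shell
#!/usr/bin/env bash
# Read `nm --demangle` output on stdin and print the names of global
# text-section symbols (type T) that live in namespace std, sorted.
exported_std_symbols() {
  awk '$2 == "T" && $3 ~ /^std::/ {
         sub(/^[^ ]+ +[^ ]+ +/, "")   # strip the address and type columns
         print
       }' | sort
}
```

A possible invocation is `nm --demangle --defined-only ./librocksdbjni-linux64.so | exported_std_symbols`. Adding `-D` to `nm` restricts the listing to the dynamic symbol table, which is the part the dynamic linker actually consults.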

@shoffmeister

shoffmeister commented Oct 25, 2024

For additional commentary - this is a JNI "thing", which I interpret to be a "plugin-like addon".

As such, I wonder why this DSO needs to export anything at all beyond the absolute minimum required to interface to the Java JNI wrapper (T Java_org_rocksdb_*, I'd guess, plus init/finit)?

By doing so, the complete process ELF namespace gets polluted with symbols from the plugin, with possible side-effects, for lack of isolation. This then is the ELF variant of the Microsoft Windows DLL hell (https://en.wikipedia.org/wiki/DLL_Hell)
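One common way to get that minimal export surface with GNU ld is a version script. The sketch below writes one that keeps only JNI entry points global. It is not taken from the RocksDB build; the function name and the output file name are hypothetical, and `JNI_OnLoad`/`JNI_OnUnload` are listed only in case the library defines them (patterns that match nothing are harmless).

```shell
#!/usr/bin/env bash
# Write a GNU ld version script to the given path. It keeps JNI entry
# points global and makes every other symbol local to the shared object.
write_jni_version_script() {
  cat > "$1" <<'EOF'
{
  global:
    Java_*;
    JNI_OnLoad;
    JNI_OnUnload;
  local:
    *;
};
EOF
}
```

The script would then be passed at link time, for example `write_jni_version_script jni-exports.map` followed by adding `-Wl,--version-script=jni-exports.map` to the link command. Unlike `-fvisibility=hidden`, which only affects code compiled with that flag, a version script applies to every symbol in the output, including symbols pulled in from pre-built static archives.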

@rhubner
Contributor

rhubner commented Oct 25, 2024

Hello @shoffmeister,

thanks for the details; I was obviously looking at this the wrong way. I will check what we can do. Recently we had a request from the Debian maintainers to reduce the number of exported functions, which would help them track ABI changes, but it didn't work with the rest of the RocksDB tools.

Maybe in this case we can limit the exported symbols to Java_org_rocksdb_ only, as all header files generated with javah already contain the JNIEXPORT macro. Let me check with my colleagues.

Radek

@rhubner
Contributor

rhubner commented Oct 25, 2024

Hello @shoffmeister @loicmathieu,

Thanks for the example. I was able to reproduce the crash, and you are right: when I compile RocksDB with hidden symbols, it starts to work. You can check PR #12944, where I previously implemented hiding of private symbols. Unfortunately, I don't know whether we will be able to merge this and release RocksDB Java with hidden symbols. At the moment it still breaks some things.

Can you please try building it and let me know if it works in your environment?

cmake:

cmake -DCMAKE_BUILD_TYPE=Release -DJNI=ON -S . -B build -DWITH_GFLAGS=ON -DWITH_TESTS=OFF -DWITH_BENCHMARK_TOOLS=OFF -DWITH_TOOLS=OFF
cd build
make -j <number of CPU> rocksdbjava 

gcc:

HIDE_PRIVATE_SYMBOLS=1 make -j <number of CPU> rocksdbjava

Radek

@shoffmeister

Your branch yields

rocksdb/build/java on  main via ☕ v17.0.13 
❯ readelf --demangle --wide --syms ./librocksdbjni-linux64.so | grep -E '(random_device|^Symbol)'
Symbol table '.dynsym' contains 13323 entries:
   143: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND std::random_device::_M_fini()@GLIBCXX_3.4.18 (23)
   217: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND std::random_device::_M_getval()@GLIBCXX_3.4.18 (23)
   279: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND std::random_device::_M_init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)@GLIBCXX_3.4.21 (13)
Symbol table '.symtab' contains 21952 entries:
 14678: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND _ZNSt13random_device7_M_finiEv@GLIBCXX_3.4.18
 17799: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND _ZNSt13random_device9_M_getvalEv@GLIBCXX_3.4.18
 20018: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND _ZNSt13random_device7_M_initERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE@GLIBCXX_3.4.21

This will be interesting to try (later), with UND now being flagged on _ZNSt13random_device7_M_initERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE@GLIBCXX_3.4.21

@shoffmeister

@rhubner many thanks for supplying the build.

After running

cp /git/github.com/evolvedbinary/rocksdb/build/java/rocksdbjni-9.8.0-linux64.jar .

mvn install:install-file \
    -Dfile=rocksdbjni-9.8.0-linux64.jar \
    -DgroupId=org.rocksdb \
    -DartifactId=rocksdbjni \
    -Dversion=9.8.0-SNAPSHOT \
    -Dpackaging=jar

I could plug

    <dependency>
      <groupId>org.rocksdb</groupId>
      <artifactId>rocksdbjni</artifactId>
      <version>9.8.0-SNAPSHOT</version>
    </dependency>

into the pom.xml and rebuild a fat JAR successfully.

Alas, when running LD_DEBUG=symbols LD_DEBUG_OUTPUT=debug.log java -jar target/my-app-1.0-SNAPSHOT-jar-with-dependencies.jar

  • on my main "local" development system I get
    free(): double free detected in tcache 2
    Aborted (core dumped)
    
  • on the containerized reproducing system I get Exception in thread "main" java.lang.UnsatisfiedLinkError: /tmp/librocksdbjni15379998538951915121.so: librocksdb.so.9: cannot open shared object file: No such file or directory
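To see which library a particular symbol actually resolves to, the glibc loader can also be run with `LD_DEBUG=bindings` instead of `LD_DEBUG=symbols`. The sketch below summarizes such a log. It assumes the log lines have the form ``binding file A [0] to B [0]: normal symbol `name' [VERSION]``; the function name `bindings_for` is made up for illustration.

```shell
#!/usr/bin/env bash
# Read an LD_DEBUG=bindings log on stdin. For each binding of a symbol
# whose name contains the given text, print the object that requested
# the symbol and the object it was resolved to.
bindings_for() {
  awk -v needle="$1" '
    /binding file / && index($0, needle) {
      line = $0
      sub(/^.*binding file /, "", line)          # drop the pid prefix
      split(line, parts, " to ")
      from = parts[1]; sub(/ \[[0-9]+\]$/, "", from)
      to = parts[2];   sub(/ \[[0-9]+\]:.*$/, "", to)
      print from " -> " to
    }
  '
}
```

With `LD_DEBUG_OUTPUT=debug.log` the loader appends the process id to the file name, so a possible invocation is `cat debug.log.* | bindings_for random_device`. Symbol names in the log are mangled, so a substring such as `random_device` is easier to search for than a demangled signature.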

For context, my "local" development system has a RocksDB package installed (https://archlinux.org/packages/extra/x86_64/rocksdb/); this would explain that I can launch the process there.

So, I see two real problems:

  • free(): double free detected in tcache 2 - I guess this is a RocksDB implementation challenge
  • librocksdb.so.9: cannot open shared object file is due to the build procedure (from above) yielding a JNI binary that does not seem suitable for use, as it links against the DSO inside the build tree via the embedded RUNPATH:
    unzip rocksdbjni-9.8.0-linux64.jar
    
    ldd librocksdbjni-linux64.so
        linux-vdso.so.1 (0x000075818638f000)
        librocksdb.so.9 => /git/github.com/evolvedbinary/rocksdb/build/librocksdb.so.9 (0x0000758184a00000)
        libgflags.so.2.2 => /usr/lib/libgflags.so.2.2 (0x00007581857a0000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x0000758184600000)
        libm.so.6 => /usr/lib/libm.so.6 (0x0000758184911000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x0000758185772000)
        libc.so.6 => /usr/lib/libc.so.6 (0x000075818440f000)
        /usr/lib64/ld-linux-x86-64.so.2 (0x0000758186391000)
    
    readelf -d librocksdbjni-linux64.so 
    
    Dynamic section at offset 0xb26e58 contains 32 entries:
      Tag        Type                         Name/Value
     0x0000000000000001 (NEEDED)             Shared library: [librocksdb.so.9]
     0x0000000000000001 (NEEDED)             Shared library: [libgflags.so.2.2]
     0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
     0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
     0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
     0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
     0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]
     0x000000000000000e (SONAME)             Library soname: [librocksdbjni-linux64.so]
     0x000000000000001d (RUNPATH)            Library runpath: [/git/github.com/evolvedbinary/rocksdb/build]
    

This effectively makes it impossible for me to test the symbol visibility changes inside the reproducing container.
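A quick check for this kind of leak can be scripted before packaging the JAR. The sketch below reads `readelf -d` output on stdin and prints any embedded RPATH or RUNPATH directories; the function name `embedded_search_paths` is hypothetical.

```shell
#!/usr/bin/env bash
# Read `readelf -d` output on stdin and print any RPATH/RUNPATH
# directories, one per line. Prints nothing when neither entry exists.
embedded_search_paths() {
  awk '/\((RPATH|RUNPATH)\)/ {
         if (match($0, /\[.*\]/))
           print substr($0, RSTART + 1, RLENGTH - 2)   # text inside [ ]
       }' | tr ':' '\n'
}
```

For example, `readelf -d librocksdbjni-linux64.so | embedded_search_paths` would print the build directory for the binary shown above and nothing for a library without such an entry.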

I guess updated build instructions / procedures might help to produce a usable JAR?

I'll be glad to rebuild and give that another try.

@rhubner
Contributor

rhubner commented Oct 26, 2024

Hello @shoffmeister

librocksdb.so.9 => /git/github.com/evolvedbinary/rocksdb/build/librocksdb.so.9 (0x0000758184a00000)

That's very strange; I have never experienced this behavior. RocksDB should be statically compiled into librocksdbjava. Can you please check your build system, or send me the exact commands you used for compilation?

free(): double free detected in tcache 2

I have experienced this error a couple of times but wasn't able to find exactly where the problem is. It happens only under specific conditions, and it didn't affect our build and tests.

Radek

@shoffmeister

Hi @rhubner

https://github.com/shoffmeister/duckdb_rocksdb_crash/blob/main/rebuild-rocksdb.bash is my build setup; in the repo at large I am trying to persist my knowledge.

Locally I am using a directory structure

  • github.com
    • shoffmeister/
    • evolvedbinary/

If you mirror that, all of that should be working fine for you.

My "local" development system is Arch Linux.

free(): double free detected in tcache 2

That kills the Java process with a core dump on my side, though.

@shoffmeister

@rhubner I have managed to get a reproducible build up and running, with changes to https://github.com/shoffmeister/duckdb_rocksdb_crash/blob/main/rebuild-rocksdb.bash

That reproducible build (now) creates a JNI which is statically linked - so success on that front.

The script https://github.com/shoffmeister/duckdb_rocksdb_crash/blob/main/reproduce.bash can be used to test the self-built JAR with the self-built JNI shared object inside.

This should print

Wombat connected
mkdir: cannot create directory ‘/root’: Permission denied
Can not write to /root/.m2/copy_reference_file.log. Wrong volume permissions? Carrying on ...
Wombat connected

so, effectively, print "Wombat connected" twice (and some docker-related noise in between).

I have full success with a self-built binary.

Looking at the self-built JNI itself,

❯ readelf --dyn-syms --demangle --wide ${EXTRACT_DIR}/${JNI_SO} | grep random_device

   143: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND std::random_device::_M_fini()@GLIBCXX_3.4.18 (23)
   217: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND std::random_device::_M_getval()@GLIBCXX_3.4.18 (23)
   282: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND std::random_device::_M_init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)@GLIBCXX_3.4.21 (13)

also shows the success of your initiative in #12944

@rhubner
Contributor

rhubner commented Nov 4, 2024

Hello @shoffmeister,
I'm glad that it works for you. I did some testing, and it looks like hiding symbols doesn't resolve the problem in our CentOS 6 build container. We compile on CentOS 6 for compatibility with old glibc versions. This looks like a tougher problem than I thought.

What is even stranger: when I compile on Ubuntu 24.04, it works even without hiding symbols. I suspect the linker on CentOS 6, but it's only a suspicion; I need to do more research.

Radek

@rhubner
Contributor

rhubner commented Nov 5, 2024

Hello @loicmathieu @shoffmeister,

I did a little more testing, and it doesn't look good.
I compiled librocksdbjava with hidden symbols in our CentOS 6 container, and it produced a library that still contains these symbols:

$ readelf --dyn-syms --demangle --wide java/target/librocksdbjni-linux64.so | grep random_device
  1060: 00000000008311f0   123 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_getval()
  1274: 00000000008370f0   176 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_init(std::string const&)
  1589: 0000000000831120   176 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  1985: 00000000008311d0    18 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_fini()
  2065: 00000000008314c0    14 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_getval_pretr1()
  2399: 00000000008312c0   120 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_init_pretr1(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
  2988: 00000000008371a0   158 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_init_pretr1(std::string const&)

$ readelf --dyn-syms --demangle --wide java/target/librocksdbjni-linux64.so | wc -l
3898

Original librocksdbjni from Maven Central:

$ readelf --dyn-syms --demangle --wide librocksdbjni-linux64.so | grep random_device
  1620: 0000000000a8c8e0   158 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_init_pretr1(std::string const&)
  5259: 0000000000a86930   123 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_getval()
  6234: 0000000000a86c00    14 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_getval_pretr1()
 10764: 0000000000a86a00   120 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_init_pretr1(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
 11892: 0000000000a8c830   176 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_init(std::string const&)
 15068: 0000000000a86910    18 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_fini()
 16612: 0000000000a86860   176 FUNC    GLOBAL DEFAULT   12 std::random_device::_M_init(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)


$ readelf --dyn-syms --demangle --wide librocksdbjni-linux64.so | wc -l
18201
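The difference that matters in these listings is the Ndx column: a number means the library carries and exports its own definition, while UND means the symbol is imported from another object such as the system libstdc++. A rough sketch for counting both cases follows; the function name `count_defined_vs_undefined` is made up for illustration.

```shell
#!/usr/bin/env bash
# Read `readelf --dyn-syms --demangle --wide` output on stdin and count
# the symbols whose line contains the given text, split into those
# defined in the library (numeric Ndx) and those imported (Ndx = UND).
count_defined_vs_undefined() {
  awk -v needle="$1" '
    index($0, needle) && $7 == "UND"     { und++ }
    index($0, needle) && $7 ~ /^[0-9]+$/ { def++ }
    END { printf "defined=%d undefined=%d\n", def, und }
  '
}
```

A possible invocation is `readelf --dyn-syms --demangle --wide librocksdbjni-linux64.so | count_defined_vs_undefined random_device`. For a build that behaves like the self-built one earlier in this thread, the defined count would be expected to be zero.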

As you can see, it removed some symbols from the output library, but it didn't remove the std::random_device symbols. I think this must be something strange in our build system.

Build tools:

gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)
GNU ld version 2.28-11.el6
libstdc++.so.6.0.13

Maybe C++ experts @pdillinger or @ajkr can help.

Radek

cc: @adamretter @alanpaxton
