
SIGSEGV in RtIndex_c::GetIndexFiles when querying table.@files under write load #4578

@kino505

Bug Description:

Summary
searchd crashes with SIGSEGV (signal 11) when an internal SELECT file, size FROM <table>.@files query is executed against an actively written real-time table. The query appears to be issued by Manticore Buddy (it is not generated by our application). The crash happens reliably under sustained write load and has been observed multiple times on different nodes of the cluster within a single 24-hour test window.
Environment

Manticore version: 25.13.1 e605e80c5@26051415 dev (columnar 13.2.6 0b37507@26042817, secondary 13.2.6, knn 13.2.6, embeddings 1.1.1)
Buddy version: v3.46.2+26051413-42296708-dev
Built: Linux x86_64 (jammy), cross-compiled with Clang 16.0.6
Cluster: 3-node Galera-replicated cluster, all nodes Primary/Synced
Platform: Kubernetes (EKS)
Workload: sustained ~270 RPS SELECT + ~9 RPS write (REPLACE / UPDATE / DELETE) per node on multiple real-time tables, largest is product_variant (~5M docs)

What happened
Across multiple nodes in a 24-hour window, searchd segfaulted while serving a query to the @files virtual table. The crash dump consistently shows the same SQL and the same stack trace.
Crashed query
SELECT file, size FROM product_variant.@files;SHOW META
This query is not generated by our application — we replay only the application's query.log and there are no @files references in it. The query is presumably issued internally by Manticore Buddy (e.g. for the backup or metrics plugin).
Backtrace
------- FATAL: CRASH DUMP -------
--- crashed SphinxQL request dump ---
SELECT file, size FROM product_variant.@files;SHOW META
--- request dump end ---
--- local index:product_variant.@files
Manticore 25.13.1 e605e80c5@26051415 dev ...
Handling signal 11

Trying system backtrace:
searchd(_Z12sphBacktraceib+0x227)[0x560a13ff4167]
searchd(_Z11HandleCrashi+0x2e3)[0x560a13cde8c3]
/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7f81ee436330]
searchd(+0x129987d)[0x560a13ef187d]
searchd(_ZNK13CSphIndex_VLN13GetIndexFilesERN3sph8Vector_TI10CSphStringNS0_13DefaultCopy_TIS2_EENS0_14DefaultRelimitENS0_16DefaultStorage_TIS2_EEEES9_PK17FilenameBuilder_i+0x87)[0x560a13ef10a7]
searchd(_ZNK9RtIndex_c13GetIndexFilesERN3sph8Vector_TI10CSphStringNS0_13DefaultCopy_TIS2_EENS0_14DefaultRelimitENS0_16DefaultStorage_TIS2_EEEES9_PK17FilenameBuilder_i+0x1ba)[0x560a14108c7a]
searchd(_Z17HandleSelectFilesR11RowBuffer_iRK10CSphStringS3_+0x116)[0x560a13e69166]
Demangled, the relevant frames are:

CSphIndex_VLN::GetIndexFiles(...)
RtIndex_c::GetIndexFiles(...)
HandleSelectFiles(RowBuffer_i&, CSphString const&, CSphString const&)
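
For anyone comparing crash dumps across nodes, here is a minimal sketch of how the frames above can be pulled out of a backtrace. It assumes the glibc-style `binary(symbol+0xoffset)[0xaddress]` layout seen in this dump. The helper names `parse_frames` and `demangle` are hypothetical, and demangling only happens if `c++filt` is installed.

```python
import re
import shutil
import subprocess

# Matches frames shaped like: binary(symbol+0xoffset)[0xaddress]
FRAME_RE = re.compile(
    r"^(?P<binary>[^(\s]+)\((?P<symbol>[^+)]*)\+0x(?P<offset>[0-9a-f]+)\)"
    r"\[0x(?P<addr>[0-9a-f]+)\]$"
)

def parse_frames(backtrace: str):
    """Return (binary, symbol, offset) per frame; symbol is '' when unresolved."""
    frames = []
    for line in backtrace.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append((m["binary"], m["symbol"], int(m["offset"], 16)))
    return frames

def demangle(symbol: str) -> str:
    """Demangle with c++filt if available; otherwise return the symbol unchanged."""
    tool = shutil.which("c++filt")
    if not tool or not symbol:
        return symbol
    out = subprocess.run([tool, symbol], capture_output=True, text=True, check=True)
    return out.stdout.strip()
```

Frames with an empty symbol, such as `searchd(+0x129987d)`, are the ones that would need a symbolized build or `addr2line` to resolve further.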

Frequency / reproducibility
Within a single 24-hour test window we observed:

Worker-1: 2 crashes from this exact @files query
Worker-2: 4 crashes from this exact @files query
Worker-0: 0 crashes from @files (but crashed once from a different bug, filed as #4577)

The pattern is reproducible: under sustained write load, one of the nodes eventually crashes on this query.
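
The per-node counts above can be tallied from the searchd logs with something like the following sketch. It assumes the request-dump markers appear exactly as in the crash dump shown earlier. The function names and the `logs_by_node` mapping are hypothetical.

```python
from collections import Counter

DUMP_START = "--- crashed SphinxQL request dump ---"
DUMP_END = "--- request dump end ---"

def crashed_queries(log_text: str) -> list[str]:
    """Collect the query text found between each pair of request-dump markers."""
    queries, current = [], None
    for line in log_text.splitlines():
        if DUMP_START in line:
            current = []                      # a new crash dump begins
        elif DUMP_END in line and current is not None:
            queries.append("\n".join(current).strip())
            current = None                    # dump closed
        elif current is not None:
            current.append(line)              # inside a dump: keep the query line
    return queries

def tally(logs_by_node: dict[str, str]) -> dict[str, Counter]:
    """Count crashes per distinct crashed query, for each node's log text."""
    return {node: Counter(crashed_queries(text)) for node, text in logs_by_node.items()}
```

Grouping by the crashed query text makes it easy to separate the @files crashes from unrelated ones such as #4577.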
What was running at the time
The crashes happen while real-time disk chunks are being actively flushed. Lines like the following appear in the log very close to each crash:
[time] [tid] rt: table product_variant: diskchunk N(M), segments K saved in
0.932771 (0.938771) sec [attrs=836ms deadmap=0ms docs=94ms ...]
RAM saved/new 45507875/0 ratio 0.950000 (soft limit 127506841, conf limit 134217728)

[time] [tid] rt: table product_variant: merged chunks .../product_variant.169,
.../product_variant.184 to .../product_variant.185 in 5s ...
Suspicion: GetIndexFiles() enumerates .spa, .spi, .sps, .spb, etc. files from the RT index, but when disk chunks are being merged / renamed concurrently, the file list it captures references files that have already been deleted / superseded — leading to a null-deref or use-after-free.
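
To make the suspected hazard concrete, here is a small Python analogy. It is not Manticore code and the file names are made up. It snapshots a file list, simulates a merge that removes and renames files, and then uses the stale snapshot. In Python the stale access raises `FileNotFoundError`; in C++ the equivalent mistake on freed chunk objects could instead surface as a segfault.

```python
import os
import tempfile

def snapshot(directory):
    """Capture the file list once, like a one-time enumeration of chunk files."""
    return sorted(os.path.join(directory, name) for name in os.listdir(directory))

def sizes_unchecked(paths):
    """Trusts the snapshot; raises FileNotFoundError if a file has vanished."""
    return {os.path.basename(p): os.path.getsize(p) for p in paths}

def sizes_tolerant(paths):
    """Skips entries that disappeared between the snapshot and the size lookup."""
    result = {}
    for p in paths:
        try:
            result[os.path.basename(p)] = os.path.getsize(p)
        except FileNotFoundError:
            continue
    return result

def demo():
    with tempfile.TemporaryDirectory() as d:
        for name in ("t.169.spa", "t.184.spa"):   # hypothetical chunk files
            with open(os.path.join(d, name), "wb") as f:
                f.write(b"x" * 10)
        paths = snapshot(d)
        # Simulate a merge: one source chunk is deleted, the other is superseded.
        os.remove(os.path.join(d, "t.169.spa"))
        os.rename(os.path.join(d, "t.184.spa"), os.path.join(d, "t.185.spa"))
        try:
            sizes_unchecked(paths)
            stale_detected = False
        except FileNotFoundError:
            stale_detected = True
        return stale_detected, sizes_tolerant(paths), sizes_tolerant(snapshot(d))
```

If the suspicion is right, the fix on the searchd side would presumably involve holding a reference to the chunk set for the duration of the enumeration, but that is a guess from the outside.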
Reproducer
Sustained write load against a real-time table for several hours, while the Buddy plugin that polls @files is enabled. We don't yet have a minimal reproducer because we don't know which Buddy plugin issues the query — knowing that would let us either trigger it on demand or temporarily disable it as a workaround.
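
Until the responsible plugin is identified, one way to try triggering the crash on demand is to poll the same statement ourselves while the write load is running. The sketch below assumes the statement can be issued directly through the MySQL-protocol listener (port 9306 by default) using the stock `mysql` client. We have not confirmed that this path behaves the same as the Buddy-issued query. The helper names are hypothetical.

```python
import shutil
import subprocess
import time

def files_query(table: str) -> str:
    """Build the statement seen in the crash dump for a given table name."""
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    return f"SELECT file, size FROM {table}.@files"

def mysql_command(table: str, host: str = "127.0.0.1", port: int = 9306) -> list[str]:
    """Command line for the mysql client; host and port are assumptions."""
    return ["mysql", "-h", host, "-P", str(port), "-e", files_query(table)]

def poll(table: str, interval_s: float = 1.0, iterations: int = 60) -> int:
    """Issue the query repeatedly; return how many attempts failed."""
    if shutil.which("mysql") is None:
        raise RuntimeError("mysql client not found on PATH")
    failures = 0
    for _ in range(iterations):
        proc = subprocess.run(mysql_command(table), capture_output=True, text=True)
        if proc.returncode != 0:
            failures += 1          # e.g. connection refused after searchd died
        time.sleep(interval_s)
    return failures

# Example (needs a running searchd): poll("product_variant", interval_s=0.5)
```

A non-zero failure count only says the client call failed; the searchd log still has to be checked for the crash dump to confirm it is the same stack.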
Workaround for users hitting this
None known: the query is internal, and applications don't trigger it directly. If the Manticore team can confirm which Buddy plugin periodically issues SELECT file, size FROM <table>.@files, users could disable that plugin via the plugins config while waiting for a fix.
Questions for the team

Which Buddy plugin issues SELECT file, size FROM <table>.@files: backup, metrics, or something else?
Is there a way to disable that single plugin (we see plugins list: core: empty-string, backup, emulate-elastic, fuzzy, create-table, create-cluster, drop, insert, alias, select, replace-select, show, plugin, test, alter-column, alter-distributed-table, alter-rename-table, modify-table, knn, replace, queue, sharding, update, autocomplete, cli-table, distributed-insert, truncate, metrics, conversational-search)?
Is the @files virtual table expected to be safe under concurrent writes, or is this a known limitation?

Additional context
We have a shadow-replay rig that reproduces this reliably and full searchd log captures from each crash event. Happy to share privately if helpful.
Related crash (separate issue): SIGSEGV in Threads::Coro::RWLock_c::Unlock during UPDATE on attribute columns — filed as #4577.

Manticore Search Version:

dev-25.13.1

Operating System Version:

Ubuntu 22.04

Have you tried the latest development version?

Yes

Internal Checklist:

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

Details
  • Implementation completed
  • Tests developed
  • Documentation updated
  • Documentation reviewed
