GPU perf: batched sCMOS kernels, unified retry loop, NVML scheduling by kalidke · Pull Request #14 · JuliaSMLM/SMLMBoxer.jl

kalidke · 2026-02-08T22:29:05Z

Summary

Batched sCMOS GPU kernels via KernelAbstractions with sparse coordinate extraction and in-place DoG — sCMOS GPU throughput 19 → 2,557 images/s (256×256)
Unified GPU retry loop handling NVML polling, TOCTOU context races, and runtime OOM in a single path with jittered backoff
BoxerConfig struct with PSF-aware and direct-control parameter interfaces, config-based calling convention
Memory-aware batching on both CPU and GPU paths with recommend_batch_size() utility
(ROIBatch, BoxesInfo) return tuple with timing, backend, batch, and memory metadata
SMLMData compat bumped to 0.7 (AbstractSMLMConfig/AbstractSMLMInfo inheritance)
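
The jittered backoff mentioned for the unified retry loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the SMLMBoxer.jl implementation: `retry_with_jitter` and its keyword arguments are hypothetical names, and the NVML polling, TOCTOU, and OOM handling described above are collapsed into "any exception thrown by `f`".

```julia
using Random

# Hypothetical helper (not part of SMLMBoxer.jl).
# Calls `f` until it succeeds or `timeout` seconds have elapsed, sleeping
# between attempts with an exponentially growing, randomly jittered delay.
function retry_with_jitter(f; timeout=300.0, base_delay=0.5, max_delay=30.0,
                           rng=Random.default_rng(), sleep_fn=sleep)
    start = time()
    attempt = 0
    while true
        try
            return f()
        catch
            attempt += 1
            elapsed = time() - start
            # Out of time: surface the last error to the caller,
            # which could then fall back to the CPU path.
            elapsed >= timeout && rethrow()
            # Exponential backoff capped at max_delay ...
            delay = min(max_delay, base_delay * 2.0^(attempt - 1))
            # ... scaled by a random factor in [0.5, 1.0) so that competing
            # jobs on a shared GPU do not all retry at the same moment.
            delay *= 0.5 + 0.5 * rand(rng)
            sleep_fn(min(delay, timeout - elapsed))
        end
    end
end
```

The default `timeout=300.0` mirrors the new `auto_timeout` default listed under Breaking Changes; the other defaults are arbitrary choices for the sketch.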

Breaking Changes

getboxes now returns (ROIBatch, BoxesInfo) tuple instead of bare ROIBatch
SMLMData 0.7 required (was 0.5)
auto_timeout default changed from 30s to 300s

Performance (descent, RTX A6000; throughput in images/s)

Image Size	Ideal GPU	sCMOS GPU (before)	sCMOS GPU (after)
128×128	4,564	2,481	4,090
256×256	3,752	19	2,557
512×512	1,553	19	927
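
To put the table in relative terms, the short calculation below derives the before/after speedup of the sCMOS path and its throughput relative to the Ideal GPU column. The numbers are copied from the table; the variable names are illustrative only.

```julia
# Throughput in images/s, copied from the table above.
sizes  = ["128×128", "256×256", "512×512"]
ideal  = [4564, 3752, 1553]
before = [2481, 19, 19]
after  = [4090, 2557, 927]

speedup = after ./ before   # sCMOS GPU, after relative to before
frac    = after ./ ideal    # sCMOS GPU (after) relative to Ideal GPU

for i in eachindex(sizes)
    println(sizes[i], ": ", round(speedup[i]; digits=1), "x faster, ",
            round(Int, 100 * frac[i]), "% of Ideal GPU throughput")
end
```

The speedups come out at roughly 1.6x, 135x, and 49x, and the sCMOS path reaches about 90%, 68%, and 60% of the Ideal GPU throughput at the three image sizes.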

Test plan

Pkg.test() passes (80 tests + benchmarks + GPU wait tests on descent)
Documenter.jl docs build clean
CI passes on GitHub Actions

🤖 Generated with Claude Code

Three GPU optimizations:
- Single 3D kernel launch for variance-weighted sCMOS (eliminates per-frame overhead)
- GPU sparse coordinate extraction via findall (transfers ~1MB vs ~1GB)
- In-place DoG subtraction (saves one full-size GPU allocation, 10x→8x memory)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
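
Two of these optimizations, the in-place DoG subtraction and the sparse coordinate extraction via `findall`, can be illustrated on ordinary CPU arrays. This is a sketch only: `dog_candidates!` is a hypothetical name, the two inputs stand in for already-blurred image stacks, and the real code runs on GPU arrays through KernelAbstractions rather than on `Array`.

```julia
# Hypothetical helper (not part of SMLMBoxer.jl).
# `blur_small` and `blur_large` are two Gaussian-filtered copies of the same
# image stack (height × width × frames). The difference of Gaussians is
# written into `blur_small`, so no third full-size array is allocated.
function dog_candidates!(blur_small, blur_large, threshold)
    blur_small .-= blur_large               # in-place DoG subtraction
    # Keep only the coordinates above threshold: a short vector of
    # CartesianIndex values instead of the full dense stack.
    return findall(>(threshold), blur_small)
end

small = rand(Float32, 64, 64, 8)
large = rand(Float32, 64, 64, 8)
coords = dog_candidates!(small, large, 0.9f0)

# Compare the size of the dense result with the size of the sparse list.
println("dense bytes:  ", sizeof(small))
println("sparse bytes: ", sizeof(coords))
```

Returning only the coordinate list is what keeps the device-to-host transfer small when few pixels exceed the threshold, which is the effect the commit message quantifies as ~1MB versus ~1GB.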

kalidke and others added 7 commits February 7, 2026 15:16

Change auto_timeout default from 30s to 300s (5 min)

fec90bc

Matches GaussMLE convention. GPU jobs should wait longer before falling back to CPU on shared servers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Rewrite README in annotated-constructor style

7bf13ab

Match GaussMLE README pattern: Quick Start with input/detect/output, Detection Modes table, Parameter Interfaces, Output Format tables, real benchmark numbers from descent (RTX A6000). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add CLAUDE.md with architecture and development guidance

b5f5bc9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Update README performance numbers from benchmark run

4d3ffce

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix getboxes docstring attachment, simplify api.md to @autodocs

1a685b0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Bump julia compat to 1.10 (cuDNN 1.4.6 requires it)

3987824

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

kalidke merged commit b57b2fc into main Feb 8, 2026
3 of 4 checks passed

kalidke merged 7 commits into main from feature/scmos-gpu-perf
