
GPU perf: batched sCMOS kernels, unified retry loop, NVML scheduling#14

Merged
kalidke merged 7 commits into main from feature/scmos-gpu-perf
Feb 8, 2026

kalidke commented Feb 8, 2026

Summary

  • Batched sCMOS GPU kernels via KernelAbstractions with sparse coordinate extraction and in-place DoG — sCMOS GPU throughput 19 → 2,557 images/s (256×256)
  • Unified GPU retry loop handling NVML polling, TOCTOU context races, and runtime OOM in a single path with jittered backoff
  • BoxerConfig struct with PSF-aware and direct-control parameter interfaces, config-based calling convention
  • Memory-aware batching on both CPU and GPU paths with recommend_batch_size() utility
  • (ROIBatch, BoxesInfo) return tuple with timing, backend, batch, and memory metadata
  • SMLMData compat bumped to 0.7 (AbstractSMLMConfig/AbstractSMLMInfo inheritance)
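
To make the unified retry loop concrete, here is a minimal Python sketch of one retry path with jittered exponential backoff and a CPU fallback once the timeout is exhausted. It is an illustration only, not the package's Julia implementation: `GpuBusy`, `GpuOutOfMemory`, `run_with_retry`, and every parameter name are hypothetical stand-ins for the NVML-busy, context-race, and OOM conditions listed above.

```python
import random
import time


class GpuBusy(Exception):
    """Hypothetical: device reported busy, or lost a context race."""


class GpuOutOfMemory(Exception):
    """Hypothetical: allocation failed at runtime."""


def run_with_retry(task, *, timeout_s=300.0, base_delay_s=0.5,
                   max_delay_s=10.0, fallback=None,
                   sleep=time.sleep, clock=time.monotonic,
                   rng=random.random):
    """Run `task` on the GPU, retrying every transient failure in one loop.

    Returns (result, backend), where backend is "gpu" or "cpu".
    """
    deadline = clock() + timeout_s
    attempt = 0
    while True:
        try:
            return task(), "gpu"
        except (GpuBusy, GpuOutOfMemory):
            attempt += 1
            # Exponential backoff, capped at max_delay_s.
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            # Jitter in [0.5, 1.5) so competing jobs do not retry in lockstep.
            delay *= 0.5 + rng()
            if clock() + delay > deadline:
                if fallback is None:
                    raise  # no CPU path available: surface the last error
                return fallback(), "cpu"
            sleep(delay)
```

Funnelling all failure modes through one `except` clause is what keeps this a single path: the caller sees only a result and the backend that produced it.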

Breaking Changes

  • getboxes now returns a (ROIBatch, BoxesInfo) tuple instead of a bare ROIBatch
  • SMLMData 0.7 required (was 0.5)
  • auto_timeout default changed from 30s to 300s

Performance (descent, RTX A6000)

| Image size | Ideal GPU (images/s) | sCMOS GPU, before (images/s) | sCMOS GPU, after (images/s) |
|---|---|---|---|
| 128×128 | 4,564 | 2,481 | 4,090 |
| 256×256 | 3,752 | 19 | 2,557 |
| 512×512 | 1,553 | 19 | 927 |
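
For a quick read of these numbers, the short Python sketch below derives the before/after speedup and the fraction of ideal-GPU throughput reached at each size. The figures are copied from the table; the names `BENCH` and `summarize` are made up for this illustration.

```python
# Throughput in images/s, copied from the performance table above.
BENCH = {
    "128x128": {"ideal": 4564, "before": 2481, "after": 4090},
    "256x256": {"ideal": 3752, "before": 19, "after": 2557},
    "512x512": {"ideal": 1553, "before": 19, "after": 927},
}


def summarize(bench):
    """Return the speedup and fraction of ideal throughput per image size."""
    return {
        size: {
            "speedup": row["after"] / row["before"],
            "fraction_of_ideal": row["after"] / row["ideal"],
        }
        for size, row in bench.items()
    }


if __name__ == "__main__":
    for size, s in summarize(BENCH).items():
        print(f"{size}: {s['speedup']:.1f}x faster, "
              f"{s['fraction_of_ideal']:.0%} of ideal GPU throughput")
```

The sCMOS path speeds up by roughly 1.6x, 135x, and 49x at the three sizes, and now reaches about 90%, 68%, and 60% of the ideal-GPU throughput.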

Test plan

  • Pkg.test() passes (80 tests + benchmarks + GPU wait tests on descent)
  • Documenter.jl docs build clean
  • CI passes on GitHub Actions

🤖 Generated with Claude Code

kalidke and others added 7 commits February 7, 2026 15:16
Matches GaussMLE convention. GPU jobs should wait longer before
falling back to CPU on shared servers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Match GaussMLE README pattern: Quick Start with input/detect/output,
Detection Modes table, Parameter Interfaces, Output Format tables,
real benchmark numbers from descent (RTX A6000).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three GPU optimizations:
- Single 3D kernel launch for variance-weighted sCMOS (eliminates per-frame overhead)
- GPU sparse coordinate extraction via findall (transfers ~1MB vs ~1GB)
- In-place DoG subtraction (saves one full-size GPU allocation, 10x→8x memory)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
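
The "10x→8x memory" figure in the commit message above ties in with memory-aware batching. Reading it as peak GPU memory relative to the size of the input stack (an assumption), a lower multiplier lets more frames fit in one batch. The Python sketch below illustrates that arithmetic. It is not the package's `recommend_batch_size()`: the function name, signature, safety fraction, and 4-byte pixel size are all hypothetical.

```python
def recommend_batch_size_sketch(free_bytes, height, width, *,
                                bytes_per_pixel=4,
                                memory_multiplier=8,
                                safety_fraction=0.8):
    """Estimate how many frames fit in GPU memory at once.

    memory_multiplier: assumed peak working memory as a multiple of the
    raw frame size (10 before the in-place DoG change, 8 after).
    safety_fraction: share of free memory the batch is allowed to use.
    """
    per_frame_bytes = height * width * bytes_per_pixel * memory_multiplier
    usable_bytes = int(free_bytes * safety_fraction)
    # Always process at least one frame, even when memory is tight.
    return max(1, usable_bytes // per_frame_bytes)


if __name__ == "__main__":
    one_gib = 1024 ** 3
    for multiplier in (10, 8):
        n = recommend_batch_size_sketch(one_gib, 256, 256,
                                        memory_multiplier=multiplier)
        print(f"multiplier {multiplier}x -> batch of {n} frames")
```

Under these assumptions, dropping the multiplier from 10 to 8 raises the batch size by about a quarter for the same amount of free memory.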
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
kalidke merged commit b57b2fc into main on Feb 8, 2026
3 of 4 checks passed