Merged
18 commits
152e08a
Implement tuple-pattern: getboxes returns (ROIBatch, BoxesInfo)
kalidke Feb 2, 2026
465890a
Add memory-aware batching to CPU path
kalidke Feb 2, 2026
8009959
Add batch tracking fields to BoxesInfo for ecosystem consistency
kalidke Feb 2, 2026
b3ba4e4
Change elapsed_ns to elapsed_s (Float64 seconds) in BoxesInfo
kalidke Feb 2, 2026
e88d2cb
Refactor: extract _process_with_batching helper to deduplicate GPU/CP…
kalidke Feb 2, 2026
b27f28c
Add BoxerConfig struct for config-based calling convention
kalidke Feb 4, 2026
24d4ff8
Update docs for BoxerConfig and two calling conventions
kalidke Feb 4, 2026
47b3346
Add AbstractSMLMConfig/AbstractSMLMInfo inheritance for BoxerConfig/B…
kalidke Feb 6, 2026
251ff2b
Fix GPU contention crash: use NVML for memory queries instead of cont…
kalidke Feb 6, 2026
6cdce29
Add GPU-to-CPU fallback for :auto mode during processing
kalidke Feb 6, 2026
dd11583
NVML-based GPU contention handling with two-layer fallback
kalidke Feb 6, 2026
8121ba6
Align kwargs with BoxerConfig: add on_wait field, drop use_gpu
kalidke Feb 6, 2026
435286f
Update SMLMData compat to 0.7 (registered version, no path dev)
kalidke Feb 7, 2026
05f32b4
Release GPU memory in Layer 2 catch block before CPU fallback
kalidke Feb 7, 2026
99ab4d1
Layer 2: retry GPU with remaining timeout instead of immediate CPU fa…
kalidke Feb 7, 2026
d86e49a
Consolidate GPU retry into single unified loop
kalidke Feb 7, 2026
5a2ed78
Reclaim GPU memory pool after successful processing
kalidke Feb 7, 2026
f373ea6
Update docs for BoxerConfig and two calling conventions
kalidke Feb 7, 2026
4 changes: 2 additions & 2 deletions Project.toml
@@ -1,6 +1,6 @@
name = "SMLMBoxer"
uuid = "a7d3b33b-b840-4f8e-b556-99be03134090"
version = "0.2.0"
version = "0.3.0"
authors = ["klidke@unm.edu"]

[deps]
@@ -15,7 +15,7 @@ cuDNN = "02a925ec-e4fe-4b08-9a7e-0d78e3d38ccd"
CUDA = "4, 5"
KernelAbstractions = "0.9"
NNlib = "0.9"
SMLMData = "0.6"
SMLMData = "0.7"
cuDNN = "1.4.6"
julia = "1.9"

83 changes: 75 additions & 8 deletions README.md
@@ -24,18 +24,62 @@ Pkg.add(url="https://github.com/JuliaSMLM/SMLMBoxer.jl")
## Usage
The main function provided by the package is `getboxes()`, which detects particles or blobs in a multidimensional image stack and returns an `ROIBatch` containing detected regions centered around local maxima. The function uses a Difference of Gaussians (DoG) filter optimized for blob detection and is capable of GPU acceleration.

### Example (Recommended - PSF-Aware Detection)
### Two Calling Conventions

**Config-based (recommended for reusable settings):**
```julia
using SMLMBoxer, SMLMData

camera = IdealCamera(1:256, 1:256, 0.1f0)
config = BoxerConfig(psf_sigma=0.13, min_photons=500.0, boxsize=11)

(roi_batch, info) = getboxes(imagestack, camera, config)
```

**Kwargs-based (convenient for one-off calls):**
```julia
(roi_batch, info) = getboxes(imagestack, camera;
psf_sigma = 0.13,
min_photons = 500.0,
boxsize = 11)
```

Both conventions are equivalent - kwargs are forwarded to a BoxerConfig internally.
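
A minimal sketch of how kwargs forwarding of this kind can work. `DemoConfig` and `demo_getboxes` are hypothetical stand-ins whose field names mirror the README; they are not the package's implementation.

```julia
# Hypothetical stand-in for BoxerConfig (illustration only).
Base.@kwdef struct DemoConfig
    psf_sigma::Union{Float64,Nothing} = nothing
    min_photons::Float64 = 500.0
    boxsize::Int = 7
end

# Config-based method: does the work (here it only echoes the settings).
demo_getboxes(stack, config::DemoConfig) =
    (psf_sigma = config.psf_sigma, min_photons = config.min_photons, boxsize = config.boxsize)

# Kwargs-based method: builds a config from the kwargs and delegates.
demo_getboxes(stack; kwargs...) = demo_getboxes(stack, DemoConfig(; kwargs...))

stack = zeros(Float32, 32, 32, 4)
a = demo_getboxes(stack, DemoConfig(psf_sigma = 0.13, boxsize = 11))
b = demo_getboxes(stack; psf_sigma = 0.13, boxsize = 11)
```

With this pattern an unknown keyword fails in the config constructor, so both entry points validate their arguments in one place.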

### BoxerConfig

Configuration struct for ROI detection. Construct it with keyword arguments; any field you omit takes its `@kwdef` default:

```julia
config = BoxerConfig(
# PSF-aware interface (recommended)
psf_sigma = 0.13, # PSF sigma in microns (requires camera)
min_photons = 500.0, # Minimum photons for detection

# Box parameters
boxsize = 11, # ROI size in pixels
overlap = 2.0, # Max overlap between detections

# Backend
backend = :auto # :cpu, :gpu, or :auto
)
```

### Example (PSF-Aware Detection)
```julia
using SMLMBoxer, SMLMData

# Setup camera
camera = IdealCamera(1:256, 1:256, 0.1f0) # 256×256 pixels, 100nm pixel size

# Detect with PSF-aware parameters (physical units)
roi_batch = getboxes(imagestack, camera;
(roi_batch, info) = getboxes(imagestack, camera;
psf_sigma = 0.13, # PSF sigma in microns (physical units)
min_photons = 500.0, # Minimum photon count to detect
boxsize = 11) # ROI size in pixels

# info contains: backend, elapsed_s, device_id, n_rois, batch_size, n_batches, memory_per_batch
println("Processed in ", info.elapsed_s * 1000, " ms on ", info.backend)
```

### Primary Parameters (PSF-Aware Interface)
@@ -67,24 +111,47 @@ For expert users who want direct control over the DoG filter:
- `on_wait::Function`: Optional callback for wait progress reporting.

### Returns
`ROIBatch` object with the following fields:
Tuple of `(ROIBatch, BoxesInfo)`:

**ROIBatch** with the following fields:
- `data`: ROI stack (boxsize × boxsize × n_rois) containing detected image patches.
- `x_corners`: Vector of x (column) corner positions in camera coordinates.
- `y_corners`: Vector of y (row) corner positions in camera coordinates.
- `frame_indices`: Vector of frame indices for each ROI.
- `camera`: Camera object for coordinate system tracking.
- `roi_size`: Size of each ROI.

**BoxesInfo** with processing metadata:
- `backend`: Compute backend used (`:gpu` or `:cpu`)
- `elapsed_s`: Wall time in seconds
- `device_id`: GPU device ID (0-based), or -1 for CPU
- `n_rois`: Number of ROIs detected
- `batch_size`: Frames per batch during processing
- `n_batches`: Number of batches processed
- `memory_per_batch`: Estimated memory per batch in bytes

### How It Works
The `getboxes()` function applies a Difference of Gaussians (DoG) filter to identify blob-like features. When using the PSF-aware interface, the filter scales are automatically matched to your PSF width for optimal detection sensitivity, and the photon threshold is converted to the appropriate intensity threshold accounting for PSF spreading and filter response.
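
As an illustration of the filter described above, here is a sketch of a DoG kernel built from a PSF width given in physical units. The helper names, the 3-sigma truncation, and the factor of 2 between the two scales are assumptions chosen for the example; the package's actual scale matching may differ.

```julia
# Normalized 2D Gaussian on a (2r+1)×(2r+1) grid; sigma is in pixels.
function gaussian_kernel(sigma::Real, r::Int)
    k = [exp(-(x^2 + y^2) / (2 * sigma^2)) for y in -r:r, x in -r:r]
    return k ./ sum(k)
end

# DoG kernel: narrow Gaussian minus wide Gaussian on the same grid.
function dog_kernel(sigma_small::Real, sigma_large::Real)
    r = ceil(Int, 3 * sigma_large)   # truncate at 3 sigma of the wider Gaussian
    return gaussian_kernel(sigma_small, r) .- gaussian_kernel(sigma_large, r)
end

# Convert the PSF width from microns to pixels (values from the README example).
psf_sigma_um = 0.13
pixel_size_um = 0.1
sigma_px = psf_sigma_um / pixel_size_um

# Assumption: the wide scale is twice the narrow one, for illustration only.
kernel = dog_kernel(sigma_px, 2 * sigma_px)
```

Because both Gaussians are normalized, the kernel sums to approximately zero, so a flat background produces no filter response while a blob near the PSF width produces a positive peak.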

## Additional Tools
### GPU Scheduling

In addition to the `getboxes()` function, *SMLMBoxer.jl* provides a number of lower-level tools that can be useful in processing and analyzing image stacks. These are not exported.
SMLMBoxer uses a unified GPU retry loop that handles multi-process contention on shared GPU servers:

- `SMLMBoxer.genlocalmaximage(imagestack, kernelsize; minval=0.0, use_gpu=false)`: Generates an image where local maxima in the original image are the only non-zero pixels.
1. **NVML polling** scans all GPUs for one with sufficient free memory and low contention (no CUDA context created)
2. **GPU processing** is attempted: context creation, DoG filtering, local max detection
3. **On any failure** (no memory, context race, runtime OOM): GPU memory is released via `GC.gc()` + `CUDA.reclaim()`, and NVML is polled again with the remaining timeout
4. **On timeout**: `:auto` falls back to CPU, `:gpu` errors

- `SMLMBoxer.findlocalmax(imagestack, kernelsize; minval=0.0, use_gpu=false)`: Returns the coordinates of local maxima in an image.
Memory is always reclaimed after GPU processing (both success and failure) so finished jobs don't block other processes waiting for GPU resources.

- `SMLMBoxer.convolve(imagestack, kernel; use_gpu=false)`: This function convolves an image stack with a given kernel.
```julia
# Wait up to 30s for GPU, fall back to CPU
(roi_batch, info) = getboxes(imagestack, camera;
psf_sigma=0.13, backend=:auto, auto_timeout=30.0)

# Monitor wait progress
(roi_batch, info) = getboxes(imagestack, camera;
psf_sigma=0.13, backend=:auto,
on_wait=(elapsed, avail, req) -> @info "Waiting..." elapsed avail req)
```
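
The retry behaviour described in this section can be sketched without a GPU. Everything below is hypothetical: `with_gpu_retry` and the `poll`, `run_gpu`, `run_cpu`, and `release` callbacks are stand-ins supplied by the caller, not SMLMBoxer or CUDA.jl functions.

```julia
# Hypothetical sketch of a unified retry loop (not the package's code).
function with_gpu_retry(poll, run_gpu, run_cpu, release;
                        backend::Symbol = :auto, timeout::Real = 30.0,
                        interval::Real = 0.05)
    backend === :cpu && return (run_cpu(), :cpu)
    start = time()
    while true
        device = poll()                  # device id, or nothing if no GPU qualifies
        if device !== nothing
            try
                result = run_gpu(device)
                release()                # reclaim after success
                return (result, :gpu)
            catch
                release()                # reclaim after failure, then keep retrying
            end
        end
        if time() - start >= timeout
            backend === :auto && return (run_cpu(), :cpu)
            error("GPU unavailable after $(timeout) s")
        end
        sleep(interval)
    end
end

# Simulated run: the first GPU attempt fails, the second succeeds.
attempts = Ref(0)
releases = Ref(0)
poll() = 0                               # pretend GPU 0 always looks free
function run_gpu(device)
    attempts[] += 1
    attempts[] < 2 && error("simulated out-of-memory")
    return "gpu result"
end
run_cpu() = "cpu result"
release() = (releases[] += 1; nothing)

result, used = with_gpu_retry(poll, run_gpu, run_cpu, release;
                              backend = :auto, timeout = 5.0)
```

Keeping a single loop means every failure mode goes through the same release-and-repoll path, and the timeout check is the only place where the `:auto` and `:gpu` modes diverge.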

149 changes: 125 additions & 24 deletions api_overview.md
@@ -4,8 +4,10 @@ Particle/blob detection in SMLM image stacks using difference-of-Gaussians filte

## Exports

**Total exports:** 4
**Total exports:** 6
- `getboxes` - Main detection function
- `BoxerConfig` - Configuration struct for detection parameters
- `BoxesInfo` - Metadata struct returned alongside ROIBatch
- `recommend_batch_size` - Memory-aware batch sizing utility
- `ROIBatch` - Re-exported from SMLMData.jl
- `SingleROI` - Re-exported from SMLMData.jl
@@ -32,33 +34,108 @@ When `SCMOSCamera` is provided, implements SMITE-style inverse variance weightin
- High-noise pixels: low weight (reduced false positives)
- GPU-accelerated via KernelAbstractions.jl (device-agnostic kernels)

### GPU Acceleration
- Standard DoG: NNlib with cuDNN backend (10-100× speedup)
### GPU Acceleration and Scheduling
- Standard DoG: NNlib with cuDNN backend (10-100x speedup)
- Variance-weighted: KernelAbstractions custom kernels (same code for CPU/GPU)
- Multi-GPU support: Automatically selects GPU with most free memory
- Memory waiting: Waits for GPU memory availability instead of crashing when busy
- Backend selection: `:cpu`, `:gpu`, or `:auto` with configurable timeouts
- Multi-GPU support: NVML-based polling selects GPU with most free memory across all devices

**Unified GPU retry loop** handles multi-process contention:
1. Poll all GPUs via NVML (no CUDA context creation) for sufficient free memory and low utilization
2. Acquire GPU context and run processing
3. On any failure (no memory, TOCTOU context race, runtime OOM): release memory via `GC.gc() + CUDA.reclaim()`, re-poll NVML with remaining timeout
4. On timeout: `:auto` falls back to CPU, `:gpu` errors
5. On success: reclaim GPU memory pool so finished jobs don't block other processes

**Backend modes:**
- `:cpu` - Always CPU, no GPU involvement
- `:gpu` - Require GPU, retry until `gpu_timeout` (default: Inf), error if unavailable
- `:auto` - Try GPU, retry until `auto_timeout` (default: 30s), fall back to CPU
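
The three modes amount to a choice of timeout plus an action on expiry. The helper below is a hypothetical summary of that mapping using the documented defaults; `timeout_policy` is not part of the package.

```julia
# Hypothetical helper: which timeout applies and what happens when it expires.
function timeout_policy(backend::Symbol; auto_timeout = 30.0, gpu_timeout = Inf)
    backend === :cpu  && return (timeout = 0.0, on_timeout = :none)
    backend === :gpu  && return (timeout = gpu_timeout, on_timeout = :error)
    backend === :auto && return (timeout = auto_timeout, on_timeout = :cpu_fallback)
    throw(ArgumentError("backend must be :cpu, :gpu, or :auto, got :$backend"))
end

auto_policy = timeout_policy(:auto)
gpu_policy = timeout_policy(:gpu; gpu_timeout = 60.0)
```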

**Wait progress callback:**
```julia
config = BoxerConfig(
psf_sigma=0.13,
backend=:auto,
on_wait=(elapsed, available, required) -> @info "Waiting for GPU" elapsed available required
)
```

## Configuration

### `BoxerConfig`

Configuration struct for ROI detection parameters. Supports `@kwdef` construction with defaults.

```julia
@kwdef struct BoxerConfig
# PSF-aware interface (recommended)
psf_sigma::Union{Float64,Nothing} = nothing # PSF sigma in microns
min_photons::Float64 = 500.0 # Minimum photons for detection

# Advanced interface (direct control)
sigma_small::Float64 = 1.0 # Small Gaussian sigma in pixels
sigma_large::Float64 = 2.0 # Large Gaussian sigma in pixels
minval::Float64 = 0.0 # DoG intensity threshold

# Box parameters
boxsize::Int = 7 # ROI size in pixels
overlap::Float64 = 2.0 # Max overlap between detections

# Backend parameters
backend::Symbol = :auto # :cpu, :gpu, or :auto
auto_timeout::Float64 = 30.0 # Max wait for GPU in :auto mode
gpu_timeout::Float64 = Inf # Max wait in :gpu mode
on_wait::Union{Function,Nothing} = nothing # Optional wait progress callback
end
```

**Usage:**
```julia
# PSF-aware (recommended)
config = BoxerConfig(psf_sigma=0.13, min_photons=500.0, boxsize=11)

# Advanced (direct control)
config = BoxerConfig(sigma_small=1.5, sigma_large=3.0, minval=10.0)

# GPU-specific
config = BoxerConfig(psf_sigma=0.13, backend=:gpu, gpu_timeout=60.0)
```

## Core Function

### `getboxes(imagestack, camera=nothing; kwargs...) -> ROIBatch`
### `getboxes` - Two Calling Conventions

**Config-based (recommended for reusable settings):**
```julia
getboxes(imagestack, camera, config::BoxerConfig) -> (ROIBatch, BoxesInfo)
```

**Kwargs-based (convenient for one-off calls):**
```julia
getboxes(imagestack, camera=nothing; kwargs...) -> (ROIBatch, BoxesInfo)
```

Both conventions are equivalent - kwargs are forwarded to a BoxerConfig internally.

Main detection function. Applies DoG filtering, finds local maxima, extracts ROI patches.

**Arguments:**
- `imagestack::AbstractArray{<:Real}` - Input image stack (2D or 3D)
- `camera::Union{AbstractCamera,Nothing}` - Camera object (IdealCamera or SCMOSCamera)
- `config::BoxerConfig` - Configuration struct (config-based convention)

**Primary Interface (PSF-Aware, Recommended):**
**Kwargs (kwargs-based convention):**

*PSF-Aware Interface (Recommended):*
- `psf_sigma::Real` - PSF sigma in microns (requires camera for pixel size conversion)
- `min_photons::Real` - Minimum total photons for detection (default: 500.0)

**Advanced Interface (Direct Control):**
*Advanced Interface (Direct Control):*
- `sigma_small::Real` - Small Gaussian sigma in pixels (default: 1.0)
- `sigma_large::Real` - Large Gaussian sigma in pixels (default: 2.0)
- `minval::Real` - DoG intensity threshold (default: 0.0)

**Other Parameters:**
*Other Parameters:*
- `boxsize::Int` - ROI size in pixels (default: 7)
- `overlap::Real` - Maximum overlap between detections in pixels (default: 2.0)
- `backend::Symbol` - Compute backend: `:cpu`, `:gpu`, or `:auto` (default: `:auto`)
@@ -69,14 +146,25 @@ Main detection function. Applies DoG filtering, finds local maxima, extracts ROI
- `gpu_timeout::Real` - Max seconds to wait in `:gpu` mode (default: Inf)
- `on_wait::Function` - Optional callback `(elapsed, available, required) -> nothing` for wait progress

**Returns:** `ROIBatch` with fields:
**Returns:** Tuple of `(ROIBatch, BoxesInfo)`

`ROIBatch` with fields:
- `data` - ROI stack (boxsize × boxsize × n_rois)
- `x_corners` - Vector of x (column) corner positions
- `y_corners` - Vector of y (row) corner positions
- `frame_indices` - Vector of frame indices for each ROI
- `camera` - Camera object (provided or default IdealCamera)
- `roi_size` - Size of each ROI (square)

`BoxesInfo` with fields:
- `backend` - Compute backend used (`:gpu` or `:cpu`)
- `elapsed_s` - Wall time in seconds
- `device_id` - GPU device ID (0-based), or -1 for CPU
- `n_rois` - Number of ROIs detected
- `batch_size` - Frames per batch during processing
- `n_batches` - Number of batches processed
- `memory_per_batch` - Estimated memory per batch in bytes

### `recommend_batch_size(height, width; backend=:auto, memory_fraction=0.8) -> Int`

Returns the recommended maximum number of frames to load at once given memory constraints.
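
To make the idea of memory-aware sizing concrete, here is a rough sketch of how such an estimate could be computed. `estimate_batch_size`, the explicit `free_bytes` argument, and the `copies` factor (working arrays per frame) are assumptions for illustration and do not describe how `recommend_batch_size` is implemented.

```julia
# Hypothetical estimate, not the package's implementation.
# Assumes Float32 frames and `copies` working arrays per frame.
function estimate_batch_size(height::Int, width::Int, free_bytes::Integer;
                             memory_fraction::Real = 0.8, copies::Int = 4)
    bytes_per_frame = height * width * sizeof(Float32) * copies
    budget = floor(Int, free_bytes * memory_fraction)
    return max(1, budget ÷ bytes_per_frame)   # always allow at least one frame
end

# 256×256 frames with 8 GiB free
n_frames = estimate_batch_size(256, 256, 8 * 1024^3)
```
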
@@ -112,7 +200,7 @@ println("Load up to $max_frames frames at a time")
for chunk_start in 1:max_frames:total_frames
chunk_end = min(chunk_start + max_frames - 1, total_frames)
imagestack = load_frames(chunk_start:chunk_end)
roi_batch = getboxes(imagestack, camera; psf_sigma=0.13)
(roi_batch, info) = getboxes(imagestack, camera; psf_sigma=0.13)
# ... process results
end
```
@@ -145,8 +233,12 @@ using SMLMBoxer, SMLMData
# Create camera with physical pixel size (100nm pixels)
camera = IdealCamera(1:256, 1:256, 0.1f0)

# Detect particles with PSF-aware thresholding
roi_batch = getboxes(imagestack, camera;
# Config-based (recommended for reusable settings)
config = BoxerConfig(psf_sigma=0.13, min_photons=500.0, boxsize=11)
(roi_batch, info) = getboxes(imagestack, camera, config)

# OR kwargs-based (convenient for one-off calls)
(roi_batch, info) = getboxes(imagestack, camera;
psf_sigma = 0.13, # 130nm PSF in microns
min_photons = 500.0, # Minimum 500 photons
boxsize = 11)
@@ -157,6 +249,10 @@ boxes = roi_batch.data # 11×11×n ROI patches
positions_x = roi_batch.x_corners # Column positions
positions_y = roi_batch.y_corners # Row positions
frames = roi_batch.frame_indices

# Check processing info
println("Backend: ", info.backend)
println("Elapsed: ", info.elapsed_s * 1000, " ms")
```

### sCMOS Variance-Weighted Detection
@@ -168,7 +264,7 @@ readnoise_map = Float32.(load_readnoise_calibration("camera_calib.mat"))
camera = SCMOSCamera(256, 256, 0.1f0, readnoise_map)

# Variance-weighted detection (automatically enabled)
roi_batch = getboxes(imagestack, camera;
(roi_batch, info) = getboxes(imagestack, camera;
psf_sigma = 0.13,
min_photons = 300.0, # Lower threshold possible with noise weighting
backend = :auto) # GPU with CPU fallback
@@ -177,7 +273,13 @@ roi_batch = getboxes(imagestack, camera;
### Advanced: Direct Parameter Control
```julia
# Expert mode: bypass PSF-aware interface
roi_batch = getboxes(imagestack;

# Config-based
config = BoxerConfig(sigma_small=1.5, sigma_large=3.0, minval=10.0, boxsize=9, overlap=1.5)
(roi_batch, info) = getboxes(imagestack, nothing, config)

# OR kwargs-based
(roi_batch, info) = getboxes(imagestack;
sigma_small = 1.5, # Custom filter sigma (pixels)
sigma_large = 3.0,
minval = 10.0, # Direct intensity threshold
@@ -187,7 +289,7 @@ roi_batch = getboxes(imagestack;

### Processing Individual ROIs
```julia
roi_batch = getboxes(imagestack, camera; psf_sigma=0.13)
(roi_batch, info) = getboxes(imagestack, camera; psf_sigma=0.13)

# Iterate over ROIs
for roi in roi_batch
@@ -196,7 +298,7 @@ for roi in roi_batch
# - roi.corner: (x, y) corner position
# - roi.frame_idx: Frame index
# - roi.camera: Camera ROI calibration

fit_gaussian(roi.data)
end

@@ -276,10 +378,9 @@ Implements optimal inverse variance weighting for spatially-varying noise.

- **GPU Memory Management**: Automatically batches frames if image stack exceeds GPU memory
- **Type Stability**: All inputs converted to Float32 at entry point
- **Multi-GPU Support**: `find_best_gpu()` selects GPU with most free memory when multiple GPUs available
- **Memory Waiting**: With `:gpu` or `:auto` backend, waits for GPU memory instead of crashing
- `:auto` waits up to 30s then falls back to CPU
- `:gpu` waits indefinitely (or until `gpu_timeout`)
- Uses polling with jittered backoff to avoid thundering herd
- **Multi-GPU NVML Polling**: Scans all GPUs via NVML without creating CUDA contexts. Checks free memory, process contention, and compute utilization. First GPU with sufficient memory and low contention wins.
- **Contention-Safe Retry**: Unified retry loop handles all GPU failure modes (insufficient memory, TOCTOU context race, runtime OOM). Releases memory and re-polls with remaining timeout budget.
- **Memory Pool Reclaim**: Calls `GC.gc() + CUDA.reclaim()` after both successful and failed GPU processing to return memory to the system, preventing finished jobs from blocking other processes.
- **Jittered Backoff**: NVML polling uses jittered sleep intervals to avoid thundering herd when multiple processes compete for GPUs.
- **Backend Abstraction**: KernelAbstractions enables same code for CPU/GPU variance weighting
- **Typical Speedup**: 10-100× with GPU depending on image size and number of frames
- **Typical Speedup**: 10-100x with GPU depending on image size and number of frames
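
For readers unfamiliar with jittered backoff, a minimal sketch follows. The `jittered_delay` helper, the base interval, and the ±25% jitter range are assumptions for illustration; the package's actual polling intervals are not shown in this diff.

```julia
using Random

# Hypothetical jittered delay: the base interval scaled by a random factor
# drawn from [1 - jitter, 1 + jitter).
jittered_delay(base::Real; jitter::Real = 0.25, rng = Random.default_rng()) =
    base * (1 + jitter * (2 * rand(rng) - 1))

rng = MersenneTwister(1)
delays = [jittered_delay(0.5; rng = rng) for _ in 1:5]
```

Because each process draws its own delay, processes that start waiting at the same moment drift apart instead of polling in lockstep.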