Re-scoped 2026-05-31. The original target (hvantk mkmatrix ucsc) no longer exists — the mkmatrix/mktable CLIs were retired in the plugin-system refactor (#109/#110). Long-running builds now run through hvantk reprocess <plugin>:<dataset>, and the UCSC builder survives only as the plugin skills/ucsc_cellbrowser/builder.py. The underlying problem is unchanged and now applies to reprocess.
Problem
Long-running plugin builds via hvantk reprocess (e.g. UCSC cell-browser atlases — 520k cells, ~8.7 GB gzipped, 30–45 min) run with no user-visible progress. Streaming/build loops log at INFO via logger.info(...), but reprocess never configures the root logger, so nothing reaches the terminal. The user can't tell whether the command is working, stuck, or OOM-killed.
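The silence follows from Python's logging defaults: the root logger's level is WARNING, and with no handlers configured only the last-resort handler (also WARNING) writes to stderr. A minimal sketch of that check, assuming a hypothetical logger name in place of a real plugin module:

```python
import logging

# Hypothetical name standing in for a plugin builder's module-level logger.
logger = logging.getLogger("example_plugin.builder")


def info_would_be_emitted(log: logging.Logger) -> bool:
    """Return True if an INFO record from `log` passes the effective level check.

    A logger with no level of its own inherits from its ancestors, ending at
    the root logger, whose default level is WARNING.
    """
    return log.isEnabledFor(logging.INFO)


if __name__ == "__main__":
    # With an unconfigured root logger this prints False: INFO is filtered out.
    print(info_would_be_emitted(logger))
    # Raising the root level (what logging.basicConfig(level=INFO) does,
    # in addition to attaching a stderr handler) flips the result to True.
    logging.getLogger().setLevel(logging.INFO)
    print(info_would_be_emitted(logger))
```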
Desired behaviour
Periodic progress feedback during long-running reprocess builds without requiring external logging config, plus an opt-in for full logs.
Proposed approach
- Default: terse progress line to stderr. In the reprocess lifecycle (download/parse/build stages), emit a one-line status to stderr (click.echo(..., err=True)) at stage boundaries and periodically inside streaming build loops, e.g.:
[reprocess ucsc-cellbrowser:expression] build: streamed 10,000 genes (1m 23s)
Writing to stderr keeps stdout clean for piping. The total row count is usually unknown upfront (a gzip stream does not expose its line count without being decompressed), so there is no hard ETA; a bytes-consumed percentage is a possible refinement where the source size is known.
- --verbose / -v flag on reprocess. Wire logging.basicConfig(level=INFO) early in the command so the builders' existing logger.info(...) lines surface. Must be set before Hail initializes to avoid clobbering Hail's log config.
- --quiet to suppress the default progress line for batch scripts.
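A rough sketch of the three behaviours above, using only the standard library so it runs standalone. `ProgressLine`, `format_elapsed`, and `enable_verbose_logging` are hypothetical names, not existing hvantk code; in the real command, `print(..., file=sys.stderr)` would presumably be `click.echo(..., err=True)`, and the 30-second interval is an arbitrary assumption.

```python
import logging
import sys
import time
from typing import Callable, Optional, TextIO


def format_elapsed(seconds: float) -> str:
    """Render a duration as e.g. '45s' or '1m 23s'."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m {secs}s" if minutes else f"{secs}s"


class ProgressLine:
    """Throttled one-line status writer (hypothetical helper, not hvantk API)."""

    def __init__(
        self,
        label: str,
        stream: Optional[TextIO] = None,
        interval: float = 30.0,  # assumed default; seconds between loop updates
        quiet: bool = False,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.label = label
        self.stream = stream if stream is not None else sys.stderr
        self.interval = interval
        self.quiet = quiet
        self.clock = clock
        self.start = clock()
        self.last_emit = self.start

    def stage(self, name: str) -> None:
        """Report a stage boundary; never throttled."""
        self._write(f"{name}: started")

    def update(self, stage: str, count: int, unit: str) -> None:
        """Report from inside a loop, at most once per `interval` seconds."""
        now = self.clock()
        if now - self.last_emit < self.interval:
            return
        self.last_emit = now
        elapsed = format_elapsed(now - self.start)
        self._write(f"{stage}: streamed {count:,} {unit} ({elapsed})")

    def _write(self, message: str) -> None:
        # --quiet suppresses every progress line.
        if not self.quiet:
            print(f"[reprocess {self.label}] {message}", file=self.stream, flush=True)


def enable_verbose_logging() -> None:
    """--verbose: surface the builders' existing logger.info(...) lines.

    Note that basicConfig is a no-op if the root logger already has handlers.
    """
    logging.basicConfig(level=logging.INFO)
```

Injecting the clock and the stream keeps the throttling testable without sleeping, and lets the caller decide where the line goes.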
Scope
- Implement on hvantk reprocess (covers all plugin builds, incl. the UCSC streaming builder).
- A progress hook the builder can call (rather than per-plugin flags) keeps it source-agnostic.
- Could extend to expression summarize / hgc pipeline later.
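One possible shape for the progress hook mentioned in the scope list: the builder wraps its row iterator and calls an optional callback, so it never needs to know whether output goes to stderr, a log, or nowhere. `ProgressHook` and `stream_with_progress` are hypothetical names and the `(stage, count)` signature is an assumption, not an existing plugin interface.

```python
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")

# Assumed hook signature: (stage name, number of items processed so far).
ProgressHook = Callable[[str, int], None]


def stream_with_progress(
    rows: Iterable[T],
    stage: str,
    progress: Optional[ProgressHook] = None,
    every: int = 10_000,
) -> Iterator[T]:
    """Yield rows unchanged, calling `progress` every `every` rows.

    A final call reports the total if it was not already reported on an
    `every` boundary. With progress=None the wrapper is a plain pass-through.
    """
    count = 0
    for count, row in enumerate(rows, start=1):
        if progress is not None and count % every == 0:
            progress(stage, count)
        yield row
    if progress is not None and count % every != 0:
        progress(stage, count)


if __name__ == "__main__":
    def report(stage: str, count: int) -> None:
        print(f"{stage}: streamed {count:,} rows")

    for _ in stream_with_progress(range(25), "build", progress=report, every=10):
        pass
```

The command would own the hook and pass it into each builder, which keeps per-plugin flags out of the picture.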