feat(mkmatrix): add progress reporting for long-running builds #92

@enriquea

Description

Re-scoped 2026-05-31. The original target (hvantk mkmatrix ucsc) no longer exists — the mkmatrix/mktable CLIs were retired in the plugin-system refactor (#109/#110). Long-running builds now run through hvantk reprocess <plugin>:<dataset>, and the UCSC builder survives only as the plugin skills/ucsc_cellbrowser/builder.py. The underlying problem is unchanged and now applies to reprocess.

Problem

Long-running plugin builds via hvantk reprocess (e.g. UCSC cell-browser atlases: 520k cells, ~8.7 GB gzipped, 30–45 min) run with no user-visible progress. Streaming/build loops log at INFO via logger.info(...), but reprocess never configures the root logger, and Python's fallback handler only emits WARNING and above, so nothing reaches the terminal. The user can't tell whether the command is working, stuck, or OOM-killed.
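
To illustrate why the existing INFO lines vanish, here is a minimal stdlib-only sketch. It does not touch hvantk; the logger name `hypothetical.builder` and the messages are made up for the demo. It temporarily strips the root logger's handlers to mimic an unconfigured process, then captures whatever reaches stderr.

```python
import contextlib
import io
import logging


def emitted_by_unconfigured_logger() -> str:
    """Return what reaches stderr when the root logger has no handlers."""
    root = logging.getLogger()
    saved_handlers, saved_level = root.handlers[:], root.level
    root.handlers.clear()           # mimic a process that never configured logging
    root.setLevel(logging.WARNING)  # the root logger's default level
    buf = io.StringIO()
    try:
        with contextlib.redirect_stderr(buf):
            log = logging.getLogger("hypothetical.builder")  # illustrative name
            log.info("streamed 10,000 genes")  # dropped: below WARNING
            log.warning("low disk space")      # emitted by logging's last-resort handler
    finally:
        # Restore whatever configuration was in place before the demo.
        root.handlers[:] = saved_handlers
        root.setLevel(saved_level)
    return buf.getvalue()


if __name__ == "__main__":
    print(repr(emitted_by_unconfigured_logger()))
```

Only the WARNING line comes back, which matches the symptom described above: the builders do log progress, but at a level that an unconfigured process never shows.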

Desired behaviour

Periodic progress feedback during long-running reprocess builds without requiring external logging config, plus an opt-in for full logs.

Proposed approach

  1. Default: terse progress line to stderr. In the reprocess lifecycle (download/parse/build stages), emit a one-line status to stderr (click.echo(..., err=True)) at stage boundaries and periodically inside streaming build loops, e.g.:

    [reprocess ucsc-cellbrowser:expression] build: streamed 10,000 genes (1m 23s)
    

    Writing to stderr keeps stdout clean for piping. The total row count is usually unknown upfront (a gzip stream doesn't advertise it), so there is no hard ETA; a bytes-consumed percentage is a possible refinement where the source size is known.

  2. --verbose / -v flag on reprocess. Call logging.basicConfig(level=logging.INFO) early in the command so the builders' existing logger.info(...) lines surface. This must happen before Hail initializes to avoid clobbering Hail's log config.

  3. --quiet to suppress the default progress line for batch scripts.
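
A rough sketch of how the three pieces above could fit together, using only the standard library so it runs anywhere. `ProgressReporter`, `format_elapsed`, `configure_logging`, the 30-second interval, and the injectable `stream`/`clock` parameters are all hypothetical names and choices, not existing hvantk code. In the real command the flags would presumably be click options and the write could go through click.echo(..., err=True) instead of print.

```python
import logging
import sys
import time
from typing import Callable, Optional, TextIO


def format_elapsed(seconds: float) -> str:
    """Render elapsed seconds as e.g. '1m 23s' or '45s'."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}m {secs}s" if minutes else f"{secs}s"


class ProgressReporter:
    """Hypothetical helper: one-line status messages, rate-limited inside loops."""

    def __init__(self, label: str, *, quiet: bool = False, interval: float = 30.0,
                 stream: Optional[TextIO] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.label = label        # e.g. "<plugin>:<dataset>"
        self.quiet = quiet        # --quiet suppresses the progress line
        self.interval = interval  # minimum seconds between tick() lines (assumed value)
        self.stream = stream      # defaults to stderr at emit time
        self.clock = clock
        self.started = clock()
        self.last_emit = self.started

    def _emit(self, stage: str, message: str) -> None:
        now = self.clock()
        self.last_emit = now
        if self.quiet:
            return
        out = self.stream if self.stream is not None else sys.stderr
        elapsed = format_elapsed(now - self.started)
        print(f"[reprocess {self.label}] {stage}: {message} ({elapsed})",
              file=out, flush=True)

    def stage(self, stage: str, message: str) -> None:
        """Always report at a stage boundary (download/parse/build)."""
        self._emit(stage, message)

    def tick(self, stage: str, message: str) -> None:
        """Report from inside a loop, at most once per interval."""
        if self.clock() - self.last_emit >= self.interval:
            self._emit(stage, message)


def configure_logging(verbose: bool) -> None:
    """Surface the builders' logger.info(...) lines only when --verbose is set."""
    if verbose:
        # Note: basicConfig is a no-op if the root logger already has handlers.
        logging.basicConfig(level=logging.INFO)
```

A builder loop would call `reporter.tick("build", f"streamed {n:,} genes")` on every row or batch; the rate limit keeps the terminal to roughly one line per interval regardless of how often it is called.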

Scope

  • Implement on hvantk reprocess (covers all plugin builds, incl. the UCSC streaming builder).
  • A progress hook the builder can call (rather than per-plugin flags) keeps it source-agnostic.
  • Could extend to expression summarize / hgc pipeline later.
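
One possible shape for the source-agnostic progress hook, as a sketch only: the builder accepts an optional callback and knows nothing about flags or terminals. `ProgressHook`, `stream_rows`, the `(stage, count)` signature, and the `every` parameter are illustrative assumptions, not the actual builder API.

```python
from typing import Callable, Iterable, Iterator, Optional

# Hypothetical hook signature: (stage name, items processed so far).
ProgressHook = Callable[[str, int], None]


def stream_rows(rows: Iterable[str],
                on_progress: Optional[ProgressHook] = None,
                every: int = 10_000) -> Iterator[str]:
    """Yield rows unchanged, calling the hook every `every` rows and once at the end."""
    count = 0
    for count, row in enumerate(rows, start=1):
        if on_progress is not None and count % every == 0:
            on_progress("build", count)
        yield row
    if on_progress is not None:
        on_progress("done", count)


if __name__ == "__main__":
    events = []
    consumed = list(stream_rows(["row"] * 25,
                                on_progress=lambda stage, n: events.append((stage, n)),
                                every=10))
    print(len(consumed), events)
```

With this shape, reprocess decides what the hook does (print a progress line, stay silent under --quiet), and each plugin builder only has to call it.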

Metadata

Labels: enhancement (New feature or request)
