| 1 | +# Consensus: profiling outcome & the Jansson decision
| 2 | + |
| 3 | +**For MRS — read this first.** You asked me to implement Jansson's algorithm |
| 4 | +alongside the current O(kn) consensus, optimise both, and let data pick a winner. |
| 5 | + |
| 6 | +## TL;DR |
| 7 | + |
| 8 | +1. **I did NOT implement the full Jansson (Maj_Rule_Plus) algorithm.** The data |
| 9 | + say it cannot beat the (now-optimised) hashed counter in any realistic, |
| 10 | + time-meaningful regime, and it is high-risk to get right (the authors' own |
| 11 | + reference impl, FACT, ships **broken** majority code). The exact algorithm and |
| 12 | + what a future implementation would need are written down below so it can be |
| 13 | + built on request. |
| 14 | +2. **I optimised the existing hashed counter** (deferred split materialisation): |
| 15 | + **up to 13× faster** at high n (tall trees), median **1.23×**, with **zero |
| 16 | + change to results** (split sets identical to the deterministic `exact` path, |
| 17 | + verified at n up to 3000). This is the real, shipping win. |
| 18 | +3. **hashed stays the default; `exact` stays the deterministic fallback.** No |
| 19 | + crossover that warrants switching algorithms; one approach (hashed) wins |
| 20 | + across the board — so, per your "avoid redundant code" steer, no third path. |
| 21 | + |
| 22 | +## Why not Jansson — the data |
| 23 | + |
| 24 | +Jansson's deterministic O(kn) majority algorithm exists to **count cluster |
| 25 | +frequencies without hashing**, by matching each input tree against an evolving |
| 26 | +O(n)-cluster candidate tree (Day's algorithm + `One_Way_Compatible` + |
| 27 | +`Merge_Trees` + delete/insert). Its only advantages over the current code are
| 28 | +*determinism* (no ~1e-30 hash-collision risk, which you've explicitly waived)
| 29 | +and avoiding `exact`'s O(k·n·h) cost on tall trees.
| 30 | + |
| 31 | +**Measured lower bound (rigorous):** Jansson ≥ 2× the strict-consensus path:
| 32 | +Phase 2 ≈ strict's k Day-matchings, and Phase 1 ≥ Phase 2 because it adds
| 33 | +`One_Way_Compatible` + `Merge_Trees` + per-iteration table rebuilds. Strict's
| 34 | +inner loop is *tighter* than Jansson's, so this under-counts Jansson and the
| 35 | +verdict is conservative. Where `2×strict ≥ hashed` (ratio ≥ 1 in the table),
| 36 | +**Jansson provably loses**. Results vs the optimised hashed (`drivers/jansson-bound.R`):
| 37 | + |
| 38 | regime (n, k) | concordance | 2×strict / hashed | verdict |
| 39 | +|--------|-------------|-------------------|---------| |
| 40 | +| high-k/low-n (all) | any | 1.1–1.8 | **Jansson loses (proven)** | |
| 41 | +| high-k/high-n (2000,500) | concordant / moderate | 1.24 / 1.54 | **loses (proven)** | |
| 42 | +| high-k/high-n (1000,1000) | concordant / moderate | 1.32 / 1.42 | **loses (proven)** | |
| 43 | +| low-k/high-n (5000,10) | concordant / moderate | 0.99 / 1.09 | loses / borderline | |
| 44 | +| low-k/high-n (10000,5) | any | 0.45–1.07 | not proven — but <60 ms absolute | |
| 45 | +| high-k/high-n (≥2000,≥500) | **extreme** (rand/tall) | 0.44–0.56 | not proven | |
| 46 | + |
| 47 | +**The only cells where Jansson is not proven to lose are** (a) sub-60 ms |
| 48 | +absolute times (low-k/very-high-n — a 2× gap is <30 ms, irrelevant), or |
| 49 | +(b) **extreme-conflict** input (≈ independent random trees), whose majority |
| 50 | +consensus is a near-empty star — not something anyone computes. Realistic |
| 51 | +consensus inputs (bootstrap / Bayesian / MPT sets, even conflicting gene-tree |
| 52 | +sets) are concordant-to-moderate, and there Jansson provably loses. |
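
The decision rule above reduces to one comparison per benchmark cell. A minimal C++ sketch, assuming two measured wall times in the same unit; `bound_ratio` and `verdict` are hypothetical helper names, not functions from TreeTools or from `drivers/jansson-bound.R`:

```cpp
#include <cassert>
#include <string>

// Hypothetical helpers (illustration only, not part of TreeTools).
// The text's lower bound is: Jansson >= 2 x strict.
// bound_ratio() expresses that bound as a fraction of the hashed time.
inline double bound_ratio(double strict_ms, double hashed_ms) {
  assert(strict_ms >= 0.0 && hashed_ms > 0.0);
  return 2.0 * strict_ms / hashed_ms;
}

// If the lower bound already meets or exceeds hashed, Jansson cannot win
// that cell; otherwise the bound says nothing either way.
inline std::string verdict(double strict_ms, double hashed_ms) {
  return bound_ratio(strict_ms, hashed_ms) >= 1.0 ? "loses (proven)"
                                                  : "not proven";
}
```

With illustrative (not measured) timings of 62 ms for strict and 100 ms for hashed, the ratio is 1.24 and the cell is classed as proven; at 22 ms versus 100 ms the ratio is 0.44 and the bound proves nothing either way.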
| 53 | + |
| 54 | +This is consistent with the authors' own C++ `Maj_Rule_Plus` timings (~10× |
| 55 | +slower than TreeTools' hashed at small n). |
| 56 | + |
| 57 | +**Net:** building, proving, and CRAN-hardening `One_Way_Compatible` + |
| 58 | +`Merge_Trees` + a dynamic delete/insert tree — the exact machinery FACT got |
| 59 | +wrong — to maybe win an unrealistic corner by a few ms is a bad trade against |
| 60 | +"correctness paramount / avoid redundant code". **Re-openable**: if you want the |
| 61 | +deterministic-O(kn) guarantee regardless, the algorithm is |
| 62 | +Maj_Rule_Plus Phase 1 (Fig. 2 of arXiv:1307.7821) + a standard-majority Phase 2 |
| 63 | +(keep `count > k·p` instead of `K(v) > Q(v)`); subroutine specs in |
| 64 | +PLAN-consensus.md. Say the word. |
| 65 | + |
| 66 | +## The optimisation that shipped (C-001) |
| 67 | + |
| 68 | +`count_splits_hashed` no longer materialises every distinct split's bit pattern |
| 69 | +eagerly. Each distinct split keeps a 12-byte witness `(tree, L, R)`; the packed |
| 70 | +pattern is rebuilt only for splits that reach the consensus threshold (or all, |
| 71 | +for `SplitFrequency`). At high n the wasted materialisation of non-surviving |
| 72 | +splits was the dominant cost. Verified speedups (min of reps, |
| 73 | +`drivers/compare-grids.R`): tall(10000,5) **×13.0**, tall(5000,10) ×8.6, |
| 74 | +rand(10000,5) ×2.5, high-k/high-n ×1.5–2.9, median ×1.23. **Results identical to |
| 75 | +the deterministic `exact` path in every cell.** Both rewired paths are gated at |
| 76 | +shipping scale: `correctness-gate.R` (590 checks) verifies consensus split SETS |
| 77 | +at n≤3000 and `SplitFrequency` split sets AND counts at n=2000/5000 (hashed == |
| 78 | +exact); package `test-consensus.R` (8/8) and `test-Support.R` (6/6) pass; |
| 79 | +`verify-consensus.R` green. |
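
The deferred-materialisation idea can be sketched as follows. This is an illustration under stated assumptions, not the shipped `count_splits_hashed` code: it assumes `L` and `R` are an inclusive range of positions in the witness tree's leaf order, that the caller supplies a 64-bit hash per split, and it ignores hash collisions and split polarity. All type and function names (`Witness`, `Entry`, `Seen`, `materialise`, `surviving_splits`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// 12-byte witness: the tree a split was first seen in, plus an inclusive
// range [L, R] of positions in that tree's leaf order (ASSUMED meaning).
struct Witness { std::int32_t tree, L, R; };
static_assert(sizeof(Witness) == 12, "witness is expected to stay 12 bytes");

struct Entry { Witness first_seen; std::int32_t count; };
struct Seen { std::uint64_t hash; Witness where; };  // one observed split

using LeafOrder = std::vector<std::int32_t>;  // tip ids in traversal order
using Bits = std::vector<std::uint64_t>;      // packed split, 64 tips/word

// Rebuild the packed bit pattern for one split from its witness.
inline Bits materialise(const Witness& w, const std::vector<LeafOrder>& orders,
                        std::size_t n_tip) {
  assert(w.L <= w.R);
  Bits bits((n_tip + 63) / 64, 0);
  const LeafOrder& order = orders[static_cast<std::size_t>(w.tree)];
  for (std::int32_t i = w.L; i <= w.R; ++i) {
    const auto tip = static_cast<std::uint64_t>(order[static_cast<std::size_t>(i)]);
    bits[tip / 64] |= std::uint64_t{1} << (tip % 64);
  }
  return bits;
}

// Count by hash, keeping only a witness per distinct split; pack bits
// only for splits whose count exceeds k * p.
inline std::vector<Bits> surviving_splits(const std::vector<Seen>& seen,
                                          const std::vector<LeafOrder>& orders,
                                          std::size_t n_tip, double p) {
  std::unordered_map<std::uint64_t, Entry> table;
  for (const Seen& s : seen) {
    // emplace() leaves an existing entry (and its first witness) untouched.
    auto it = table.emplace(s.hash, Entry{s.where, 0}).first;
    ++it->second.count;
  }
  const double threshold = p * static_cast<double>(orders.size());
  std::vector<Bits> out;
  for (const auto& kv : table) {
    if (kv.second.count > threshold) {
      out.push_back(materialise(kv.second.first_seen, orders, n_tip));
    }
  }
  return out;
}
```

The counting loop touches only the hash and the 12-byte witness; the O(n) bit-packing runs once per surviving split rather than once per distinct split, which is where the saving at high n would come from.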
| 80 | + |
| 81 | +## What else the profiling found (not yet actioned) |
| 82 | + |
| 83 | +- **C-002 (open):** at **high-k/low-n** the R wrapper is **57%** of `Consensus()` |
| 84 | + wall time; `RenumberTips` is 54% of that. A safe fast-path (batch C++ relabel |
| 85 | + / skip when labels already consistent) is the next throughput win in that |
| 86 | + regime. Touches shared code → needs the full test suite as a gate. Deferred to |
| 87 | + avoid shipping a risky shared-code change while you're away. |
| 88 | +- **C-003 (low priority):** hashed's `unordered_map` churn dominates the |
| 89 | + low-height extreme-conflict case; only matters for degenerate inputs. |
| 90 | +- **Threshold convention (FYI, unchanged):** for 0.5<p<1 TreeTools keeps |
| 91 | + `count > k·p`; ape keeps `>= k·p`; roxygen says "p or more". Flagging — your call. |
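
The two threshold conventions differ only when `count` equals `k·p` exactly. A small C++ sketch of both rules; `keep_strict` and `keep_inclusive` are hypothetical names used for illustration, not TreeTools or ape functions:

```cpp
#include <cassert>

// Strict rule, as the note describes TreeTools for 0.5 < p < 1:
// keep a split only if count > k * p.
inline bool keep_strict(int count, int k, double p) {
  assert(k > 0 && p > 0.5 && p < 1.0);
  return count > k * p;
}

// Inclusive rule, as the note describes ape: keep if count >= k * p.
// The roxygen wording "p or more" also reads this way.
inline bool keep_inclusive(int count, int k, double p) {
  assert(k > 0 && p > 0.5 && p < 1.0);
  return count >= k * p;
}
```

With k = 4 and p = 0.75, a split seen in exactly 3 trees is dropped by the strict rule and kept by the inclusive one; a split seen in all 4 trees is kept by both. The rules can only disagree when `k·p` is a whole number, and since most values of p are not exactly representable in binary floating point, a tie may not land exactly where the arithmetic on paper says it should.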
| 92 | + |
| 93 | +## Correctness fixes made en route |
| 94 | + |
| 95 | +- `dev/red-team/verify-consensus.R` was **dead**: it called |
| 96 | + `consensus_tree(..., hash=FALSE)` but the arg is `exact` → "unused argument". |
| 97 | + Fixed (`exact = TRUE`); now green (0 failures). |
| 98 | +- Built a method-pluggable gate (`correctness-gate.R`) — hashed==exact==ape@0.5, |
| 99 | + 588 checks. (Found & fixed a `which.min(<character>)` bug in it that had been |
| 100 | + silently skipping every ape comparison.) |