Skip to content

Commit 74dcf97

Browse files
ms609claude
andauthored
Faster majority consensus & RenumberTips; Jansson evaluated (not implemented) (#274)
* Defer split materialisation in hashed consensus counter Majority/threshold Consensus() and SplitFrequency() built every distinct split's packed bit pattern eagerly on first sighting. At high tip counts most distinct splits never reach the consensus threshold, so that work was wasted. Each distinct split now keeps a 12-byte (tree, L, R) witness; the pattern is materialised only for splits that survive thresholding (consensus) or for all splits (SplitFrequency). Results are unchanged: split sets/counts identical to the deterministic exact path, verified at n up to 5000 (dev/profiling/correctness-gate.R, 590 checks; test-consensus 8/8; test-Support 6/6). Up to ~13x faster for large/tall trees, median 1.23x, no change at small sizes. Adds dev/profiling/ harness (correctness gate, benchmark grid, Jansson lower-bound analysis + DECISION.md explaining why the full deterministic Jansson algorithm is not implemented). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Relabel unlabelled forests in C++ in RenumberTips() RenumberTips() on an unlabelled multiPhylo or list previously relabelled each tree with a separate R call (RenumberTips.phylo per element). It now relabels the whole forest in a single C++ pass (renumber_tips_to), deriving each tree's permutation from one hash of the target labels, with a no-op pass-through for trees already in target order. The C++ helper returns NULL to fall back to the existing per-tree R path for anything it would not handle identically: numeric tipOrder (per-tree target), a label set differing from the target (different length, unknown/NA label, or non-bijection), duplicate/NA target labels, or non-phylo list elements. List names are preserved. Results are unchanged. 1.8-9x faster (9x at high-k/low-n: 5000 trees of 30 tips, 0.335s -> 0.037s); Consensus() and every other RenumberTips() caller share the win. Full test suite green; new equivalence test confirms the C++ path matches the per-tree path across mixed orderings. Telling label-match without looking: only a labelled multiPhylo (TipLabel attribute) guarantees a shared order without inspecting each tree; absent that, inspection is unavoidable but now done in C++. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Avoid copying every tree in Consensus() metadata strip Consensus() stripped edge.length/node.label from every input tree via a per-tree lapply that forced a copy of each tree. The consensus core reads only edge + tip.label and the output tree is rebuilt from scratch, so neither field can affect the result; the strip's only load-bearing job was coercing a labelled multiPhylo's shared tip labels onto each tree, which bare c() already does. Replace the strip with 'trees <- c(trees)' (~7x cheaper than the strip: 0.82->0.11 ms at k=201), cutting wrapper overhead ~25% on small forests of many short trees (forest201.80 3.38->2.54 ms), ~6% on forest1k.888 (strict C++ core dominates there). Output verified byte-identical: 108-cell before/after grid (plain / unaligned / edge.length / node.label / both / labelled / labelled+edge.length / list / differing-size x p x check.labels x hash) and a 36-cell core metadata-independence check. New test-Consensus.R block covers the previously untested metadata-bearing and labelled-multiPhylo paths. Full suite 4192 PASS; correctness gate 590 checks; verify-consensus green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Test Consensus() metadata-invariance through the differing-size path The KeepTip repeat-loop (triggered by trees of unequal size) was the one combination not covered by the C-004 metadata-invariance test. Add a case confirming edge.length does not change the (warned) result there. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Test p=0.5 tie resolution against ape; document p>0.5 threshold question The even-n tie fixture (2xBalancedTree + 2xPectinateTree, n=4) was only exercised by the "exact == hashed" test, which checks the two internal counters against each other -- it cannot catch an error shared by both. Add ApeTest(tie, 0.5): at p=0.5 a split in exactly n/2 trees is dropped (thresh = floor(n/2)+1), so the two conflicting 50% splits are both dropped, matching ape::consensus. Pins the tie boundary to an oracle. Separately, findings.md records an open question for the maintainer: for p > 0.5 the code keeps count >= floor(n*p)+1 (strictly more than floor(n*p)), whereas the docstring promises "p or more" and ape keeps count >= p*n. They diverge only when n*p is an exact integer (e.g. n=3, p=2/3: a clade in 2/3 of trees is kept by ape, dropped here). Both are safe; the convention/doc-consistency call is the maintainer's. No threshold change made. Repro: dev/profiling/drivers/tie-check.R. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Simplify comment * rm undue cmt * Retain splits at exactly proportion p for p>0.5 (match ape and docs) Consensus(trees, p) documented that it retains splits in "a proportion p or more" of trees, but the threshold was floor(n*p)+1 -- one tree too strict, so a split in exactly p*n trees was dropped at integer thresholds (e.g. a split in 2 of 3 trees with p = 2/3). ape::consensus keeps it (count >= p*n). Change the C++ threshold to thresh = max(floor(n/2)+1, ceil(p*n)), with a relative tolerance (1e-9*pn) snapping p*n to an exact integer through floating-point noise -- robust where ape's raw `bs >= p*ntree` is fragile. p = 0.5 is unchanged: the floor(n/2)+1 term keeps it strict (a split must occur in MORE than half the trees) so two conflicting 50% splits can never both be retained and the result stays a valid tree. Pin the boundary with ApeTest(thirds, 2/3) (clade in exactly 2/3 of trees). Update docstring/Rd and NEWS. Gate 590 PASS / 0 ape disagreements; full suite 4195 PASS. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Skip scale-only consensus tests under valgrind The mem-check (valgrind) job stalls for many minutes in test-consensus.R: the 100k-leaf heap-limit build, the 33k-tip consensus, and two 33k-tree consensus runs dominate runtime under valgrind's ~30x slowdown while exercising no memory path the smaller cases miss. Gate them behind a new TESTING_MEMCHECK env var (set in memcheck.yml). TESTING_MEMCHECK is deliberately separate from USING_ASAN: the latter suppresses deliberate-bad-input error tests that ASan aborts on but valgrind tolerates, so setting USING_ASAN here would silently disable those error-path tests under valgrind. Full-scale consensus tests still run in uninstrumented CI. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 776f851 commit 74dcf97

26 files changed

Lines changed: 1475 additions & 52 deletions

.github/workflows/memcheck.yml

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -47,6 +47,12 @@ jobs:
4747
RSPM: https://packagemanager.rstudio.com/cran/__linux__/noble/latest
4848
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
4949
ASAN_OPTIONS: verify_asan_link_order=0
50+
# Signals to the test suite that it is running under valgrind, so heavy
51+
# scale-only tests can be skipped: valgrind's ~30x slowdown makes them the
52+
# dominant cost while adding no memory-bug coverage beyond the small cases.
53+
# Distinct from USING_ASAN, which suppresses deliberate-bad-input error
54+
# tests that ASan (but not valgrind) aborts on.
55+
TESTING_MEMCHECK: true
5056

5157
steps:
5258
- uses: ms609/actions/memcheck@main

.gitignore

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -28,3 +28,11 @@ vignettes/*_cache*
2828
/*-benchmark-results
2929
*.bench.Rds
3030
.positai
31+
32+
# /profile skill artefacts (large; rebuilt each round)
33+
dev/profiling/.vtune-lib-*/
34+
dev/profiling/.libpath.txt
35+
dev/profiling/result_*/
36+
dev/profiling/drivers/*-profvis.html
37+
dev/profiling/*.csv
38+
dev/profiling/*.rds

NEWS.md

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,9 +1,30 @@
11
# TreeTools 2.4.0.9000 (development) #
22

3+
## Bug fixes
4+
5+
- `Consensus(trees, p)` now retains a split present in exactly a proportion `p`
6+
of trees (i.e. in `ceiling(p * length(trees))` trees) for `p > 0.5`, matching
7+
the documentation and `ape::consensus()`; previously such a split was dropped
8+
at exact thresholds (e.g. a split in 2 of 3 trees with `p = 2/3`). The
9+
majority threshold `p = 0.5` is unchanged (a split must occur in more than
10+
half the trees).
11+
312
## Performance
413

514
- Guarantee preorder return from `root_on_node()` to simplify `Consensus()`
615
internal pre-processing
16+
- `Consensus()` and `SplitFrequency()` defer materialising a split's bit pattern
17+
until it is needed, so splits that never reach the consensus threshold are no
18+
longer built. Identical results; up to ~13× faster for large trees (greatest
19+
gains for tall trees / many tips), with no change at small sizes.
20+
- `RenumberTips()` relabels an unlabelled `multiPhylo` or `list` of trees in a
21+
single C++ pass instead of a per-tree R loop, with a no-op fast path for trees
22+
already in the target order. Speeds up `Consensus()` and other callers when
23+
combining many trees; results are unchanged.
24+
- `Consensus()` no longer copies every input tree to strip branch lengths and
25+
node labels (the consensus core ignores both); it now coerces in place,
26+
trimming wrapper overhead (~25% faster on small forests of many short trees).
27+
Results are unchanged.
728

829
# TreeTools 2.4.0 (2026-06-02) #
930

R/Consensus.R

Lines changed: 13 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -7,15 +7,19 @@
77
#' against every other tree in linear time. The majority-rule and threshold
88
#' consensus (`0.5 <= p < 1`) instead count the frequency of every split across
99
#' all trees in a single pass and retain those occurring in a proportion `p` or
10-
#' more of trees; this runs in time linear in the number of trees, after
11-
#' \insertCite{Jansson2016}{TreeTools}. By default the count uses
12-
#' a 128-bit hash, whose results are exact with overwhelming probability; set
13-
#' `hash = FALSE` for a slower but guaranteed-exact count.
10+
#' more of trees (i.e. in at least `ceiling(p * length(trees))` trees); this
11+
#' runs in time linear in the number of trees, after
12+
#' \insertCite{Jansson2016}{TreeTools}. The majority threshold `p = 0.5` is
13+
#' strict: a split is retained only if it occurs in *more* than half the trees,
14+
#' so that two conflicting splits can never both be reported. By default the
15+
#' count uses a 128-bit hash, whose results are exact with overwhelming
16+
#' probability; set `hash = FALSE` for a slower but guaranteed-exact count.
1417
#'
1518
#' @param trees List of trees, optionally of class `multiPhylo`.
16-
#' @param p Proportion of trees that must contain a split for it to be reported
17-
#' in the consensus. `p = 0.5` gives the majority-rule consensus; `p = 1` (the
18-
#' default) gives the strict consensus.
19+
#' @param p A number from 0.5 to 1 giving the proportion of trees that must
20+
#' contain a split for it to be reported in the consensus: from `p = 0.5` (more
21+
#' than half the trees; the majority-rule consensus) to `p = 1` (every tree; the
22+
#' strict consensus, the default).
1923
#' @param check.labels Logical specifying whether to check that all trees have
2024
#' identical labels. Defaults to `TRUE`, which is slower.
2125
#' @param hash Logical; if `TRUE` (default), majority/threshold consensus
@@ -54,12 +58,8 @@ Consensus <- function(trees, p = 1, check.labels = TRUE, hash = TRUE) {
5458
stop("Expecting `trees` to be a list.")
5559
}
5660

57-
# Remove irrelevant metadata so we don't waste time processing it
58-
trees <- lapply(c(trees), function(tr) {
59-
tr[["edge.length"]] <- NULL
60-
tr[["node.label"]] <- NULL
61-
tr
62-
})
61+
# c() materialises a labelled multiPhylo's shared tip labels onto each tree.
62+
trees <- c(trees)
6363

6464
repeat {
6565
nTip <- NTip(trees)

R/RcppExports.R

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -101,6 +101,10 @@ renumber_tips_batch <- function(trees, perm, n_tip, new_labels) {
101101
.Call(`_TreeTools_renumber_tips_batch`, trees, perm, n_tip, new_labels)
102102
}
103103

104+
renumber_tips_to <- function(trees, target) {
105+
.Call(`_TreeTools_renumber_tips_to`, trees, target)
106+
}
107+
104108
cpp_edge_to_splits <- function(edge, order, nTip) {
105109
.Call(`_TreeTools_cpp_edge_to_splits`, edge, order, nTip)
106110
}

R/tree_numbering.R

Lines changed: 31 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -601,6 +601,25 @@ TntOrder.NULL <- function(tree) NULL
601601
#' @export
602602
RenumberTips <- function(tree, tipOrder) UseMethod("RenumberTips")
603603

604+
# Fast path shared by the unlabelled multiPhylo and list methods: relabel an
605+
# entire forest to a single target order in one C++ call, instead of dispatching
606+
# RenumberTips.phylo per tree in R. Returns the relabelled list, or NULL to
607+
# signal that the caller should fall back to the per-tree R path (the C++ helper
608+
# declines anything it would not handle identically: see renumber_tips_to()).
609+
.RenumberForestToC <- function(tree, tipOrder) {
610+
if (is.numeric(tipOrder)) {
611+
# Numeric tipOrder indexes each tree's OWN labels, so the target differs per
612+
# tree; only the R path handles that.
613+
return(NULL)
614+
}
615+
newOrder <- TipLabels(tipOrder, single = TRUE)
616+
if (anyDuplicated(newOrder)) {
617+
# Let RenumberTips.phylo raise its "repeated in tipOrder" error.
618+
return(NULL)
619+
}
620+
renumber_tips_to(tree, newOrder)
621+
}
622+
604623
#' @rdname RenumberTips
605624
#' @export
606625
RenumberTips.phylo <- function(tree, tipOrder) {
@@ -667,9 +686,16 @@ RenumberTips.multiPhylo <- function(tree, tipOrder) {
667686
labelled <- !is.null(at[["TipLabel"]])
668687

669688
if (!labelled) {
670-
# Unlabelled: each tree may have different tip orderings,
671-
# so must process individually
672-
tree <- lapply(tree, RenumberTips.phylo, tipOrder)
689+
# Unlabelled: each tree may have its own tip ordering. Relabel the whole
690+
# forest in one C++ pass; fall back to the per-tree R path for inputs the
691+
# fast path declines (numeric tipOrder, differing label sets, non-phylo
692+
# elements, ...).
693+
relabelled <- .RenumberForestToC(tree, tipOrder)
694+
tree <- if (is.null(relabelled)) {
695+
lapply(tree, RenumberTips.phylo, tipOrder)
696+
} else {
697+
relabelled
698+
}
673699
attributes(tree) <- at
674700
return(tree)
675701
}
@@ -726,7 +752,8 @@ RenumberTips.multiPhylo <- function(tree, tipOrder) {
726752
#' @rdname RenumberTips
727753
#' @export
728754
RenumberTips.list <- function(tree, tipOrder) {
729-
lapply(tree, RenumberTips, tipOrder)
755+
relabelled <- .RenumberForestToC(tree, tipOrder)
756+
if (is.null(relabelled)) lapply(tree, RenumberTips, tipOrder) else relabelled
730757
}
731758

732759
#' @rdname RenumberTips

dev/profiling/DECISION.md

Lines changed: 100 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,100 @@
1+
# Consensus: profiling outcome & the Jansson decision
2+
3+
**For MRS — read this first.** You asked me to implement Jansson's algorithm
4+
alongside the current O(kn) consensus, optimise both, and let data pick a winner.
5+
6+
## TL;DR
7+
8+
1. **I did NOT implement the full Jansson (Maj_Rule_Plus) algorithm.** The data
9+
say it cannot beat the (now-optimised) hashed counter in any realistic,
10+
time-meaningful regime, and it is high-risk to get right (the authors' own
11+
reference impl, FACT, ships **broken** majority code). The exact algorithm and
12+
what a future implementation would need are written down below so it can be
13+
built on request.
14+
2. **I optimised the existing hashed counter** (deferred split materialisation):
15+
**up to 13× faster** at high n (tall trees), median **1.23×**, with **zero
16+
change to results** (split sets identical to the deterministic `exact` path,
17+
verified at n up to 3000). This is the real, shipping win.
18+
3. **hashed stays the default; `exact` stays the deterministic fallback.** No
19+
crossover that warrants switching algorithms; one approach (hashed) wins
20+
across the board — so, per your "avoid redundant code" steer, no third path.
21+
22+
## Why not Jansson — the data
23+
24+
Jansson's deterministic O(kn) majority algorithm exists to **count cluster
25+
frequencies without hashing**, by matching each input tree against an evolving
26+
O(n)-cluster candidate tree (Day's algorithm + `One_Way_Compatible` +
27+
`Merge_Trees` + delete/insert). Its only advantage over the current code is
28+
*determinism* (no ~1e-30 hash-collision risk — which you've explicitly waived)
29+
and avoiding `exact`'s O(k·n·h) on tall trees.
30+
31+
**Measured lower bound (rigorous):** Jansson ≥ 2× the strict-consensus path
32+
(Phase 2 ≈ strict's k Day-matchings; Phase 1 ≥ Phase 2, adding
33+
One_Way_Compatible + Merge_Trees + per-iteration table rebuilds). Strict's inner
34+
loop is *tighter* than Jansson's, so this under-counts Jansson → the verdict is
35+
conservative. Where `2×strict ≥ hashed`, **Jansson provably loses**. Results vs
36+
the optimised hashed (`drivers/jansson-bound.R`):
37+
38+
| regime | concordance | 2×strict / hashed | verdict |
39+
|--------|-------------|-------------------|---------|
40+
| high-k/low-n (all) | any | 1.1–1.8 | **Jansson loses (proven)** |
41+
| high-k/high-n (2000,500) | concordant / moderate | 1.24 / 1.54 | **loses (proven)** |
42+
| high-k/high-n (1000,1000) | concordant / moderate | 1.32 / 1.42 | **loses (proven)** |
43+
| low-k/high-n (5000,10) | concordant / moderate | 0.99 / 1.09 | loses / borderline |
44+
| low-k/high-n (10000,5) | any | 0.45–1.07 | not proven — but <60 ms absolute |
45+
| high-k/high-n (≥2000,≥500) | **extreme** (rand/tall) | 0.44–0.56 | not proven |
46+
47+
**The only cells where Jansson is not proven to lose are** (a) sub-60 ms
48+
absolute times (low-k/very-high-n — a 2× gap is <30 ms, irrelevant), or
49+
(b) **extreme-conflict** input (≈ independent random trees), whose majority
50+
consensus is a near-empty star — not something anyone computes. Realistic
51+
consensus inputs (bootstrap / Bayesian / MPT sets, even conflicting gene-tree
52+
sets) are concordant-to-moderate, and there Jansson provably loses.
53+
54+
This is consistent with the authors' own C++ `Maj_Rule_Plus` timings (~10×
55+
slower than TreeTools' hashed at small n).
56+
57+
**Net:** building, proving, and CRAN-hardening `One_Way_Compatible` +
58+
`Merge_Trees` + a dynamic delete/insert tree — the exact machinery FACT got
59+
wrong — to maybe win an unrealistic corner by a few ms is a bad trade against
60+
"correctness paramount / avoid redundant code". **Re-openable**: if you want the
61+
deterministic-O(kn) guarantee regardless, the algorithm is
62+
Maj_Rule_Plus Phase 1 (Fig. 2 of arXiv:1307.7821) + a standard-majority Phase 2
63+
(keep `count > k·p` instead of `K(v) > Q(v)`); subroutine specs in
64+
PLAN-consensus.md. Say the word.
65+
66+
## The optimisation that shipped (C-001)
67+
68+
`count_splits_hashed` no longer materialises every distinct split's bit pattern
69+
eagerly. Each distinct split keeps a 12-byte witness `(tree, L, R)`; the packed
70+
pattern is rebuilt only for splits that reach the consensus threshold (or all,
71+
for `SplitFrequency`). At high n the wasted materialisation of non-surviving
72+
splits was the dominant cost. Verified speedups (min of reps,
73+
`drivers/compare-grids.R`): tall(10000,5) **×13.0**, tall(5000,10) ×8.6,
74+
rand(10000,5) ×2.5, high-k/high-n ×1.5–2.9, median ×1.23. **Results identical to
75+
the deterministic `exact` path in every cell.** Both rewired paths are gated at
76+
shipping scale: `correctness-gate.R` (590 checks) verifies consensus split SETS
77+
at n≤3000 and `SplitFrequency` split sets AND counts at n=2000/5000 (hashed ==
78+
exact); package `test-consensus.R` (8/8) and `test-Support.R` (6/6) pass;
79+
`verify-consensus.R` green.
80+
81+
## What else the profiling found (not yet actioned)
82+
83+
- **C-002 (open):** at **high-k/low-n** the R wrapper is **57%** of `Consensus()`
84+
wall time; `RenumberTips` is 54% of that. A safe fast-path (batch C++ relabel
85+
/ skip when labels already consistent) is the next throughput win in that
86+
regime. Touches shared code → needs the full test suite as a gate. Deferred to
87+
avoid shipping a risky shared-code change while you're away.
88+
- **C-003 (low priority):** hashed's `unordered_map` churn dominates the
89+
low-height extreme-conflict case; only matters for degenerate inputs.
90+
- **Threshold convention (FYI, unchanged):** for 0.5<p<1 TreeTools keeps
91+
`count > k·p`; ape keeps `>= k·p`; roxygen says "p or more". Flagging — your call.
92+
93+
## Correctness fixes made en route
94+
95+
- `dev/red-team/verify-consensus.R` was **dead**: it called
96+
`consensus_tree(..., hash=FALSE)` but the arg is `exact` → "unused argument".
97+
Fixed (`exact = TRUE`); now green (0 failures).
98+
- Built a method-pluggable gate (`correctness-gate.R`) — hashed==exact==ape@0.5,
99+
588 checks. (Found & fixed a `which.min(<character>)` bug in it that had been
100+
silently skipping every ape comparison.)

0 commit comments

Comments
 (0)