You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Round-4 reviewer surfaced 8 inline findings and SonarCloud independently
flagged 5 cognitive-complexity violations + 4 type-mismatch warnings + 2
hardcoded-credential warnings + 1 weak-hash warning on the same surface.
Three independent signals on the same code = signal, not noise. The
ingest_path / _audit_split refactors I previously declined twice are
applied this round.
Inline-comment findings (small)
- Finding 1+2 (data_audit.md / -tr.md L61-70): cross_split_overlap JSON
example used the OLD `leak_rate` schema — updated to the dual-direction
shape the code now emits (leaked_rows_in_<split>, leak_rate_<split>).
Both EN and TR mirrors fixed.
- Finding 3 (cli.py L238 + L302): `forgelm ingest --help` and the main
parser epilog mentioned PDF/DOCX/EPUB/TXT but not Markdown, even though
SUPPORTED_EXTENSIONS includes .md. Both strings now read
"PDF / DOCX / EPUB / TXT / Markdown".
- Finding 4 (ingestion.md L99-110): unfenced directory-tree code block
got `text` language tag for markdownlint MD040.
- Finding 5 (cli.py L1151): `except (FileNotFoundError, OSError)` →
`except OSError` (FileNotFoundError is a subclass; the catch was
redundant). Comment refreshed.
- Finding 6 (data_audit.py L497-545): _cross_split_overlap had a dead
`matched_b` accumulator + redundant set conversions. Extracted
`_count_leaked_rows` + `_pair_leak_payload` helpers; the orchestrator
is now a clean nested loop. Behavioural identity verified by tests.
- Finding 7 (ingest_path refactor): previously declined twice. SonarCloud
raised cog complexity 30 → 15 cap, third independent signal. Applied
this round — _process_one_file helper extracts the per-file
extract→chunk→mask→write block; ingest_path is now an aggregation
loop over _FileOutcome dataclasses. Behaviour identical.
- Finding 8 (ingestion.py L373): inline comment now states why
ImportError must propagate (missing extras → EXIT_TRAINING_ERROR via
the CLI wrapper, not a per-file skip).
SonarCloud cognitive-complexity refactors
- _audit_split (cog 31 → ~10): extracted four helpers — _compute_schema
(modal-keyset drift detection + non_object_rows count),
_compute_payloads (text + null/empty), _compute_top_languages,
_compute_fingerprints (with progress logging), _aggregate_pii.
- audit_dataset (cog 19 → ~10): extracted _process_split returning a
_SplitOutcome dataclass + _pii_summary_notes + _cross_split_leak_notes.
- _resolve_input (cog 21 → ~5): extracted _scan_canonical_split_files
+ _scan_pseudo_split_files + _resolve_directory_splits orchestrator.
- _cross_split_overlap (cog 19 → ~6): see Finding 6 above.
- generate_data_governance_report (cog 20 → ~5): extracted
_build_text_length_stats + _build_split_info + _governance_section
+ _maybe_inline_audit_report. Each helper owns one slice.
Each helper is private (underscore prefix) and individually unit-
testable; existing tests pass unchanged.
SonarCloud security/typing warnings
- Weak hash (data_audit.py L258): MD5 → BLAKE2b. Use is non-crypto
(simhash bit-mixing) — `usedforsecurity=False` was set but Sonar
doesn't read that flag. BLAKE2b is on the modern allowlist, faster
than SHA-256, and natively supports digest_size truncation (no slice
needed). Fingerprint collisions identical in distribution; existing
hamming-distance tests still pass.
- Type mismatch (test_data_audit L77/78/121/435): detect_pii / mask_pii
signatures changed from `str` → `Any`. The functions ALREADY guarded
`if not text or not isinstance(text, str): return ...` defensively
for arbitrary JSONL row payloads — the type signature now matches
reality, so callers passing None / int / list don't trip mypy /
Sonar typing checks. Test `# type: ignore[arg-type]` suppressions
removed (no longer needed).
- Hardcoded credentials (test_ingestion.py L274): renamed PDF fixture
passwords from "secret" / "owner" to "fx-user" / "fx-owner" + added
`# noqa: S105` comments + explanatory note that these are PDF
authoring inputs to PdfWriter.encrypt, not real credentials.
Verification
- ruff format/check: clean
- pytest tests/: 737 passed, 8 skipped (no regressions; refactors are
behaviour-preserving)
- End-to-end smoke: multi-split JSONL with parse error + non-dict row
+ cross-split leak produces a complete report with the new schema
on cross_split_overlap.pairs.<a>__<b>; sliding chunker confirmed to
emit no runt trailing chunks; mask_pii(int) defensively passes
through unchanged.
0 commit comments