Commit ebfcd3d

fix(phase-11): apply round-4 review + SonarCloud cog-complexity refactors
The round-4 reviewer surfaced 8 inline findings, and SonarCloud independently flagged 5 cognitive-complexity violations, 4 type-mismatch warnings, 2 hardcoded-credential warnings, and 1 weak-hash warning on the same surface. Three independent signals on the same code are signal, not noise. The ingest_path / _audit_split refactors I previously declined twice are applied this round.

Inline-comment findings (small)

- Finding 1+2 (data_audit.md / data_audit-tr.md L61-70): the cross_split_overlap JSON example used the OLD `leak_rate` schema. Updated to the dual-direction shape the code now emits (leaked_rows_in_<split>, leak_rate_<split>). Both EN and TR mirrors fixed.
- Finding 3 (cli.py L238 + L302): `forgelm ingest --help` and the main parser epilog mentioned PDF/DOCX/EPUB/TXT but not Markdown, even though SUPPORTED_EXTENSIONS includes .md. The help string now reads "PDF / DOCX / EPUB / TXT / Markdown" and the epilog "PDF/DOCX/EPUB/TXT/Markdown".
- Finding 4 (ingestion.md L99-110): the unfenced directory-tree code block got a `text` language tag for markdownlint MD040.
- Finding 5 (cli.py L1151): `except (FileNotFoundError, OSError)` → `except OSError` (FileNotFoundError is a subclass, so the catch was redundant). Comment refreshed.
- Finding 6 (data_audit.py L497-545): _cross_split_overlap had a dead `matched_b` accumulator and redundant set conversions. Extracted the `_count_leaked_rows` and `_pair_leak_payload` helpers; the orchestrator is now a clean nested loop. Behavioural identity verified by tests.
- Finding 7 (ingest_path refactor): previously declined twice. SonarCloud flagged cognitive complexity 30 against a cap of 15, the third independent signal. Applied this round: the _process_one_file helper extracts the per-file extract→chunk→mask→write block, and ingest_path is now an aggregation loop over _FileOutcome dataclasses. Behaviour identical.
- Finding 8 (ingestion.py L373): the inline comment now states why ImportError must propagate (missing extras → EXIT_TRAINING_ERROR via the CLI wrapper, not a per-file skip).
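To make the per-direction counting in Finding 6 concrete, here is a minimal sketch of how a row counter over simhash fingerprints could work. The names `hamming` and `count_leaked_rows` and the brute-force comparison are illustrative assumptions; the commit's actual `_count_leaked_rows` implementation is not shown on this page and may differ.

```python
from typing import Iterable, List


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer fingerprints."""
    return bin(a ^ b).count("1")


def count_leaked_rows(source: Iterable[int], target: List[int], threshold: int = 3) -> int:
    """Count fingerprints in `source` that are near-duplicates of ANY row in `target`.

    Hypothetical helper: a brute-force O(len(source) * len(target)) scan,
    good enough to illustrate the idea but not tuned for large splits.
    """
    return sum(
        1
        for fp in source
        if any(hamming(fp, other) <= threshold for other in target)
    )


# Toy fingerprints: two train rows sit within 1 bit of the single test row.
train = [0b0000, 0b1111_0000, 0b0001]
test = [0b0000]

leaked_in_train = count_leaked_rows(train, test, threshold=1)
leaked_in_test = count_leaked_rows(test, train, threshold=1)
```

The two directions are counted separately because several rows on one side can match the same row on the other, so the counts need not be equal.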
SonarCloud cognitive-complexity refactors

- _audit_split (cog 31 → ~10): extracted five helpers: _compute_schema (modal-keyset drift detection + non_object_rows count), _compute_payloads (text + null/empty), _compute_top_languages, _compute_fingerprints (with progress logging), _aggregate_pii.
- audit_dataset (cog 19 → ~10): extracted _process_split returning a _SplitOutcome dataclass + _pii_summary_notes + _cross_split_leak_notes.
- _resolve_input (cog 21 → ~5): extracted _scan_canonical_split_files + _scan_pseudo_split_files + the _resolve_directory_splits orchestrator.
- _cross_split_overlap (cog 19 → ~6): see Finding 6 above.
- generate_data_governance_report (cog 20 → ~5): extracted _build_text_length_stats + _build_split_info + _governance_section + _maybe_inline_audit_report.

Each helper owns one slice, is private (underscore prefix), and is individually unit-testable; existing tests pass unchanged.

SonarCloud security/typing warnings

- Weak hash (data_audit.py L258): MD5 → BLAKE2b. The use is non-cryptographic (simhash bit-mixing); `usedforsecurity=False` was set, but Sonar doesn't read that flag. BLAKE2b is on the modern allowlist, faster than SHA-256, and natively supports digest_size truncation (no slice needed). Fingerprint collisions are identical in distribution; existing hamming-distance tests still pass.
- Type mismatch (test_data_audit L77/78/121/435): the detect_pii / mask_pii parameter type changed from `str` to `Any`. The functions ALREADY guarded with `if not text or not isinstance(text, str): return ...` to handle arbitrary JSONL row payloads, so the type signature now matches reality and callers passing None / int / list don't trip mypy / Sonar typing checks. The test `# type: ignore[arg-type]` suppressions were removed (no longer needed).
- Hardcoded credentials (test_ingestion.py L274): renamed the PDF fixture passwords from "secret" / "owner" to "fx-user" / "fx-owner", added `# noqa: S105` comments, and added an explanatory note that these are PDF authoring inputs to PdfWriter.encrypt, not real credentials.

Verification

- ruff format/check: clean
- pytest tests/: 737 passed, 8 skipped (no regressions; refactors are behaviour-preserving)
- End-to-end smoke: a multi-split JSONL with a parse error + non-dict row + cross-split leak produces a complete report with the new schema on cross_split_overlap.pairs.<a>__<b>; the sliding chunker is confirmed to emit no runt trailing chunks; mask_pii(int) defensively passes through unchanged.
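As an illustration of the MD5 → BLAKE2b note above, the sketch below contrasts the two shapes for deriving a 64-bit value for bit-mixing. The function names and the 8-byte width are assumptions for illustration; the actual code at data_audit.py L258 is not shown on this page.

```python
import hashlib


def token_hash64(token: str) -> int:
    """Hash a token to a 64-bit integer (hypothetical helper, non-crypto use)."""
    # digest_size=8 asks BLAKE2b for exactly 8 bytes up front,
    # so there is no longer digest to slice down afterwards.
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def token_hash64_md5(token: str) -> int:
    """The older shape: a 16-byte MD5 digest truncated by slicing."""
    # usedforsecurity=False (Python 3.9+) marks the call as non-cryptographic,
    # but static analysers may still flag any use of MD5.
    digest = hashlib.md5(token.encode("utf-8"), usedforsecurity=False).digest()
    return int.from_bytes(digest[:8], "big")
```

The two functions produce different values for the same token, so fingerprints computed before and after such a swap are not comparable with each other.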
1 parent 6121c9a commit ebfcd3d

9 files changed

Lines changed: 519 additions & 322 deletions

docs/guides/data_audit-tr.md

Lines changed: 12 additions & 3 deletions
@@ -63,14 +63,23 @@ GPU gerekmiyor. Ağ çağrısı yok. CPU-only.
   "cross_split_overlap": {
     "hamming_threshold": 3,
     "pairs": {
-      "train__test": {"leaked_rows_in_train": 7, "leak_rate": 0.0056}
+      "train__test": {
+        "leaked_rows_in_train": 7,
+        "leak_rate_train": 0.0056,
+        "leaked_rows_in_test": 7,
+        "leak_rate_test": 0.7
+      }
     }
   }
 }
 ```

-Train ile test arasında sıfır olmayan leak rate **benchmark
-güvenirliğinin sessiz katilidir** — eğitim öncesi split'leri düzeltin.
+Audit leak rate'i **her iki yönde** de raporlar çünkü birbirinden farklı
+hikâyeler anlatırlar. 1240 train + 10 test satırında 7'sinin sızdığı bir
+durumda `leak_rate_train = 7/1240 = %0.56` önemsiz görünür ama
+`leak_rate_test = 7/10 = %70` benchmark güvenirliğini fiilen yok eden
+metriktir. Her zaman küçük tarafın oranını okuyun — test bütünlüğünün
+sessiz katili odur.

 ### PII özeti

docs/guides/data_audit.md

Lines changed: 12 additions & 3 deletions
@@ -63,14 +63,23 @@ No GPU required. No network calls. CPU-only.
   "cross_split_overlap": {
     "hamming_threshold": 3,
     "pairs": {
-      "train__test": {"leaked_rows_in_train": 7, "leak_rate": 0.0056}
+      "train__test": {
+        "leaked_rows_in_train": 7,
+        "leak_rate_train": 0.0056,
+        "leaked_rows_in_test": 7,
+        "leak_rate_test": 0.7
+      }
     }
   }
 }
 ```

-A non-zero leak rate between train and test is a **silent killer of
-benchmark fidelity** — fix the splits before training.
+The audit reports leak rate **in both directions** because they tell
+different stories. With 1240 train rows and 10 test rows where 7 leak,
+`leak_rate_train = 7/1240 = 0.56%` looks negligible but
+`leak_rate_test = 7/10 = 70%` is the metric that actually destroys
+benchmark fidelity. Always read the smaller-side rate — that is the
+silent killer of test integrity.

 ### PII summary
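The arithmetic behind the dual-direction example in the docs hunk above can be reproduced with a short sketch. The function name `pair_leak_payload` and the rounding to four decimals are assumptions chosen to match the documented JSON; only the key names (`leaked_rows_in_<split>`, `leak_rate_<split>`) come from the commit.

```python
from typing import Dict, Union

Number = Union[int, float]


def pair_leak_payload(
    name_a: str, leaked_a: int, total_a: int,
    name_b: str, leaked_b: int, total_b: int,
) -> Dict[str, Number]:
    """Build a per-pair payload reporting the leak rate in both directions.

    Hypothetical helper: each rate is leaked rows divided by that split's
    own row count, so a small split can show a large rate.
    """
    def rate(leaked: int, total: int) -> float:
        # Guard against empty splits instead of dividing by zero.
        return round(leaked / total, 4) if total else 0.0

    return {
        f"leaked_rows_in_{name_a}": leaked_a,
        f"leak_rate_{name_a}": rate(leaked_a, total_a),
        f"leaked_rows_in_{name_b}": leaked_b,
        f"leak_rate_{name_b}": rate(leaked_b, total_b),
    }


# The documented scenario: 1240 train rows, 10 test rows, 7 leaked on each side.
payload = pair_leak_payload("train", 7, 1240, "test", 7, 10)
```

With these inputs the train-side rate rounds to 0.0056 while the test-side rate is 0.7, which is why reading only one direction can hide a badly contaminated test split.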

docs/guides/ingestion.md

Lines changed: 1 addition & 1 deletion
@@ -96,7 +96,7 @@ false positives are intentional. Audit your output afterwards with

 ## Recursive directory walk

-```
+```text
 ./policies/
 ├── 2024_q1.pdf
 ├── 2024_q2.pdf

forgelm/cli.py

Lines changed: 7 additions & 6 deletions
@@ -235,7 +235,7 @@ def _add_quickstart_subcommand(subparsers) -> None:
 def _add_ingest_subcommand(subparsers) -> None:
     p = subparsers.add_parser(
         "ingest",
-        help="Convert raw documents (PDF / DOCX / EPUB / TXT) into SFT-ready JSONL.",
+        help="Convert raw documents (PDF / DOCX / EPUB / TXT / Markdown) into SFT-ready JSONL.",
         description=(
             "Walk a file or directory tree, extract text per format, chunk with the "
             'selected strategy, and emit a {"text": ...} JSONL the trainer accepts. '
@@ -299,7 +299,7 @@ def parse_args():
         epilog=(
             "Subcommands:\n"
             " forgelm quickstart [TEMPLATE] Generate a config from a curated template\n"
-            " forgelm ingest PATH Convert raw docs (PDF/DOCX/EPUB/TXT) → JSONL\n"
+            " forgelm ingest PATH Convert raw docs (PDF/DOCX/EPUB/TXT/Markdown) → JSONL\n"
             " forgelm chat MODEL_PATH Interactive chat REPL\n"
             " forgelm export MODEL_PATH Export model to GGUF\n"
             " forgelm deploy MODEL_PATH Generate serving config\n"
@@ -1148,10 +1148,11 @@ def _run_data_audit(audit_input: str, output_dir: Optional[str], output_format:
     target = output_dir or "./audit"
     try:
         report = audit_dataset(audit_input, output_dir=target)
-    except (FileNotFoundError, OSError) as exc:
-        # OSError covers PermissionError / ENOSPC / IsADirectoryError that
-        # bubble up from _resolve_input or _read_jsonl_split when the target
-        # is unreachable BEFORE the per-split tolerance loop kicks in.
+    except OSError as exc:
+        # OSError covers FileNotFoundError / PermissionError / ENOSPC /
+        # IsADirectoryError that bubble up from _resolve_input or
+        # _read_jsonl_split when the target is unreachable BEFORE the
+        # per-split tolerance loop kicks in.
         if output_format == "json":
             print(json.dumps({"success": False, "error": str(exc)}))
         else:
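The reason `except OSError` alone is sufficient in the hunk above can be checked directly against Python's exception hierarchy. This is a small standalone sketch; `read_or_report` is a hypothetical stand-in and not the `_run_data_audit` function from the diff.

```python
import os
import tempfile


def read_or_report(path: str) -> dict:
    """Open a file and report OS-level failures as a structured result."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return {"success": True, "chars": len(fh.read())}
    except OSError as exc:
        # FileNotFoundError, PermissionError and IsADirectoryError are all
        # OSError subclasses, so this single clause catches each of them.
        return {"success": False, "error_type": type(exc).__name__}


# A path inside a fresh temp directory that is guaranteed not to exist.
missing = os.path.join(tempfile.mkdtemp(), "missing.jsonl")
result = read_or_report(missing)
```

Listing `FileNotFoundError` next to `OSError` in the same `except` tuple adds nothing, because the parent class already matches every subclass instance.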

forgelm/compliance.py

Lines changed: 86 additions & 67 deletions
@@ -84,6 +84,84 @@ def log_event(self, event: str, **details) -> None:
 # ---------------------------------------------------------------------------


+def _build_text_length_stats(split_data: Any, split_name: str) -> Optional[Dict[str, Any]]:
+    """Compute min/max/mean/median/p95 of the ``text`` column, if present."""
+    if not (hasattr(split_data, "column_names") and "text" in split_data.column_names):
+        return None
+    try:
+        texts = split_data["text"]
+        lengths = sorted(len(t) for t in texts if isinstance(t, str))
+    except Exception as exc:
+        logger.debug("Could not compute text stats for %s: %s", split_name, exc)
+        return None
+    if not lengths:
+        return None
+    return {
+        "min": lengths[0],
+        "max": lengths[-1],
+        "mean": round(sum(lengths) / len(lengths), 1),
+        "median": lengths[len(lengths) // 2],
+        "p95": lengths[int(len(lengths) * 0.95)],
+    }
+
+
+def _build_split_info(split_name: str, split_data: Any) -> Dict[str, Any]:
+    """Per-split sample count + column schema + length distribution."""
+    info: Dict[str, Any] = {"sample_count": len(split_data)}
+    if hasattr(split_data, "column_names"):
+        info["columns"] = split_data.column_names
+    text_length = _build_text_length_stats(split_data, split_name)
+    if text_length:
+        info["text_length"] = text_length
+    return info
+
+
+def _governance_section(config: Any) -> Optional[Dict[str, Any]]:
+    """Return the operator-supplied Article 10 metadata block, if any."""
+    gov_cfg = getattr(config.data, "governance", None)
+    if not gov_cfg:
+        return None
+    return {
+        "collection_method": gov_cfg.collection_method,
+        "annotation_process": gov_cfg.annotation_process,
+        "known_biases": gov_cfg.known_biases,
+        "personal_data_included": gov_cfg.personal_data_included,
+        "dpia_completed": gov_cfg.dpia_completed,
+    }
+
+
+def _maybe_inline_audit_report(config: Any) -> Optional[Dict[str, Any]]:
+    """Read ``data_audit_report.json`` from ``training.output_dir`` if it's there.
+
+    Loud-but-non-fatal hint when the file is missing: the audit CLI
+    defaults to ``./audit/`` whereas the trainer's output_dir is
+    typically ``./checkpoints/`` — without explicit alignment the
+    inlining silently no-ops and the governance bundle ships without
+    the Article 10 data-quality section.
+    """
+    output_dir = getattr(getattr(config, "training", None), "output_dir", None)
+    if not output_dir:
+        return None
+    audit_path = os.path.join(output_dir, "data_audit_report.json")
+    if not os.path.isfile(audit_path):
+        logger.info(
+            "No data_audit_report.json at %s — governance report will lack the "
+            "Article 10 data-quality section. Run "
+            "`forgelm --data-audit <dataset> --output %s` before training to populate it.",
+            audit_path,
+            output_dir,
+        )
+        return None
+    try:
+        with open(audit_path, "r", encoding="utf-8") as fh:
+            return json.load(fh)
+    except (json.JSONDecodeError, OSError, UnicodeDecodeError) as exc:
+        # Audit JSON is best-effort enrichment — corrupt UTF-8 or a
+        # malformed file must not abort governance report generation.
+        logger.warning("Could not inline data_audit_report.json (%s): %s", audit_path, exc)
+        return None
+
+
 def generate_data_governance_report(config: Any, dataset: Dict[str, Any]) -> Dict[str, Any]:
     """Generate data quality and governance report per EU AI Act Article 10.

@@ -92,78 +170,19 @@ def generate_data_governance_report(config: Any, dataset: Dict[str, Any]) -> Dic
     its findings are inlined under the ``data_audit`` key so the governance
     artifact is a single self-contained document rather than a pointer.
     """
-    report = {
+    report: Dict[str, Any] = {
         "generated_at": datetime.now(timezone.utc).isoformat(),
         "primary_dataset": config.data.dataset_name_or_path,
-        "splits": {},
+        "splits": {name: _build_split_info(name, data) for name, data in dataset.items()},
     }

-    # Per-split statistics
-    for split_name, split_data in dataset.items():
-        split_info = {"sample_count": len(split_data)}
-
-        # Column schema
-        if hasattr(split_data, "column_names"):
-            split_info["columns"] = split_data.column_names
-
-        # Text length statistics (if "text" column exists)
-        if hasattr(split_data, "column_names") and "text" in split_data.column_names:
-            try:
-                texts = split_data["text"]
-                lengths = [len(t) for t in texts if isinstance(t, str)]
-                if lengths:
-                    lengths.sort()
-                    split_info["text_length"] = {
-                        "min": lengths[0],
-                        "max": lengths[-1],
-                        "mean": round(sum(lengths) / len(lengths), 1),
-                        "median": lengths[len(lengths) // 2],
-                        "p95": lengths[int(len(lengths) * 0.95)],
-                    }
-            except Exception as e:
-                logger.debug("Could not compute text stats for %s: %s", split_name, e)
-
-        report["splits"][split_name] = split_info
-
-    # Governance metadata from config
-    gov_cfg = getattr(config.data, "governance", None)
-    if gov_cfg:
-        report["governance"] = {
-            "collection_method": gov_cfg.collection_method,
-            "annotation_process": gov_cfg.annotation_process,
-            "known_biases": gov_cfg.known_biases,
-            "personal_data_included": gov_cfg.personal_data_included,
-            "dpia_completed": gov_cfg.dpia_completed,
-        }
+    governance = _governance_section(config)
+    if governance:
+        report["governance"] = governance

-    # Phase 11 Article 10 enrichment: if a `data_audit_report.json` exists in
-    # the trainer's output_dir, inline it. Operators usually run the audit
-    # before training; co-locating the result keeps the governance bundle
-    # self-contained rather than a pointer to a separate file.
-    output_dir = getattr(getattr(config, "training", None), "output_dir", None)
-    if output_dir:
-        audit_path = os.path.join(output_dir, "data_audit_report.json")
-        if os.path.isfile(audit_path):
-            try:
-                with open(audit_path, "r", encoding="utf-8") as fh:
-                    report["data_audit"] = json.load(fh)
-            except (json.JSONDecodeError, OSError, UnicodeDecodeError) as exc:
-                # Audit JSON is best-effort enrichment — corrupt UTF-8 or a
-                # malformed file must not abort governance report generation.
-                logger.warning("Could not inline data_audit_report.json (%s): %s", audit_path, exc)
-        else:
-            # Loud-but-non-fatal hint: the audit CLI defaults to `./audit/`
-            # whereas the trainer's output_dir is typically `./checkpoints/`
-            # — without explicit alignment the inlining silently no-ops and
-            # the governance bundle ships without the Article 10 data
-            # quality section. Tell the operator how to fix it.
-            logger.info(
-                "No data_audit_report.json at %s — governance report will lack the "
-                "Article 10 data-quality section. Run "
-                "`forgelm --data-audit <dataset> --output %s` before training to populate it.",
-                audit_path,
-                output_dir,
-            )
+    audit = _maybe_inline_audit_report(config)
+    if audit is not None:
+        report["data_audit"] = audit

     return report
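To show what the index-based statistics in `_build_text_length_stats` yield on a small input, here is a standalone sketch that mirrors the same expressions on a plain list. The function name `length_stats` and the sample data are illustrative assumptions; the real helper operates on a dataset split with a `text` column.

```python
from typing import Dict, List, Optional, Union


def length_stats(texts: List[object]) -> Optional[Dict[str, Union[int, float]]]:
    """Mirror of the min/max/mean/median/p95 computation, on a plain list."""
    # Non-string entries (None, ints, ...) are skipped rather than raising.
    lengths = sorted(len(t) for t in texts if isinstance(t, str))
    if not lengths:
        return None
    return {
        "min": lengths[0],
        "max": lengths[-1],
        "mean": round(sum(lengths) / len(lengths), 1),
        # Index-based median: for an even count this is the upper middle value.
        "median": lengths[len(lengths) // 2],
        # int() truncates, so the index is always below len(lengths).
        "p95": lengths[int(len(lengths) * 0.95)],
    }


# Four strings of length 1, 3, 2, 4 plus two non-string rows that are ignored.
stats = length_stats(["a", "bbb", "cc", None, 42, "dddd"])
```

On this input the sorted lengths are `[1, 2, 3, 4]`, so the median index picks 3 rather than the interpolated 2.5. That is a nearest-rank style choice and is fine for a summary report.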
