feat: move data normalization from Python consumers to Rust extractor (#290) by SimplicityGuy · Pull Request #294 · SimplicityGuy/discogsography

SimplicityGuy · 2026-04-12T00:23:04Z

Summary

Add normalize.rs module to the Rust extractor that transforms XML-shaped JSON (@id, #text, nested containers) into flat, consumer-ready JSON — normalization now runs once at extraction time instead of redundantly in every consumer
Simplify common/data_normalizer.py from ~475 lines to ~65 lines — only _parse_year_int() and a thin normalize_record() wrapper remain
Remove extract_format_names() from graphinator, replaced with inline list comprehension on the already-flat formats array
Pipeline ordering: parse → filters → rules → normalize → hash → publish — rules still operate on XML-shaped data, hash reflects normalized output
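
The stage order in the last item can be pictured with a small Python sketch. It is illustrative only and is not the extractor's Rust code: `run_pipeline` and its callable parameters are hypothetical names, parsing and publishing are left out, and only the order filters → rules → normalize → hash is taken from the summary.

```python
from typing import Any, Callable, Optional

Record = dict[str, Any]


def run_pipeline(
    record: Record,
    passes_filters: Callable[[Record], bool],
    evaluate_rules: Callable[[Record], None],
    normalize: Callable[[Record], Record],
    content_hash: Callable[[Record], str],
) -> Optional[tuple[Record, str]]:
    """Hypothetical driver for one already-parsed record."""
    if not passes_filters(record):
        return None  # dropped before rules or hashing run
    evaluate_rules(record)  # rules still see the XML-shaped record
    flat = normalize(record)  # flatten once, at extraction time
    return flat, content_hash(flat)  # hash covers what consumers receive
```

Passing the stages in as callables keeps the sketch independent of how the real filters, rules, and hash are implemented.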

Test plan

45 new Rust tests in normalize_tests.rs covering all entity types and edge cases
3,007 Python tests pass (including rewritten normalizer tests and updated consumer fixtures)
1,122 Rust tests pass
Ruff + mypy clean
Verify next extraction re-processes all records (expected: sha256 changes due to new JSON shape)
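
Why the last item expects every record to be re-processed can be shown with a short sketch. The hashing scheme here (SHA-256 over a sorted-key JSON encoding) and the example record are assumptions for illustration. The extractor's actual serialization may differ, but any content hash over the JSON will change when the shape changes.

```python
import hashlib
import json
from typing import Any


def content_hash(record: dict[str, Any]) -> str:
    """SHA-256 over a canonical JSON encoding (an assumed scheme)."""
    encoded = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


# The same logical record before and after normalization (hypothetical shapes).
xml_shaped = {"@id": "1", "genres": {"genre": ["Rock"]}}
flat = {"id": "1", "genres": ["Rock"]}

# Different JSON shape, so a different digest: a consumer comparing stored
# hashes would treat the record as changed.
hashes_differ = content_hash(xml_shaped) != content_hash(flat)
```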

Closes #290

🤖 Generated with Claude Code

Add normalize_release function handling artists, labels, master_id extraction, genres, styles, extraartists, and formats with @-prefix stripping. Includes full pipeline integration test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
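
As a rough illustration of the master_id extraction mentioned above, the sketch below assumes the XML-shaped record stores the value as an object with a `#text` key plus `@`-prefixed attributes (for example `@is_main_release`). Both that shape and the helper name `extract_master_id` are assumptions, not taken from normalize.rs.

```python
from typing import Any


def extract_master_id(value: Any) -> Any:
    """Return the bare master id from an XML-shaped or already-flat value."""
    if isinstance(value, dict):
        # Assumed XML shape: {"@is_main_release": "true", "#text": "123"}
        return value.get("#text")
    return value  # already a plain value (or None)


release = {"master_id": {"@is_main_release": "true", "#text": "123"}}
master_id = extract_master_id(release.get("master_id"))
```

This sketch drops the attribute; whether the real normalizer keeps it under another key is not stated in the commit message.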

… to Rust extractor

- Gut data_normalizer.py to only retain year parsing and normalize_record()
- Replace extract_format_names with inline list comprehension in graphinator
- Rewrite normalizer tests for simplified module
- Update all test fixtures to flat extractor output format

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
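
The inline list comprehension that replaces extract_format_names might look roughly like the sketch below. It assumes each entry in the flat `formats` array is an object with a `name` key; the wrapper function `format_names` exists only to make the sketch testable and is not from the PR.

```python
from typing import Any


def format_names(record: dict[str, Any]) -> list[str]:
    """Collect format names from an already-flat formats array."""
    return [
        fmt["name"]
        for fmt in record.get("formats", [])
        if isinstance(fmt, dict) and fmt.get("name")  # skip odd or unnamed items
    ]


release = {"formats": [{"name": "Vinyl", "qty": "1"}, {"name": "CD"}, {"qty": "2"}]}
names = format_names(release)
```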

…le path resolution Rules use dot-notation paths like "genres.genre" that match the XML structure. Moving normalize_record after evaluate_rules ensures rules operate on the pre-normalized shape while the content hash still reflects the normalized output consumers see. Also updates stale comment in tableinator. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
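
A minimal sketch of why the ordering matters for rule paths. `resolve_path` is a hypothetical helper, not the extractor's rule engine; it only shows that a dot-notation path such as "genres.genre" resolves against the XML-shaped record and finds nothing once the container has been flattened.

```python
from typing import Any


def resolve_path(record: Any, path: str) -> Any:
    """Walk a dot-notation path through nested dicts; None if any step is missing."""
    current = record
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


xml_shaped = {"genres": {"genre": ["Rock", "Pop"]}}
flat = {"genres": ["Rock", "Pop"]}

before = resolve_path(xml_shaped, "genres.genre")  # the list under the container
after = resolve_path(flat, "genres.genre")         # None: "genres" is now a list
```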

codecov · 2026-04-12T00:25:56Z

Codecov Report

❌ Patch coverage is 98.74214% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
extractor/src/normalize.rs	98.67%	2 Missing ⚠️

…ze.rs Covers: non-object inputs to normalizers, string/number items in normalize_item_list, bare string in unwrap_container, non-object format items. Raises line coverage from 93.4% to 97.7%. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-12T00:45:19Z

E2E Coverage (chromium)

Totals
Statements:	47.64% ( 1241 / 2605 )
Lines:	47.64% ( 1241 / 2605 )

github-actions · 2026-04-12T00:45:24Z

E2E Coverage (webkit)

Totals
Statements:	47.64% ( 1241 / 2605 )
Lines:	47.64% ( 1241 / 2605 )

github-actions · 2026-04-12T00:45:28Z

E2E Coverage (firefox)

Totals
Statements:	47.64% ( 1241 / 2605 )
Lines:	47.64% ( 1241 / 2605 )

github-actions · 2026-04-12T00:47:12Z

E2E Coverage (webkit - iPhone 15)

Totals
Statements:	47.64% ( 1241 / 2605 )
Lines:	47.64% ( 1241 / 2605 )

github-actions · 2026-04-12T00:47:54Z

E2E Coverage (webkit - iPad Pro 11)

Totals
Statements:	47.64% ( 1241 / 2605 )
Lines:	47.64% ( 1241 / 2605 )

SimplicityGuy and others added 11 commits April 11, 2026 16:11

docs: add design spec for moving normalizer logic to Rust extractor (#…

6471b56

…290) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

docs: add implementation plan for normalizer-to-extractor migration (#…

df4d1c1

…290) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(extractor): add normalize.rs with generic helpers

dfffad6

Add strip_at_prefixes, unwrap_container, and ensure_list functions for transforming XML-style JSON conventions into flat format. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
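
For readers more familiar with the Python consumers, the three helpers named in this commit could behave roughly like the Python analogues below. The exact semantics are assumptions inferred from the helper names and the commit message; the Rust functions operate on JSON values and may handle more cases (for example `#text`, which this sketch leaves alone).

```python
from typing import Any


def strip_at_prefixes(obj: dict[str, Any]) -> dict[str, Any]:
    """Rename XML-attribute keys such as "@id" to "id"; other keys pass through."""
    return {(k[1:] if k.startswith("@") else k): v for k, v in obj.items()}


def ensure_list(value: Any) -> list[Any]:
    """Treat None as empty and wrap a single item in a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def unwrap_container(container: Any, inner_key: str) -> list[Any]:
    """Flatten {"genre": [...]}-style wrappers into a plain list."""
    if isinstance(container, dict):
        return ensure_list(container.get(inner_key))
    return ensure_list(container)  # bare value or None
```

A single-child XML element typically serializes as a scalar rather than a one-element array, which is why `ensure_list` is needed at all.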

feat(extractor): add artist normalization to normalize.rs

d09d45c

Add normalize_item_list helper and normalize_artist function to flatten members, groups, and aliases from XML container format to flat arrays. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
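
A hedged Python sketch of what flattening a members-style container could look like. The input shape (a wrapper object whose `name` key holds items with `@id` and `#text`), the mapping of `#text` to `name`, and the handling of bare string items are all assumptions for illustration, not the behavior of the Rust normalize_item_list.

```python
from typing import Any


def normalize_item_list(container: Any, inner_key: str) -> list[dict[str, Any]]:
    """Flatten an XML-style container into a list of flat objects."""
    items = container.get(inner_key) if isinstance(container, dict) else container
    if items is None:
        items = []
    elif not isinstance(items, list):
        items = [items]  # a single child arrives as a scalar or object

    flat_items = []
    for item in items:
        if isinstance(item, dict):
            flat = {}
            for key, value in item.items():
                if key == "#text":
                    flat["name"] = value  # assumed mapping of element text
                elif key.startswith("@"):
                    flat[key[1:]] = value  # "@id" -> "id"
                else:
                    flat[key] = value
            flat_items.append(flat)
        else:
            flat_items.append({"name": str(item)})  # bare string or number
    return flat_items


artist = {"members": {"name": [{"@id": "2", "#text": "Bob"}, "Alice"]}}
members = normalize_item_list(artist.get("members"), "name")
```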

feat(extractor): add label normalization to normalize.rs

a81f466

Add normalize_label function to handle parentLabel strip_at_prefixes and sublabels container flattening. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(extractor): add master normalization to normalize.rs

4f3a446

Add normalize_string_list helper and normalize_master function to handle @id stripping, artists container, genres, and styles flattening. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

refactor(extractor): address code review — clarify doc, add edge case…

b913e2b

… tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat(extractor): wire normalize_record into validator pipeline

8b34f99

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

SimplicityGuy and others added 2 commits April 11, 2026 17:28

Merge branch 'main' into worktree-290-normalizer

838082a

SimplicityGuy merged commit accb054 into main Apr 12, 2026
57 checks passed

SimplicityGuy deleted the worktree-290-normalizer branch April 12, 2026 01:55

SimplicityGuy merged 13 commits into main from worktree-290-normalizer
