Refactor entity matching name cleaner to be more efficient #3953

katie-lamb · 2024-11-08T23:05:15Z

Overview

As part of the SEC to EIA record linkage development, I had to make some changes to the PUDL company name cleaning module to make it more efficient and useful. The code for this module was originally pulled from OS Climate's repo, but it was no longer maintained there. I didn't make significant changes when I pulled out that module, so it still had some quirks and inefficiencies.

What problem does this address?

The name cleaner was very slow. It's still pretty slow on big datasets, but it is now about 3x faster than before.

What did you change?

Replaced the apply call that ran the regex replacement rules with pd.Series.replace, so that the replacement is vectorized.
Removed some coupling in the cleaning rules and restructured the CompanyNameCleaner class.
Updated some of the regex rules to make them more effective.
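
For reviewers, here is a minimal sketch of the first change listed above. It is not the code in this PR: the rules in RULES and the function names clean_with_apply and clean_with_replace are hypothetical, chosen only to show the difference between calling re.sub per row through apply and calling pd.Series.replace with regex=True on the whole column.

```python
import re

import pandas as pd

# Hypothetical rules for illustration only; not the actual PUDL rule set.
RULES = {
    r"[^\w\s]": " ",  # turn punctuation into a space
    r"\s+": " ",  # collapse runs of whitespace
}


def clean_with_apply(names: pd.Series) -> pd.Series:
    """Old style: run every rule on each value inside a Python-level apply."""

    def _clean_one(name: str) -> str:
        for pattern, repl in RULES.items():
            name = re.sub(pattern, repl, name)
        return name.strip().lower()

    return names.apply(_clean_one)


def clean_with_replace(names: pd.Series) -> pd.Series:
    """New style: run each rule once over the whole Series."""
    cleaned = names
    # Apply the rules one at a time so their order is explicit.
    for pattern, repl in RULES.items():
        cleaned = cleaned.replace(pattern, repl, regex=True)
    return cleaned.str.strip().str.lower()


names = pd.Series(["Acme Power, Inc.", "  Big   River Electric Co-op "])

# Both approaches should agree on the cleaned output.
assert clean_with_apply(names).equals(clean_with_replace(names))
```

Looping over the rules keeps the order of application explicit, which matters when one rule depends on the output of another (here, whitespace is collapsed only after punctuation has been replaced).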

Documentation

Make sure to update relevant aspects of the documentation.

Tasks

Update the release notes: reference the PR and related issues.
Update relevant Data Source jinja templates (see docs/data_sources/templates).
Update relevant table or source description metadata (see src/metadata).
Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list

If updating analyses or data processing functions: make sure to update or write data validation tests (e.g. test_minmax_rows()).
Run make pytest-coverage locally to ensure that the merge queue will accept your PR.
Review the PR yourself and call out any questions or issues you have.
For minor ETL changes or data additions, once make pytest-coverage passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and run relevant data validation tests using pytest and --live-dbs.
For bigger ETL or data changes run the full ETL locally and then run the data validations using make pytest-validate.
Alternatively, run the build-deploy-pudl GitHub Action manually.

katie-lamb added 2 commits November 8, 2024 14:59

refactor name cleaner

2e9fd96

fix up

e88d6d8
