Refactor entity matching name cleaner to be more efficient #3953

Draft
wants to merge 2 commits into base: main


katie-lamb
Member

Overview

As part of the SEC to EIA record linkage development, I had to make some changes to the PUDL company name cleaning module to make it more efficient and useful. The code for this module was originally pulled from OS Climate's repo, but it was no longer maintained there. I didn't make significant changes when I pulled out that module, so it kept some quirks and inefficiencies.

What problem does this address?

  • The name cleaner was very slow. It's still pretty slow on big datasets, but it is now about 3x faster than before.

What did you change?

  • Instead of using apply to run the regex replacement rules, I used pd.Series.replace so that the replacement is vectorized.
  • Removed some coupling in the cleaning rules and restructured the CompanyNameCleaner class
  • Updated some of the regex rules to make them more effective
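
A minimal sketch of the apply-versus-pd.Series.replace change listed above. The rules, sample names, and function names here are hypothetical illustrations, not the actual PUDL cleaning rules or the CompanyNameCleaner API:

```python
import re

import pandas as pd

# Hypothetical rules (regex pattern -> replacement), applied in order.
# These are NOT the real PUDL rules; they only illustrate the approach.
RULES = {
    r"[^\w\s]": " ",             # replace punctuation with a space
    r"\b(inc|corp|llc)\b": "",   # drop a few legal suffixes
    r"\s+": " ",                 # collapse runs of whitespace
}

names = pd.Series(["Acme, Inc.", "Duke  Energy Corp", "Foo-Bar LLC"])


def _clean_one(name: str) -> str:
    """Apply every rule to a single string with re.sub."""
    for pattern, repl in RULES.items():
        name = re.sub(pattern, repl, name)
    return name.strip()


def clean_with_apply(s: pd.Series) -> pd.Series:
    """Row-by-row approach: call a Python function on each value."""
    return s.str.lower().apply(_clean_one)


def clean_with_replace(s: pd.Series) -> pd.Series:
    """Column-wise approach: one pd.Series.replace call per rule."""
    out = s.str.lower()
    for pattern, repl in RULES.items():
        out = out.replace(pattern, repl, regex=True)
    return out.str.strip()


print(clean_with_replace(names).tolist())
```

Both functions should return the same cleaned names; the second makes one call per rule over the whole Series instead of one Python call per row. Any speedup will depend on the data and the rule set.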

Documentation

Make sure to update relevant aspects of the documentation.

Tasks

Testing

How did you make sure this worked? How can a reviewer verify this?

To-do list
