Header field names should all be lower-cased in Version model #1194

Open
Mr0grog opened this issue Feb 14, 2025 · 0 comments


Mr0grog commented Feb 14, 2025

Header field names are not case-sensitive in HTTP, so to make analysis of our data easier, we should just ensure we store them as lower-case in the Version model. Today, we’ve got some stored in mixed case, which is not great.

We should probably do this as a normalize or before_save hook. Then we’ll need to go through and update all the existing versions imported in the past few weeks (the older IA import script lower-cases everything as a consequence of using the Wayback package’s memento.headers dict, but our newer WARC import script did not do this until yesterday).
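As a rough illustration of the normalization step, here is a minimal sketch in plain Ruby. The helper name `normalize_header_names` is hypothetical, and so is the assumption that headers are stored as a hash on a `headers` attribute of the model. Joining values whose names collide after lower-casing is also just one possible policy, not something decided in this issue.

```ruby
# Hypothetical helper (not from the codebase): returns a copy of a headers
# hash with every field name lower-cased.
#
# If two names differ only by case (e.g. "Via" and "via"), their values are
# joined with ", ". That is an assumed policy for this sketch; some fields,
# such as Set-Cookie, may not be safe to combine this way.
def normalize_header_names(headers)
  # Leave nil or non-hash values untouched so a hook using this cannot fail
  # on records without headers.
  return headers unless headers.is_a?(Hash)

  headers.each_with_object({}) do |(name, value), result|
    key = name.to_s.downcase
    result[key] = result.key?(key) ? "#{result[key]}, #{value}" : value
  end
end

# In a Rails model this might be wired up roughly as follows (assumed
# attribute name, untested here):
#
#   before_save { self.headers = normalize_header_names(headers) }
```

The same helper could presumably be reused when backfilling the versions that were already imported with mixed-case names, so that new and existing records end up normalized by identical logic.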

Mr0grog added a commit to edgi-govdata-archiving/web-monitoring-task-sheets that referenced this issue Feb 14, 2025
We can't currently depend on them already being lower-case: edgi-govdata-archiving/web-monitoring-db#1194
But hopefully in the future!
@Mr0grog Mr0grog moved this to Inbox in Web Monitoring Feb 17, 2025
@Mr0grog Mr0grog moved this from Inbox to Prioritized in Web Monitoring Feb 17, 2025
Projects
Status: Prioritized