Warcs have useful content too #858

Open. Mr0grog wants to merge 31 commits into main from warcs-have-useful-content-too.

Mr0grog (Member) commented Feb 4, 2025

We now have some of our own crawls and need to import data from the resulting WARCs. This script runs through a WARC file, uploads the bodies to our hash storage, and then creates an import job with the resulting "versions".

This is a little biased towards WARCs output by Browsertrix Crawler, because that's what we're using right now. It will probably need some adapting to work with other WARCs, if and when we encounter them (hopefully from IA soon!).

While I was at it, I updated the import code to use the “new” import format (from several years ago!) that we never updated to before. It makes things much clearer.
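The flow described above (store each body under its hash, then emit one "version" record per response) can be sketched roughly as follows. This is a minimal illustration, not the script in this PR: the in-memory `storage` dict, the choice of SHA-256, the newline-delimited JSON payload, and every field name in the record are assumptions made for the example.

```python
import hashlib
import json


def store_body(storage: dict, body: bytes) -> str:
    """Save a body under its SHA-256 hex digest and return the digest."""
    digest = hashlib.sha256(body).hexdigest()
    # Content-addressed storage: identical bodies share one key,
    # so storing the same body twice does not create a second copy.
    storage.setdefault(digest, body)
    return digest


def build_version(storage: dict, url: str, capture_time: str, status: int,
                  headers: dict, body: bytes) -> dict:
    """Build one record shaped like a record of an HTTP response.

    The field names here are hypothetical, not the project's import format.
    """
    return {
        "url": url,
        "capture_time": capture_time,
        "status": status,
        "headers": headers,
        "body_hash": store_body(storage, body),
        "content_length": len(body),
    }


def build_import_payload(versions: list) -> str:
    """Serialize versions as newline-delimited JSON, one record per line."""
    return "\n".join(json.dumps(v, sort_keys=True) for v in versions)
```

Because the storage is keyed by hash, two captures with the same body produce two version records but only one stored body.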

Mr0grog force-pushed the warcs-have-useful-content-too branch from 93f5b38 to f4a9960 on February 9, 2025 21:45
Mr0grog force-pushed the warcs-have-useful-content-too branch from 5dc3c83 to dc9c039 on February 17, 2025 21:24
Mr0grog added a commit that referenced this pull request Feb 17, 2025
Many years ago, we made a bunch of changes to the versions table in the DB and the corresponding format of import records in order to make them clearer and more closely resemble a record of an HTTP response (since that is effectively what they wound up being). We never updated the corresponding import code here, though! That means we've had backwards-compatibility code sitting around in the DB import script for years. This finally updates things, and after shipping, we can also clean up the DB.

This is extracted from #858, which I thought would land sooner.
Mr0grog added a commit that referenced this pull request Feb 18, 2025
Same commit message as the commit referenced on Feb 17, 2025.
I messed this up when I got record offsets. Turns out you have to read the body *first*; otherwise the stream is consumed and the body is lost (you can probably seek back in the file, but I have no idea what havoc that would wreak on the ArchiveIterator instance).
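The ordering problem in that commit message can be reproduced with a toy forward-only reader. This sketch does not use the real ArchiveIterator; `OnePassRecord` and both helper functions are hypothetical stand-ins, built on the assumption that finding where a record ends means skipping over whatever part of the body has not been read yet.

```python
import io


class OnePassRecord:
    """Hypothetical record whose body lives on a shared, forward-only stream."""

    def __init__(self, stream: io.BytesIO, length: int):
        self._stream = stream
        self._remaining = length

    def read_body(self) -> bytes:
        """Return whatever part of the body has not been consumed yet."""
        data = self._stream.read(self._remaining)
        self._remaining -= len(data)
        return data

    def end_offset(self) -> int:
        """Return the offset just past this record.

        Finding the end means skipping the rest of the record,
        which throws away any body bytes that were not read first.
        """
        self._stream.read(self._remaining)
        self._remaining = 0
        return self._stream.tell()


def offset_then_body(raw: bytes, length: int):
    """Wrong order: the body is already consumed when we ask for it."""
    record = OnePassRecord(io.BytesIO(raw), length)
    end = record.end_offset()
    return end, record.read_body()


def body_then_offset(raw: bytes, length: int):
    """Right order: read the body first, then ask for the offset."""
    record = OnePassRecord(io.BytesIO(raw), length)
    body = record.read_body()
    return record.end_offset(), body
```

Both orders report the same offset, but only `body_then_offset` returns the body; `offset_then_body` returns empty bytes.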
Index all the entries we might care about, then go through and put them together in chains and read their bodies.
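A two-pass approach like this might look roughly like the sketch below. It assumes that "chains" means redirect chains, which the commit message does not say outright, and the `Entry` tuple, its fields, and both functions are hypothetical names invented for the example. Reading bodies would happen after the chains are built, using the stored offsets.

```python
from collections import namedtuple

# Hypothetical index entry: just enough metadata to link records together
# and to find the record again later when it is time to read its body.
Entry = namedtuple("Entry", ["url", "status", "location", "offset"])

REDIRECT_STATUSES = (301, 302, 303, 307, 308)


def index_entries(entries):
    """First pass: map each URL to the first entry seen for it."""
    index = {}
    for entry in entries:
        index.setdefault(entry.url, entry)
    return index


def build_chain(index, start_url, max_hops=10):
    """Second pass: follow redirects from start_url through the index."""
    chain = []
    seen = set()
    url = start_url
    # Stop on unknown URLs, on redirect loops, or after max_hops entries.
    while url in index and url not in seen and len(chain) < max_hops:
        entry = index[url]
        chain.append(entry)
        seen.add(url)
        if entry.status not in REDIRECT_STATUSES or not entry.location:
            break
        url = entry.location
    return chain
```

Indexing first keeps the pass over the file simple, and the chain-building step then works on small in-memory entries instead of on the stream itself.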