Warcs have useful content too #858
Open: Mr0grog wants to merge 31 commits into main from warcs-have-useful-content-too
Mr0grog added a commit that referenced this pull request Feb 17, 2025
Many years ago, we made a bunch of changes to the versions table in the DB and to the corresponding format of import records, so that they would be clearer and more closely resemble a record of an HTTP response (since that is effectively what they wound up being). We never updated the corresponding import code here, though! That means we've had backwards-compatibility code sitting around in the DB import script for years. This finally updates things, and after shipping, we can also clean up the DB. This is extracted from #858, which I thought would land sooner.
Mr0grog added a commit that referenced this pull request Feb 18, 2025
Many years ago, we made a bunch of changes to the versions table in the DB and to the corresponding format of import records, so that they would be clearer and more closely resemble a record of an HTTP response (since that is effectively what they wound up being). We never updated the corresponding import code here, though! That means we've had backwards-compatibility code sitting around in the DB import script for years. This finally updates things, and after shipping, we can also clean up the DB. This is extracted from #858, which I thought would land sooner.
We now have some of our own crawls, and need to import data from the resulting WARCs. This script runs through a WARC file, uploads the bodies to our hash storage, and then creates an import job with the resulting "versions". This is a little biased towards WARCs output by Browsertrix.crawler, because that's what we're using right now. It will probably need some adapting to work with other WARCs, if and when we encounter them (hopefully from IA soon!).
I messed this up when I got record offsets. Turns out you have to read the body *first*; otherwise the stream is consumed and the body is lost. (You can probably seek back on the file, but I have no idea what havoc that would wreak on the ArchiveIterator instance.)
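The ordering problem can be illustrated with a toy model. This is only a sketch: `ForwardOnlyRecords`, its length-prefixed record format, and the helper names are hypothetical stand-ins, not warcio's `ArchiveIterator` or the WARC format. The one assumption it shares with the commit message is that asking for a record's length forces the reader to consume the rest of the record.

```python
import io
import struct


class ForwardOnlyRecords:
    """Toy forward-only reader (hypothetical, not warcio).

    Each record is a 4-byte big-endian length followed by the body bytes.
    """

    def __init__(self, stream):
        self.stream = stream
        self.offset = 0       # start of the current record
        self._remaining = 0   # unread body bytes in the current record

    def next_record(self):
        self.skip_rest()
        self.offset = self.stream.tell()
        header = self.stream.read(4)
        if len(header) < 4:
            return False
        (self._remaining,) = struct.unpack(">I", header)
        return True

    def read_body(self):
        body = self.stream.read(self._remaining)
        self._remaining = 0
        return body

    def skip_rest(self):
        # Discard whatever is left of the current body.
        self.stream.read(self._remaining)
        self._remaining = 0

    def record_length(self):
        # The full length is only known at the end of the record,
        # so this consumes any body bytes that have not been read yet.
        self.skip_rest()
        return self.stream.tell() - self.offset


def make_archive(bodies):
    """Build an in-memory archive in the toy format."""
    return io.BytesIO(b"".join(struct.pack(">I", len(b)) + b for b in bodies))


def index_records(stream):
    """Collect (offset, length, body) per record, reading the body first."""
    reader = ForwardOnlyRecords(stream)
    entries = []
    while reader.next_record():
        body = reader.read_body()        # 1. read the body first
        length = reader.record_length()  # 2. only then ask for the length
        entries.append((reader.offset, length, body))
    return entries
```

Calling `record_length()` before `read_body()` in this model returns the right length but leaves `read_body()` with nothing to return, which mirrors the lost-body bug described above.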
Index all the entries we might care about, then go through and put them together in chains and read their bodies.
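A rough sketch of that two-pass shape follows. It assumes that "chains" means redirect chains, which the commit message does not actually say, and every name here (`IndexEntry`, `build_chains`, `read_body`, the field names) is hypothetical rather than taken from the script in this PR.

```python
import io
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexEntry:
    """Lightweight index record: enough to find the body again later."""
    url: str
    status: int
    offset: int   # where the body starts in the archive
    length: int   # body length in bytes
    redirect_to: Optional[str] = None


def build_chains(entries):
    """Group entries into chains by following redirect targets (assumed meaning)."""
    by_url = {e.url: e for e in entries}
    targets = {e.redirect_to for e in entries if e.redirect_to}
    chains = []
    for entry in entries:
        if entry.url in targets:
            continue  # reached from another entry, so not the start of a chain
        chain, seen = [], set()
        current = entry
        while current is not None and current.url not in seen:
            chain.append(current)
            seen.add(current.url)
            current = by_url.get(current.redirect_to) if current.redirect_to else None
        chains.append(chain)
    return chains


def read_body(stream, entry):
    """Second pass: seek to an indexed entry and read only its body."""
    stream.seek(entry.offset)
    return stream.read(entry.length)
```

Keeping the first pass limited to small index entries means bodies are only read for the entries that end up in a chain. Note that this sketch silently drops pure redirect loops, since no entry in a loop qualifies as a chain start.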
We now have some of our own crawls, and need to import data from the resulting WARCs. This script runs through a WARC file, uploads the bodies to our hash storage, and then creates an import job with the resulting "versions".
This is a little biased towards WARCs output by Browsertrix.crawler, because that's what we're using right now. It will probably need some adapting to work with other WARCs, if and when we encounter them (hopefully from IA soon!).
While I was at it, I updated the import code to use the “new” import format (from several years ago!) that we had never switched to. It makes things much clearer.
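For readers unfamiliar with the flow described above (store each body under its hash, then emit one import record per response), here is a minimal sketch. It is not the script from this PR: `HashStorage` is an in-memory stand-in, SHA-256 is an assumed hash choice, and the record field names are hypothetical rather than the project's actual import format.

```python
import hashlib


class HashStorage:
    """In-memory stand-in for content-addressed ("hash") storage."""

    def __init__(self):
        self.blobs = {}

    def put(self, body: bytes) -> str:
        # The key is derived from the content, so identical bodies
        # are stored once no matter how many responses share them.
        digest = hashlib.sha256(body).hexdigest()
        self.blobs.setdefault(digest, body)
        return digest


def build_import_records(responses, storage):
    """Upload each body and return one HTTP-response-like record per capture."""
    records = []
    for response in responses:
        body_hash = storage.put(response["body"])
        records.append({
            "url": response["url"],                    # hypothetical field names
            "capture_time": response["capture_time"],
            "status": response["status"],
            "body_hash": body_hash,
        })
    return records
```

The resulting list is what would be handed to an import job; the bodies themselves travel separately through storage and are referenced only by hash.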