Warcs have useful content too #858

Mr0grog · 2025-02-04T03:51:11Z

We now have some of our own crawls, and need to import data from the resulting WARCs. This script runs through a WARC file, uploads the bodies to our hash storage, and then creates an import job with the resulting "versions".

This is a little biased towards WARCs output by Browsertrix.crawler, because that's what we're using right now. It will probably need some adapting to work with other WARCs, if and when we encounter them (hopefully from IA soon!).

While I was at it, I updated the import code to use the “new” import format (from several years ago!) that we never migrated to. It makes things much clearer.

Many years ago, we made a bunch of changes to the versions table in the DB and the corresponding format of import records in order to make them clearer and more closely resemble a record of an HTTP response (since that is effectively what they wound up being). We never updated the corresponding import code here, though! That means we've had backwards-compatibility code sitting around in the DB import script for years. This finally updates the import code, and once it ships, we can also clean up the DB. This is extracted from #858, which I thought would land sooner.


I messed this up when I got record offsets. Turns out you have to read the body *first*, otherwise the stream is consumed and the body is lost (you can probably seek back on the file, but I have no idea what havoc that would wreak on the ArchiveIterator instance).
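To illustrate the ordering problem, here is a minimal sketch using a toy reader rather than the real WARC library. `ToyRecordReader` and both helper functions are hypothetical names invented for this example; the only assumption carried over from the note above is that finding a record's length means reading through to the end of the record, which leaves nothing for a later body read.

```python
import io


class ToyRecordReader:
    """Hypothetical stand-in for a streaming record reader (not a real library API)."""

    def __init__(self, stream, body_length):
        self._stream = stream
        self._start = stream.tell()
        self._remaining = body_length

    def read_body(self):
        # Returns whatever part of the record body has not been consumed yet.
        data = self._stream.read(self._remaining)
        self._remaining -= len(data)
        return data

    def record_length(self):
        # Finding the end of the record means draining the rest of it.
        self.read_body()
        return self._stream.tell() - self._start


def body_then_length(raw, body_length):
    """Correct order: read the body, then ask for the length."""
    reader = ToyRecordReader(io.BytesIO(raw), body_length)
    body = reader.read_body()
    return body, reader.record_length()


def length_then_body(raw, body_length):
    """Buggy order: the length lookup consumes the stream, so the body is lost."""
    reader = ToyRecordReader(io.BytesIO(raw), body_length)
    length = reader.record_length()
    return reader.read_body(), length
```

Both orders report the same length, but only the first one still has the body in hand afterwards.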

Switch to a simpler but slower two-pass approach: first index all the entries we might care about, then go through and put them together in redirect chains and read their bodies.
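A rough sketch of the two-pass idea, assuming each indexed entry is a plain dict with `url`, `status`, `headers` (lower-cased names), and `offset` keys. Those keys and the function names are made up for illustration and are not the script's actual data model. The second pass resolves relative `location` values to absolute URLs, falls back from http to https when a URL was not captured, and stops if a chain loops back on itself.

```python
from urllib.parse import urljoin


def index_responses(records):
    """Pass 1: map each captured URL to its first response entry."""
    index = {}
    for record in records:
        index.setdefault(record["url"], record)
    return index


def find_response(index, url):
    """Look up a URL, falling back from http to https if needed."""
    entry = index.get(url)
    if entry is None and url.startswith("http://"):
        entry = index.get("https://" + url[len("http://"):])
    return entry


def build_chain(index, seed_url):
    """Pass 2: follow redirects from a seed URL through the index."""
    chain = []
    seen = set()
    url = seed_url
    while url:
        entry = find_response(index, url)
        if entry is None or entry["url"] in seen:
            break  # seed or target was never captured, or the redirects loop
        seen.add(entry["url"])
        chain.append(entry)
        location = entry["headers"].get("location")
        if 300 <= entry["status"] < 400 and location:
            # Redirect targets may be relative, so resolve them first.
            url = urljoin(entry["url"], location)
        else:
            url = None
    return chain
```

Because the index is complete before any chain is built, it does not matter whether the redirect target appears before or after the redirect itself in the file. In a real importer, the second pass would then use each entry's stored offset to seek back into the WARC and read the body.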

Mr0grog force-pushed the warcs-have-useful-content-too branch from 93f5b38 to f4a9960 on February 9, 2025 21:45

Mr0grog force-pushed the warcs-have-useful-content-too branch from 5dc3c83 to dc9c039 on February 17, 2025 21:24

Mr0grog mentioned this pull request Feb 17, 2025

Update import options for versions #861

Merged

Mr0grog added 25 commits February 18, 2025 17:15

A little upload completion cleanup

60b5d01

Live upload should still show progress

1c74009

Fix dry run in S3 storage

d4d292d

Work around tqdm/partition_all incompatibility

845cf1d

Fix typo

002ecc9

Add a little debug info

0208693

Simplify and make logic more literal

ee6e738

Add more metadata about crawl names, record offsets

2b25922

Fix mistake calculating total

b835bbc

Add option to update existing DB records

970ad57

Add some TODO notes for the future

389b01d

Add graceful exit support

bc68f47

Ugly fix for out-of-order records in redirects

5f3a3df

More checks

fec405d

D'oh, forgot to seek

6dcc557

More fixes

b43b0ec

Switch to simpler but slower two-pass approach

9111cd5


Fix iteration for non-captured seeds

dc7d6d9

Remove debug logging

20e0690

Delint

23b762a

Ensure absolute URLs for redirect lookups

cefb7e2

Fall back from http to https, fix potential loop issue

cc98612

Minor cleanup

8e0dceb

Mr0grog added 6 commits February 18, 2025 17:15

Slim metadata down a bit

e1d5b58

Support multiple WARC files

1b098cb

Ensure header names are lower-cased

5268696

Add FIXME note about how we look for WARC records

03a4d20

Fix lint error

dc59a3d

Ensure Brotli support for WARC importing

22f2f59

Mr0grog force-pushed the warcs-have-useful-content-too branch from d6633ec to 22f2f59 on February 19, 2025 01:15

Mr0grog mentioned this pull request Feb 19, 2025

2025 Q1 Roadmap edgi-govdata-archiving/web-monitoring#174

Open

24 tasks
