Warcs have useful content too #858

Open
wants to merge 31 commits into
base: main
31 commits
3afcad7
Add WARC import script
Mr0grog Feb 4, 2025
60b5d01
A little upload completion cleanup
Mr0grog Feb 4, 2025
1c74009
Live upload should still show progress
Mr0grog Feb 4, 2025
d4d292d
Fix dry run in S3 storage
Mr0grog Feb 4, 2025
845cf1d
Work around tqdm/partition_all incompatibility
Mr0grog Feb 4, 2025
002ecc9
Fix typo
Mr0grog Feb 4, 2025
0208693
Add a little debug info
Mr0grog Feb 4, 2025
ee6e738
Simplify and make logic more literal
Mr0grog Feb 5, 2025
2b25922
Add more metadata about crawl names, record offsets
Mr0grog Feb 6, 2025
b835bbc
Fix mistake calculating `total`
Mr0grog Feb 6, 2025
f9e03ec
Fix response body reading
Mr0grog Feb 6, 2025
970ad57
Add option to update existing DB records
Mr0grog Feb 6, 2025
389b01d
Add some TODO notes for the future
Mr0grog Feb 6, 2025
bc68f47
Add graceful exit support
Mr0grog Feb 7, 2025
5f3a3df
Ugly fix for out-of-order records in redirects
Mr0grog Feb 9, 2025
fec405d
More checks
Mr0grog Feb 9, 2025
6dcc557
D'oh, forgot to seek
Mr0grog Feb 10, 2025
b43b0ec
More fixes
Mr0grog Feb 10, 2025
9111cd5
Switch to simpler but slower two-pass approach
Mr0grog Feb 10, 2025
dc7d6d9
Fix iteration for non-captured seeds
Mr0grog Feb 10, 2025
20e0690
Remove debug logging
Mr0grog Feb 10, 2025
23b762a
Delint
Mr0grog Feb 10, 2025
cefb7e2
Ensure absolute URLs for redirect lookups
Mr0grog Feb 10, 2025
cc98612
Fall back from http to https, fix potential loop issue
Mr0grog Feb 10, 2025
8e0dceb
Minor cleanup
Mr0grog Feb 11, 2025
e1d5b58
Slim metadata down a bit
Mr0grog Feb 11, 2025
1b098cb
Support multiple WARC files
Mr0grog Feb 11, 2025
5268696
Ensure header names are lower-cased
Mr0grog Feb 11, 2025
03a4d20
Add FIXME note about how we look for WARC records
Mr0grog Feb 14, 2025
dc59a3d
Fix lint error
Mr0grog Feb 17, 2025
22f2f59
Ensure Brotli support for WARC importing
Mr0grog Feb 19, 2025
1 change: 1 addition & 0 deletions requirements.txt
@@ -12,4 +12,5 @@ requests ~=2.32.3
 urllib3 ~=2.3.0
 toolz ~=1.0.0
 tqdm ~=4.67.1
+warcio[all] ~=1.7.5
 wayback ~=0.4.5
6 changes: 6 additions & 0 deletions scripts/warc_import
@@ -0,0 +1,6 @@
+#!/usr/bin/env python
+from web_monitoring.cli.warc_import import main
+
+
+if __name__ == '__main__':
+    main()
3 changes: 1 addition & 2 deletions web_monitoring/cli/cli.py
@@ -58,7 +58,6 @@
 import re
 import requests
 import sentry_sdk
-import signal
 import threading
 import time
 from tqdm import tqdm
@@ -667,7 +666,7 @@ def import_ia_urls(urls, *, from_date=None, to_date=None,
     worker_count = worker_count if worker_count > 0 else PARALLEL_REQUESTS
     unplaybackable = load_unplaybackable_mementos(unplaybackable_path)
 
-    with utils.QuitSignal((signal.SIGINT, signal.SIGTERM)) as stop_event:
+    with utils.QuitSignal() as stop_event:
         cdx_records = utils.FiniteQueue()
         cdx_thread = threading.Thread(target=lambda: utils.iterate_into_queue(
             cdx_records,
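
The `cli.py` change drops the explicit `(signal.SIGINT, signal.SIGTERM)` tuple from the `utils.QuitSignal` call, which suggests the helper now handles those two signals by default. The helper itself is not shown in this diff, so the following is only a minimal sketch under that assumption. `QuitSignalSketch` and `demo` are hypothetical names used for illustration, not the project's code.

```python
import signal
import threading


class QuitSignalSketch:
    """Hypothetical stand-in for utils.QuitSignal; not the project's code."""

    def __init__(self, signals=(signal.SIGINT, signal.SIGTERM)):
        # Assumption: the defaults match the tuple that the old call site
        # passed in explicitly.
        self.signals = tuple(signals)
        self.event = threading.Event()
        self._previous = {}

    def _handle(self, signum, frame):
        # Record the request to stop; workers poll the event and wind down.
        self.event.set()

    def __enter__(self):
        # signal.signal() may only be called from the main thread.
        for signum in self.signals:
            self._previous[signum] = signal.signal(signum, self._handle)
        return self.event

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore the handlers that were installed before entering.
        for signum, handler in self._previous.items():
            if handler is not None:
                signal.signal(signum, handler)
        self._previous.clear()
        return False


def demo():
    """Raise SIGINT inside the block and report the event state around it."""
    with QuitSignalSketch() as stop_event:
        before = stop_event.is_set()
        signal.raise_signal(signal.SIGINT)
        after = stop_event.is_set()
    return before, after
```

With defaults like these, a call site no longer needs `import signal` just to name the signals, which matches the import removed in the first hunk.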