Clean up botched WARC imports from 2025-02-01 to 2025-02-04 #1201

Open
Mr0grog opened this issue Feb 19, 2025 · 0 comments

An early version of the WARC import script handled revisit records poorly and recorded some redirects as the final response, instead of following them through to the actual final response and recording that. The import script has since been fixed, but some versions imported from crawls between 2025-02-01 and 2025-02-04 still need remediation.
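For illustration, "following through to the actual final response" can be sketched roughly as below. This is not the import script's code; `resolve_final_response` and the `responses` mapping (URL to a dict with `status` and `headers`) are hypothetical names, and real WARC revisit records would first need to be resolved to the responses they refer to.

```python
from urllib.parse import urljoin

# HTTP status codes that redirect to the URL in the Location header.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def resolve_final_response(url, responses, max_hops=10):
    """Follow redirects through `responses` and return (final_url, response).

    `responses` is a hypothetical dict of URL -> {"status": int, "headers": dict}
    built from the records captured for one page. Header names are assumed
    to be lowercase, as in the API output.
    """
    seen = set()
    for _ in range(max_hops + 1):
        if url in seen:
            raise ValueError(f"redirect loop at {url}")
        seen.add(url)

        response = responses.get(url)
        if response is None:
            raise KeyError(f"no captured response for {url}")

        location = response["headers"].get("location")
        if response["status"] not in REDIRECT_STATUSES or not location:
            # Not a redirect (or nothing to follow): this is the final response.
            return url, response

        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)

    raise ValueError(f"more than {max_hops} redirects")
```

The version that gets recorded would then use the status, headers, and body of the returned response rather than those of the first hop.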

A good example is the left-hand side of this comparison: https://monitoring.envirodatagov.org/page/8b6cbc6d-ee2b-45af-8f8c-c005d997e800/f731df95-f9f5-44ce-9b24-a43a53166634..daf1d7c7-7739-491c-b346-b67de202af20

See the API for this version: https://api.monitoring.envirodatagov.org/api/v0/versions/f731df95-f9f5-44ce-9b24-a43a53166634

{
    "links": {
        "page": "https://api.monitoring.envirodatagov.org/api/v0/pages/8b6cbc6d-ee2b-45af-8f8c-c005d997e800",
        "previous": "https://api.monitoring.envirodatagov.org/api/v0/versions/1592d114-2588-41cf-a90d-91c497990409",
        "next": "https://api.monitoring.envirodatagov.org/api/v0/versions/e6703420-0776-4eb0-8efa-461a39f2a545"
    },
    "data": {
        "uuid": "f731df95-f9f5-44ce-9b24-a43a53166634",
        "page_uuid": "8b6cbc6d-ee2b-45af-8f8c-c005d997e800",
        "capture_time": "2025-02-02T07:57:47.456Z",
        "body_url": "https://edgi-wm-archive.s3.amazonaws.com/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        "body_hash": "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
        "source_type": "edgi_crawl_v0",
        "source_metadata": {
            "warc_name": "rec-8dfe9ae16db6-20250202010120189-0.warc.gz",
            "user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.1.73.104 Safari/537.36",
            "warc_page_id": "bbe297af-cd28-4dfd-b4ac-f6d8cdb01a09",
            "warc_record_ids": [
                "<urn:uuid:43a510da-54ef-43d5-9740-541250a4199d>",
                "<urn:uuid:47bab6e8-c1da-46de-910f-87df746f8885>"
            ],
            "warc_record_meta": {
                "cert": {
                    "ctc": "1",
                    "issuer": "GeoTrust RSA CA 2018"
                },
                "ipType": "Public"
            }
        },
        "created_at": "2025-02-04T08:58:11.544Z",
        "updated_at": "2025-02-04T08:58:11.544Z",
        "title": "",
        "url": "https://www.transportation.gov/careers/dot-deia-strategic-plan",
        "different": true,
        "status": 301,  // <------------------------------------------------ BAD!
        "content_length": 0,
        "media_type": "text/html",
        "headers": {
            "date": "Sun, 02 Feb 2025 07:57:47 GMT",
            "etag": "\"1738483067\"",
            "x-age": "0",
            "server": "nginx",
            "expires": "Sun, 02 Feb 2025 08:57:47 GMT",
            "location": "https://www.transportation.gov/",
            "x-generator": "Drupal 10 (https://www.drupal.org)",
            "content-type": "text/html; charset=utf-8",
            "x-request-id": "v-646efb1e-e13b-11ef-8a0d-7b5e4a4e7615",
            "cache-control": "public, max-age=3600",
            "last-modified": "Sun, 02 Feb 2025 07:57:47 GMT",
            "x-redirect-id": "155591",
            "content-length": "370",
            "x-frame-options": "SAMEORIGIN",
            "content-language": "en",
            "x-ah-environment": "prod",
            "x-content-type-options": "nosniff",
            "strict-transport-security": "max-age=31536000 ; includeSubDomains ; preload"
        },
        "network_error": null
    }
}
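One possible way to find candidates for remediation is to flag versions shaped like the one above. The sketch below is only a heuristic under stated assumptions: the function name `looks_botched` is made up, the date window is applied to `capture_time` (the issue does not say which timestamp should be used), and 2025-02-04 is treated as inclusive. Note that the `body_hash` in the example is the SHA-256 of an empty body, which is consistent with a redirect having been stored in place of the real content.

```python
import hashlib
from datetime import datetime, timezone

# SHA-256 of zero bytes; matches the body_hash in the example version.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()

# Assumed window: capture_time from 2025-02-01 through 2025-02-04 (UTC).
WINDOW_START = datetime(2025, 2, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2025, 2, 5, tzinfo=timezone.utc)  # exclusive


def parse_time(value):
    """Parse timestamps like "2025-02-02T07:57:47.456Z"."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def looks_botched(version):
    """Return True if a version's `data` dict looks like a recorded redirect."""
    status = version.get("status")
    headers = version.get("headers") or {}
    captured = parse_time(version["capture_time"])
    return (
        version.get("source_type") == "edgi_crawl_v0"
        and WINDOW_START <= captured < WINDOW_END
        and status is not None
        and 300 <= status < 400
        and "location" in headers
    )


def has_empty_body(version):
    """Return True if the stored body is empty (an extra signal, not proof)."""
    return version.get("body_hash") == EMPTY_SHA256
```

Some pages legitimately end in a redirect that the crawler could not follow, so anything flagged this way would still need to be checked against the original WARC before being re-imported or deleted.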
Projects
Status: Prioritized