Sorry message on sab.epa.gov should have 404 or 500 effective_status #1193

Open
Mr0grog opened this issue Feb 14, 2025 · 1 comment

Mr0grog commented Feb 14, 2025

A whole bunch of pages we track at sab.epa.gov, e.g. https://sab.epa.gov/ords/sab/f?p=100:18:16490947993:::18:P18_ID:2601, now display a “Sorry, this page isn’t available” message with a 200 status code. These should have an effective status that is an error (probably 404).

Version bb2b117f-cefb-4a61-92a1-a69fde42bda2 is a good example of this (right side of this comparison):

Comparison including the example bad version

Raw data:

{
    "uuid": "bb2b117f-cefb-4a61-92a1-a69fde42bda2",
    "page_uuid": "d1620a7d-557c-4517-89f7-53577d5d4e34",
    "capture_time": "2025-02-11T06:34:59.381Z",
    "body_url": "https://edgi-wm-archive.s3.amazonaws.com/38c575894d7cb32b1ddf42e85b2917a5b4e2c9dd06f1a86a85d79b31c76d998f",
    "body_hash": "38c575894d7cb32b1ddf42e85b2917a5b4e2c9dd06f1a86a85d79b31c76d998f",
    "source_type": "edgi_crawl_v0",
    "source_metadata": {
        "crawl": "edgi-active-urls--20250211033454--epa",
        "warc_name": "rec-259783d037a7-20250211033500606-0.warc.gz",
        "user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.1.75.175 Safari/537.36",
        "warc_records": [
            {
                "id": "<urn:uuid:3a6ec853-8b2d-4558-8748-b8a062252dcf>",
                "type": "response",
                "length": 2843,
                "offset": 845644033
            }
        ],
        "warc_record_meta": {
            "cert": {
                "ctc": "1",
                "issuer": "DigiCert Global G2 TLS RSA SHA256 2020 CA1"
            },
            "ipType": "Public"
        }
    },
    "created_at": "2025-02-11T16:34:40.869Z",
    "updated_at": "2025-02-13T08:29:06.042Z",
    "title": "Sorry, this page isn't available",
    "url": "https://sab.epa.gov/ords/sab/f?p=100:18:16490947993:::18:P18_ID:2601",
    "different": false,
    "status": 200,
    "content_length": 4983,
    "media_type": "text/html",
    "headers": {
        "Date": "Tue, 11 Feb 2025 06:34:59 GMT",
        "Server": "Apache",
        "Connection": "Keep-Alive",
        "Keep-Alive": "timeout=5, max=100",
        "X-Hostname": "wamwebprd1.epa.gov",
        "Content-Type": "text/html;charset=UTF-8",
        "Apex-Debug-Id": "46511301,level=ERROR",
        "X-Content-Type-Options": "nosniff, nosniff",
        "x-orig-Transfer-Encoding": "chunked",
        "Strict-Transport-Security": "max-age=63072000; includeSubdomains;, max-age=63072000; includeSubdomains; preload, max-age=63072000; includeSubDomains; preload"
    },
    "network_error": null
}

See also in the API at: https://api.monitoring.envirodatagov.org/api/v0/versions/bb2b117f-cefb-4a61-92a1-a69fde42bda2?different=false

The Apex-Debug-Id header field might be good to check here. Not sure if it’s always present for Oracle APEX apps or only on errors (looks like only errors, but it’s possible the working pages I’m looking at are from a different app/server behind the scenes). In any case, the level=ERROR string is a good giveaway! The title is another good signal, since we already have a lot of heuristics around that.
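A rough sketch of what that check could look like, in Python. This is not the existing implementation: the function name, the regexes, and the choice of 404 as the fallback are all assumptions for illustration, and it only relies on the `status`, `title`, and `headers` fields shown in the raw data above.

```python
import re

# Hypothetical patterns, not taken from the web-monitoring codebase.
# Matches the "level=ERROR" marker seen in the Apex-Debug-Id header above.
APEX_ERROR_PATTERN = re.compile(r"\blevel=ERROR\b", re.IGNORECASE)
# Matches the title of the example version, with or without the comma,
# and with a straight or curly apostrophe.
SORRY_TITLE_PATTERN = re.compile(
    r"^sorry,? this page isn['’]t available", re.IGNORECASE
)


def guess_effective_status(version: dict) -> int:
    """Return a best-guess effective status for a version record.

    Assumption: a 2xx/3xx response that carries an APEX error marker or the
    "Sorry" title is treated as a 404. Real error statuses pass through.
    """
    status = version.get("status") or 200
    if status >= 400:
        return status

    # Header names are case-insensitive, so normalize before looking up.
    headers = {
        str(name).lower(): str(value)
        for name, value in (version.get("headers") or {}).items()
    }
    apex_debug = headers.get("apex-debug-id", "")
    title = (version.get("title") or "").strip()

    if APEX_ERROR_PATTERN.search(apex_debug) or SORRY_TITLE_PATTERN.search(title):
        return 404
    return status
```

Either signal alone triggers the error status here. Requiring both would be stricter, which might be safer if it turns out Apex-Debug-Id also shows up with level=ERROR on pages that otherwise render fine.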

Mr0grog commented Feb 14, 2025

This will also need to be ported over to the task sheets analysis: https://github.com/edgi-govdata-archiving/web-monitoring-task-sheets/blob/b9ee485462113a8dabb320223b1f64bf7f879c07/analyst_sheets/analyze.py#L316-L330

(And obviously we ultimately need some way to unify these algorithms and heuristics, but that’s not as important as this.)

Mr0grog moved this to Inbox in Web Monitoring Feb 17, 2025
Mr0grog moved this from Inbox to Prioritized in Web Monitoring Feb 17, 2025