HUD Web Archive Viewer

This project provides a simple web application for viewing WACZ (Web Archive Collection Zipped) files using the replayweb.page library. Its primary purpose is to host static web archives that serve as stable testing environments for browser-based AI agents, particularly those developed with the HUD SDK.

Features

- Displays WACZ files directly in the browser.
- Uses clean URLs for accessing different archives (e.g., /my-archive loads archives/my-archive.wacz).
- Supports a ?page=<url-encoded-page-in-archive> query parameter to open a specific page within an archive.
- Supports a ?debug=true query parameter to show the full replayweb.page UI for debugging.
- Archives can have a default startPage defined in archives/archive_list.json.
- Includes a GitHub Actions workflow for easy deployment to GitHub Pages, making archives web-accessible.
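
The URL conventions listed above can be illustrated with a short Python sketch. This is not the project's routing code; the function name `resolve_viewer_url` is hypothetical, and the sketch assumes that the last path segment of the URL is the archive name.

```python
from urllib.parse import parse_qs, urlsplit


def resolve_viewer_url(url):
    """Map a viewer URL to the archive file and options it refers to.

    Assumption (for illustration only): the last path segment is the archive name.
    """
    parts = urlsplit(url)
    name = parts.path.rstrip("/").rsplit("/", 1)[-1]
    query = parse_qs(parts.query)  # parse_qs also percent-decodes the values
    return {
        "archive_file": f"archives/{name}.wacz" if name else None,
        "page": query.get("page", [None])[0],
        "debug": query.get("debug", ["false"])[0] == "true",
    }


print(resolve_viewer_url("/my-archive?page=https%3A%2F%2Fexample.com%2Fa.html&debug=true"))
# {'archive_file': 'archives/my-archive.wacz', 'page': 'https://example.com/a.html', 'debug': True}
```

Note that the `page` value comes back decoded, which is why it must be URL-encoded when it is placed in the query string.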

Using Archives with the HUD SDK

Web archives provide consistent, offline-first environments for testing web-based agents. For example, if you have an archive named my-test-site deployed here, it will be accessible at https://hud-evals.github.io/page-archives/my-test-site. Here's how you might use it in a hud.Task:

```
from hud.task import Task

login_task = Task(
    prompt="Log into the website using username 'testuser' and password 'password123'.",
    gym="hud-browser", # Or your relevant browser-based gym
    setup=(
        "goto", "https://hud-evals.github.io/page-archives/my-test-site"
    ),
    evaluate=(
        "page_contains", "Welcome, testuser!"
    )
)

# You can then run this task with your agent:
# from hud import run_job, YourAgent
# await run_job(YourAgent, [login_task], "my-archived-site-login-test")
```

This allows you to create reliable test scenarios for your agents against specific, unchanging versions of web pages.

How to Create Web Archives (WACZ files)

To create the .wacz files that this viewer uses, you can use the ArchiveWeb.page browser extension or desktop application. It allows you to interactively capture websites as you browse.

Full Guide: For detailed instructions on creating archives, please refer to the official ArchiveWeb.page User Guide.
Basic Steps with ArchiveWeb.page extension:
1. Install the ArchiveWeb.page extension (Chromium-based browsers).
2. Open the extension and create a new collection.
3. Start an archiving session.
4. Browse the web pages you want to capture.
5. Stop the session.
6. Download your collection. It will typically download as a .wacz file.

For automated, large-scale crawling, consider Browsertrix.
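
Because a WACZ file is a ZIP container, a downloaded archive can be sanity-checked before it is added to this repository. The following is a minimal sketch, assuming the usual WACZ layout (a `datapackage.json` at the root, WARC data under `archive/`, and a page list at `pages/pages.jsonl`); the helper name `summarize_wacz` is hypothetical and not part of any tool mentioned here.

```python
import zipfile


def summarize_wacz(path):
    """Return basic facts about a WACZ file by listing its ZIP entries."""
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
    return {
        # Package manifest expected at the root of the container
        "has_datapackage": "datapackage.json" in names,
        # Raw capture data, stored as WARC files
        "warc_files": [
            n for n in names
            if n.startswith("archive/") and n.endswith((".warc", ".warc.gz"))
        ],
        # List of captured pages used by replay tools
        "has_page_list": "pages/pages.jsonl" in names,
    }
```

For example, `summarize_wacz("archives/my-cool-site.wacz")` should report at least one WARC file for a usable capture. A file that fails to open as a ZIP raises `zipfile.BadZipFile`, which usually indicates an incomplete download.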

Adding Your Archives

Place WACZ Files:
- Put your .wacz files into the archives/ directory.
- For example, if your archive is named my-cool-site.wacz, place it in archives/my-cool-site.wacz.
Update archives/archive_list.json:
- This file provides a list of your archives for the homepage and can define a default starting page for each.
- Edit archives/archive_list.json and add an entry for each of your archives. The name field must match the WACZ filename without the .wacz extension.
- Example archives/archive_list.json entry:
```
{
    "archives": [
        {
            "name": "my-cool-site",
            "displayName": "My Cool Site Archive",
            "startPage": "https://my-cool-site.com/index.html"
        },
        {
            "name": "another-one",
            "displayName": "Another Great Archive"
        }
    ]
}
```
- The displayName is what appears in the list on the homepage.
- The startPage is optional. If provided, accessing /my-cool-site will attempt to open this specific page from the archive. A ?page= URL parameter, when present, is used instead of startPage. If neither is given, the archive's default page is used.
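
Since each name field must match a WACZ filename, a small consistency check can catch mismatches before deployment. The sketch below is illustrative and not part of this repository; the function name `check_archive_list` is hypothetical, and it assumes the file layout described above. It uses Python's strict JSON parser, so a file containing `//` comments will be rejected with an error.

```python
import json
from pathlib import Path


def check_archive_list(archives_dir="archives"):
    """Return a list of problems found in archive_list.json (empty if none)."""
    root = Path(archives_dir)
    data = json.loads((root / "archive_list.json").read_text(encoding="utf-8"))
    entries = data["archives"]
    problems = []
    # Every listed name needs a matching <name>.wacz file
    for entry in entries:
        name = entry.get("name")
        if not name:
            problems.append(f"entry without a name: {entry!r}")
        elif not (root / f"{name}.wacz").is_file():
            problems.append(f"{name}: no matching {name}.wacz file")
    # Every .wacz file should appear in the list so the homepage shows it
    listed = {entry.get("name") for entry in entries}
    for wacz in sorted(root.glob("*.wacz")):
        if wacz.stem not in listed:
            problems.append(f"{wacz.name}: not listed in archive_list.json")
    return problems
```

Running `check_archive_list()` from the repository root returns an empty list when the JSON file and the archives/ directory agree.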

Local Development Setup

Prerequisites:
- Node.js and npm installed.

Clone the repository:

```
git clone https://github.com/hud-evals/page-archives.git
cd page-archives
```

Install dependencies:
```
npm install
```
Run the development server:
```
npm run dev
```
This will start an Express.js server (usually at http://localhost:3000) that handles the clean URLs.

Viewing Archives Locally

- Homepage (List of Archives): http://localhost:3000/
- Specific Archive (using its default or startPage): http://localhost:3000/my-cool-site
- Specific Page within an Archive: http://localhost:3000/my-cool-site?page=https%3A%2F%2Fmy-cool-site.com%2Fspecific-article.html (ensure the page URL is URL-encoded).
- Debug Mode (shows ReplayWeb.page UI): http://localhost:3000/my-cool-site?debug=true
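
Encoding the page URL by hand is error-prone, so the viewer URLs above can also be generated. This is a minimal sketch using only the Python standard library; the helper name `viewer_url` and its parameters are hypothetical and chosen for illustration.

```python
from urllib.parse import quote, urlencode


def viewer_url(base, archive, page=None, debug=False):
    """Build a viewer URL with an optional encoded ?page= and ?debug=true."""
    params = {}
    if page:
        params["page"] = page
    if debug:
        params["debug"] = "true"
    url = f"{base.rstrip('/')}/{quote(archive)}"
    if params:
        # safe="" makes quote() encode ':' and '/' inside the page URL as well
        url += "?" + urlencode(params, quote_via=quote, safe="")
    return url


print(viewer_url(
    "http://localhost:3000",
    "my-cool-site",
    page="https://my-cool-site.com/specific-article.html",
))
# http://localhost:3000/my-cool-site?page=https%3A%2F%2Fmy-cool-site.com%2Fspecific-article.html
```

The same helper works for a deployed site by passing its base URL, for example `https://hud-evals.github.io/page-archives`.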

Enjoy creating and viewing your web archives for robust agent testing!
