AI-enhanced site crawl analysis

✨ Description

The AI crawl analysis tool is a set of utilities that together interpret data generated from site crawls to inform content migrations. The tool enhances traditional site crawl analysis by gathering additional context about a site’s pages during the crawl, then performs further analysis on the extracted data to produce relevant content insights. The goal is to gather site information quickly and efficiently and to apply data-driven decisions to site migrations.

🚀 Features

  • Crawl Analysis: Analyze and extract insights from crawled datasets.
  • Deduplication: Remove duplicate items from columns to ensure data integrity (see the sketch after this list).
  • Header Cleaning: Standardize and clean crawl column headers in tabular data.
  • HTML Row Filtering: Filter out unwanted HTML rows from datasets.
  • AI Integration: Uses AI prompts during and after crawl analysis to enhance and validate crawl data.
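A minimal sketch of the deduplication and header-cleaning ideas, assuming a pandas-based workflow; the column names and behavior of the actual modules may differ:

    import pandas as pd

    def clean_headers(df: pd.DataFrame) -> pd.DataFrame:
        # Standardize crawl column headers: trim whitespace, lowercase, snake_case.
        df = df.copy()
        df.columns = (
            df.columns.str.strip()
            .str.lower()
            .str.replace(r"[^\w]+", "_", regex=True)
            .str.strip("_")
        )
        return df

    def deduplicate_column_items(df: pd.DataFrame, column: str) -> pd.DataFrame:
        # Keep only the first occurrence of each value in the given column.
        return df.drop_duplicates(subset=[column], keep="first")

    crawl = pd.read_csv("data/audit-inputs/sample-seed-fund.csv")  # example crawl export
    crawl = deduplicate_column_items(clean_headers(crawl), "address")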

📦 Requirements

  • Python 3.13+

  • uv (a modern Python package manager)

  • Install uv (if you don’t already have it)

    curl -LsSf https://astral.sh/uv/install.sh | sh
    # or
    pipx install uv
    # or
    pip install uv
    # or
    brew install uv

For more options, review the documentation for installing uv.

🔧 Getting Started

1. Set up development environment

  • Clone the repository:
    git clone https://github.com/civicactions/ai-migrations.git
    cd ai-migrations
  • Install dependencies:
    • Install the required Python version in a virtual environment:
      uv venv --python 3.13.0
    • Install the remaining dependencies directly from pyproject.toml:
      uv pip install -r pyproject.toml

2. Set up site crawls to return the expected CSV for the analysis:

This tool assumes there is a CSV file, generated from a site crawl, to analyze. The crawl file needs at least two columns: an 'address' column that lists the page URLs and a column with AI-generated insights. The default name for the AI-generated column is "Gemini: JSON schema v5", but it can be changed in the CSV expansion script.

For an example of running a site crawl that uses AI to generate additional page insights, review the Screaming Frog setup with Gemini documentation.
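As a quick sanity check before running the analysis, you can verify that the crawl export contains the two expected columns (a minimal sketch; the exact column casing depends on your crawler configuration):

    import pandas as pd

    # Default AI-insights column name; configurable in the CSV expansion script.
    REQUIRED_COLUMNS = ["address", "Gemini: JSON schema v5"]

    df = pd.read_csv("data/audit-inputs/sample-seed-fund.csv")
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Crawl CSV is missing required column(s): {missing}")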

3. Run the crawl analysis:

There are two options for running the crawl analysis: as an app in the browser or with Python scripts on the command line.

Running the analysis in the browser:

The app provides a visual interface for uploading a CSV file generated from a site crawl. It runs the analysis and provides CSV file downloads of the analyzed and grouped URLs.

Local development: In the command line, run the following:

  1. Initialize packages:
     uv sync
  2. Start the app:
     python -m streamlit run ai_crawl_analysis/streamlit_app.py
     - OR -
     uv run -m streamlit run ai_crawl_analysis/streamlit_app.py

This will open the app at http://localhost:8501/. Upload a CSV file to the upload field to start the analysis.
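For orientation, the upload-and-download flow the app provides looks roughly like the following (a simplified sketch; the real streamlit_app.py has its own layout and calls the project's analysis modules):

    import pandas as pd
    import streamlit as st

    uploaded = st.file_uploader("Upload a site crawl CSV", type="csv")
    if uploaded is not None:
        df = pd.read_csv(uploaded)
        st.write(f"Loaded {len(df)} rows from the crawl file")
        # ... the analysis and grouping steps run here ...
        st.download_button(
            "Download analyzed CSV",
            df.to_csv(index=False).encode("utf-8"),
            file_name="analyzed-crawl.csv",
            mime="text/csv",
        )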

Cloud environment (TBD; this will be updated when the tool is deployed.) Once deployed to the cloud, upload a CSV file to start the analysis.

Running the analysis in the command line:

The Python scripts provide a more granular method for executing the analysis. You can run all the steps or run individual steps for better control and debugging.

The processing scripts are structured as modules (a sketch of the first step follows the options below). You can:

  • Run them using uv run or standard Python module syntax
    uv run -m ai_crawl_analysis.main [path_to_crawl_file] (e.g. data/audit-inputs/sample-seed-fund.csv)
  • Run individual scripts with these commands:
      uv run -m ai_crawl_analysis.expand_json_csv
      uv run -m ai_crawl_analysis.deduplicate_column_items
      uv run -m ai_crawl_analysis.crawl_analysis

OR

  • Run them using Python directly
    • Activate the virtual environment
    • Then run python -m ai_crawl_analysis.main
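For orientation, the first step (expand_json_csv) flattens the AI-generated JSON column into regular columns before the later steps run. A minimal sketch of that idea with pandas; the real module's options and output columns may differ:

    import json

    import pandas as pd

    df = pd.read_csv("data/audit-inputs/sample-seed-fund.csv")

    # Parse the AI-generated JSON column (default name; configurable in the expansion script).
    insights = df["Gemini: JSON schema v5"].apply(
        lambda cell: json.loads(cell) if isinstance(cell, str) and cell.strip() else {}
    )

    # Flatten each JSON object into columns and join back to the crawl data.
    expanded = pd.concat([df, pd.json_normalize(insights.tolist())], axis=1)
    expanded.to_csv("expanded-crawl.csv", index=False)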

Environment variables

The crawl_analysis script requires an API_KEY environment variable. Edit the env.example file at the root of the project to add your AI API key.
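The scripts read the key from the environment at run time; a minimal sketch of how such a check typically looks (the actual handling in crawl_analysis may differ):

    import os

    # crawl_analysis expects an API_KEY environment variable for the AI service.
    api_key = os.environ.get("API_KEY")
    if not api_key:
        raise RuntimeError("API_KEY is not set; add it before running crawl_analysis.")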

🔄 Data processing workflow

A visualization of the project's data processing workflow:

flowchart TB
    subgraph Input["Input"]
        A[Raw Site Crawl CSV]
    end

    subgraph Step1["Step 1: Data Expansion & Cleaning"]
        B1[Expand JSON Columns]
        B2[Filter HTML Rows]
        B3[Clean Headers]
        A --> B1 --> B2 --> B3
    end

    subgraph Step2["Step 2: AI-powered Analysis"]
        C1[Extract Informational Columns]
        C2[AI Analysis of Site Structure]
        C3[Generate Migration Groups]
        C4[Analyze Page Sidebar Content]
        B3 --> C1 --> C2 --> C3 --> C4
    end

    subgraph Step3["Step 3: Data Organization & Export"]
        D1[Group site pages by Migration Path]
        D2[Generate Statistics]
        D3[Export Individual CSV Groups]
        D4[Create CSV with all data]
        C4 --> D1 --> D2 --> D3 --> D4
    end

    subgraph Output["Output Files"]
        E1[CSV: Expanded CSV with AI-generated page insights]
        E2[JSON: Recommended Migration Groups]
        E3[CSV: One file for each migration group]
        E4[CSV: All data in one sorted and grouped file]
        E5[CSV: Top-level Summary of global site insights]

        D4 --> E1 & E2 & E3 & E4 & E5
    end

    classDef inputStyle fill:#BDDB67,stroke:#91B62B,stroke-width:2px
    classDef processStyle fill:#2AB7CA,stroke:#1c7b87,stroke-width:2px
    classDef outputStyle fill:#FB5F2E,stroke:#dc3704,stroke-width:2px
    classDef aiStyle fill:#FDE74C,stroke:#dec402,stroke-width:2px

    class A inputStyle
    class E1,E2,E3,E4,E5 outputStyle
    class B1,B2,B3,D1,D2,D3,D4 processStyle
    class C1,C2,C3,C4 aiStyle

    style Input fill:#f6f6f6,stroke:#ccc
    style Step1 fill:#f6f6f6,stroke:#ccc
    style Step2 fill:#f6f6f6,stroke:#ccc
    style Step3 fill:#f6f6f6,stroke:#ccc
    style Output fill:#f6f6f6,stroke:#ccc

License

MIT License
