The AI crawl analysis tool is a set of utilities that together interpret data generated from site crawls to inform content migrations. The tool enhances traditional site crawl analysis by gathering additional context about a site’s pages during the crawl, then performs further analysis on the extracted data to provide relevant content insights. The goal is to gather site information quickly and efficiently and to apply data-driven decisions to site migrations.
- Crawl Analysis: Analyze and extract insights from crawled datasets.
- Deduplication: Remove duplicate items from columns to ensure data integrity (see the sketch after this list).
- Header Cleaning: Standardize and clean crawl column headers in tabular data.
- HTML Row Filtering: Filter out unwanted HTML rows from datasets.
- AI Integration: Uses AI prompts during and after crawl analysis to enhance and validate crawl data.
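As a rough illustration of the deduplication step, the sketch below removes repeated items from a delimited column using pandas. The file name and column name are placeholders, and the project's deduplicate_column_items module may implement this differently.

```python
# Illustrative sketch only, not the tool's actual implementation:
# drop repeated items from a delimited cell while preserving their order.
# "crawl.csv" and the "topics" column are placeholder names.
import pandas as pd

def dedupe_cell(value, sep: str = ", "):
    if not isinstance(value, str):
        return value  # leave NaN / non-string cells untouched
    seen = []
    for item in value.split(sep):
        if item and item not in seen:
            seen.append(item)
    return sep.join(seen)

df = pd.read_csv("crawl.csv")
df["topics"] = df["topics"].map(dedupe_cell)
df.to_csv("crawl-deduplicated.csv", index=False)
```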
- Python 3.13+
- uv (a modern Python package manager)

- Install uv (if you don’t already have it):
curl -LsSf https://astral.sh/uv/install.sh | sh
# or
pipx install uv
# or
pip install uv
# or
brew install uv
For more options, review the documentation for installing uv.
- Clone the repository:
git clone https://github.com/civicactions/ai-migrations.git
cd ai-migrations
- Install dependencies:
- Install the required Python version in a virtual env:
uv venv --python 3.13.0
- Install other dependencies directly from pyproject.toml:
uv pip install -r pyproject.toml
This tool assumes you have a CSV file, generated from a site crawl, to analyze. The crawl file needs at least two columns: an 'address' column that lists the page URLs and a column containing AI-generated insights. The default name for the AI-generated column is "Gemini: JSON schema v5", but it can be changed in the CSV expansion script.
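As a quick sanity check before running the analysis, something like the sketch below can confirm the input file has the expected columns. The file path is a placeholder; the column names come from the description above, and the insights column should match whatever is configured in the CSV expansion script.

```python
# Minimal sketch (not part of the tool) to verify a crawl CSV has the expected columns.
# "my-crawl.csv" is a placeholder path; adjust the insights column name if you
# changed it in the CSV expansion script.
import pandas as pd

required = ["address", "Gemini: JSON schema v5"]

df = pd.read_csv("my-crawl.csv")
missing = [col for col in required if col not in df.columns]
if missing:
    raise SystemExit(f"Crawl CSV is missing required column(s): {missing}")
print(f"OK: {len(df)} rows, columns: {list(df.columns)}")
```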
For an example of running a site crawl that uses AI to generate additional page insights, review the Screaming Frog setup with Gemini documentation.
There are two options for running the crawl analysis: as an app in the browser, or with Python scripts on the command line.
The app provides a visual interface for uploading a CSV file generated from a site crawl. It runs the analysis and provides CSV file downloads of the analyzed and grouped URLs.
Local development: In the command line, run the following:
- Initialize packages:
uv sync
- Start the app:
python -m streamlit run ai_crawl_analysis/streamlit_app.py
- OR -
uv run -m streamlit run ai_crawl_analysis/streamlit_app.py
This will open the app at http://localhost:8501/. Upload a CSV file to the upload field to start the analysis.
Cloud environment: (This is TBD and will be updated when the tool gets deployed.) When this is deployed to the cloud, upload a CSV file to start the analysis.
The Python scripts provide a more granular method for executing the analysis. You can run all the steps or run individual steps for better control and debugging.
The processing scripts are structured as modules. You can:
- Run them using uv run or standard Python module syntax:
uv run -m ai_crawl_analysis.main [path_to_crawl_file] (e.g. data/audit-inputs/sample-seed-fund.csv)
- Run individual scripts with these commands (a chained example follows this list):
uv run -m ai_crawl_analysis.expand_json_csv
uv run -m ai_crawl_analysis.deduplicate_column_items
uv run -m ai_crawl_analysis.crawl_analysis
OR
- Run them using Python directly
- Activate the virtual environment
- Then run
python -m ai_crawl_analysis.main
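If you prefer to script the individual steps rather than type them by hand, one option is a small wrapper that invokes the documented module commands in order. This is only a convenience sketch and assumes each module runs without extra arguments, as shown above.

```python
# Convenience sketch: run the documented processing modules in sequence.
# It shells out to the same "uv run -m ..." commands listed above; any
# environment setup the modules need (e.g. API_KEY) still applies.
import subprocess

STEPS = [
    "ai_crawl_analysis.expand_json_csv",
    "ai_crawl_analysis.deduplicate_column_items",
    "ai_crawl_analysis.crawl_analysis",
]

for module in STEPS:
    print(f"Running {module} ...")
    subprocess.run(["uv", "run", "-m", module], check=True)
```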
The crawl_analysis script requires an API_KEY environment variable. Edit the env.example file at the root of the project to add your AI API key.
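For reference, the sketch below shows one common way a script can pick up that variable. Whether crawl_analysis loads it exactly like this is an assumption; only the API_KEY name comes from this README.

```python
# Sketch of reading the API_KEY environment variable.
# The exact loading mechanism used by crawl_analysis is an assumption here;
# only the variable name (API_KEY) comes from this README.
import os

api_key = os.environ.get("API_KEY")
if not api_key:
    raise SystemExit("API_KEY is not set. Add it to your environment (see env.example).")
```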
A visualization of the project's data processing workflow:
flowchart TB
subgraph Input["Input"]
A[Raw Site Crawl CSV]
end
subgraph Step1["Step 1: Data Expansion & Cleaning"]
B1[Expand JSON Columns]
B2[Filter HTML Rows]
B3[Clean Headers]
A --> B1 --> B2 --> B3
end
subgraph Step2["Step 2: AI-powered Analysis"]
C1[Extract Informational Columns]
C2[AI Analysis of Site Structure]
C3[Generate Migration Groups]
C4[Analyze Page Sidebar Content]
B3 --> C1 --> C2 --> C3 --> C4
end
subgraph Step3["Step 3: Data Organization & Export"]
D1[Group site pages by Migration Path]
D2[Generate Statistics]
D3[Export Individual CSV Groups]
D4[Create CSV with all data]
C4 --> D1 --> D2 --> D3 --> D4
end
subgraph Output["Output Files"]
E1[CSV: Expanded CSV with AI-generated page insights]
E2[JSON: Recommended Migration Groups]
E3[CSV: One file for each migration group]
E4[CSV: All data in one sorted and grouped file]
E5[CSV: Top-level Summary of global site insights]
D4 --> E1 & E2 & E3 & E4 & E5
end
classDef inputStyle fill:#BDDB67,stroke:#91B62B,stroke-width:2px
classDef processStyle fill:#2AB7CA,stroke:#1c7b87,stroke-width:2px
classDef outputStyle fill:#FB5F2E,stroke:#dc3704,stroke-width:2px
classDef aiStyle fill:#FDE74C,stroke:#dec402,stroke-width:2px
class A inputStyle
class E1,E2,E3,E4,E5 outputStyle
class B1,B2,B3,D1,D2,D3,D4 processStyle
class C1,C2,C3,C4 aiStyle
style Input fill:#f6f6f6,stroke:#ccc
style Step1 fill:#f6f6f6,stroke:#ccc
style Step2 fill:#f6f6f6,stroke:#ccc
style Step3 fill:#f6f6f6,stroke:#ccc
style Output fill:#f6f6f6,stroke:#ccc
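After a run completes, the grouped CSV outputs can be inspected with a few lines of pandas. The directory and file pattern below are placeholders, since the exact output paths depend on your setup; the idea is simply to count pages in each exported migration group.

```python
# Hypothetical post-run check: count rows in each exported migration-group CSV.
# "output/" and the *.csv pattern are placeholder paths, not paths defined by the tool.
from pathlib import Path
import pandas as pd

for csv_path in sorted(Path("output").glob("*.csv")):
    df = pd.read_csv(csv_path)
    print(f"{csv_path.name}: {len(df)} pages")
```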
MIT License