A powerful Flask-based website crawler that allows you to extract and analyze content from websites with custom instructions.
- Website Crawling: Input any URL and extract structured content including text, links, and metadata
- Custom Instructions: Provide specific instructions to tailor the crawling process to your needs
- Depth Control: Configure how deep the crawler should go into linked pages
- Crawl History: Keep track of your previous crawls and easily access results
- Structured Results: View and analyze crawled content in a clean, structured format
- Export Options: Download crawl results as CSV or JSON for further analysis or integration
- Backend: Flask (Python)
- Database: PostgreSQL
- UI: Bootstrap CSS with dark theme
- Text Extraction: Trafilatura
- HTML Parsing: BeautifulSoup4
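Trafilatura strips boilerplate to recover the main text of a page, while BeautifulSoup4 handles link discovery. A minimal sketch of how the two libraries are typically combined for a single page (illustrative only; the user-agent string is a placeholder and the project's actual `crawler.py` may differ):

```python
import requests
import trafilatura
from bs4 import BeautifulSoup

def fetch_page(url: str) -> dict:
    """Fetch one page, extract its main text and the links it contains."""
    response = requests.get(url, headers={"User-Agent": "WebCrawlerBot/1.0"}, timeout=10)
    response.raise_for_status()

    # Trafilatura returns the boilerplate-free article text, or None if nothing is found.
    text = trafilatura.extract(response.text) or ""

    # BeautifulSoup collects every hyperlink on the page for further crawling.
    soup = BeautifulSoup(response.text, "html.parser")
    links = [a["href"] for a in soup.find_all("a", href=True)]

    return {"url": url, "text": text, "links": links}
```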
- Clone the repository:
  ```bash
  git clone <repository-url>
  cd web-crawler
  ```
- Install dependencies:
  ```bash
  pip install beautifulsoup4 email-validator flask flask-sqlalchemy gunicorn psycopg2-binary requests sqlalchemy trafilatura werkzeug
  ```
- Set up environment variables (consumed by the app as sketched below):
  ```bash
  export DATABASE_URL=postgresql://username:password@localhost/dbname
  export FLASK_SECRET_KEY=your_secret_key
  ```
- Initialize the database:
  ```bash
  flask db upgrade
  ```
- Run the application:
  ```bash
  python main.py
  ```
- Visit `http://localhost:5000` in your browser
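The `DATABASE_URL` and `FLASK_SECRET_KEY` variables set above are typically read into the Flask configuration at startup. A minimal sketch assuming a standard Flask-SQLAlchemy setup (the project's actual `app.py` may differ):

```python
import os

from flask import Flask
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)

# Both values come from the environment variables exported during setup.
app.secret_key = os.environ["FLASK_SECRET_KEY"]
app.config["SQLALCHEMY_DATABASE_URI"] = os.environ["DATABASE_URL"]

db = SQLAlchemy(app)
```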
- Enter a URL: Provide the website URL you want to crawl
- Add Instructions (Optional): Specify any custom instructions for the crawl
- Submit: Click "Start Crawling" to begin the process
- View Results: Analyze the structured results including extracted text, links, and metadata
- Access History: View and revisit previous crawls from the history section
- Export Data: Download crawl results in CSV or JSON format for further analysis
- CSV Export: Provides a well-structured, human-readable format with multiple sections:
  - Crawl Information (URL, Date, Instructions)
  - Site Metadata (Title, Description, Keywords)
  - Crawl Statistics (Pages, Links, Text Length, Crawl Time)
  - Links Discovered (URL, Text, Depth, Type, Status)
  - Pages Content (URL, Title, Depth, Text Sample)
- JSON Export: Provides the complete raw data structure with all crawled information, ideal for programmatic access and integration with other systems.
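As an illustration, the JSON export route (`GET /export/<crawl_id>/json`, listed under the endpoints below) could be implemented roughly as follows. This is a sketch only; the module layout, the `CrawlHistory` model name, and the serialization details are assumptions rather than the project's actual code:

```python
import json

from flask import Response

from app import app                # assumed: Flask app defined in app.py
from models import CrawlHistory    # assumed: model behind the crawl_history table

@app.route("/export/<int:crawl_id>/json")
def export_json(crawl_id):
    crawl = CrawlHistory.query.get_or_404(crawl_id)

    # result_data holds the complete crawl results; serialize it unless it is
    # already stored as a JSON string.
    body = crawl.result_data if isinstance(crawl.result_data, str) \
        else json.dumps(crawl.result_data, indent=2)

    return Response(
        body,
        mimetype="application/json",
        headers={"Content-Disposition": f"attachment; filename=crawl_{crawl_id}.json"},
    )
```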
- `main.py`: Entry point for the application
- `app.py`: Core Flask application setup with routes and database initialization
- `models.py`: Database models for storing crawl data
- `crawler.py`: Implementation of the web crawling functionality
- `templates/`: HTML templates for the web interface
  - `layout.html`: Base template with common elements
  - `index.html`: Main page with crawl form and history
  - `results.html`: Detailed results display
- `static/`: Static assets like CSS and JavaScript
  - `css/custom.css`: Custom styling for the application
  - `js/main.js`: Client-side functionality
- `GET /`: Main page with crawl form and history
- `POST /crawl`: Submit a URL and instructions for crawling
- `GET /results`: View results of the most recent crawl
- `GET /results/<crawl_id>`: View results of a specific crawl by ID
- `GET /export/<crawl_id>/json`: Download crawl results as a JSON file
- `GET /export/<crawl_id>/csv`: Download crawl results as a CSV file
- `POST /api/check-url`: API endpoint to validate URL format
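The endpoints can also be exercised programmatically. A small example using `requests` (the form and JSON field names are assumptions based on the UI described above):

```python
import requests

BASE = "http://localhost:5000"

# Optionally validate the URL first (payload shape is assumed).
check = requests.post(f"{BASE}/api/check-url", json={"url": "https://example.com"})
print(check.json())

# Start a crawl; "url" and "instructions" mirror the form fields on the main page.
requests.post(f"{BASE}/crawl", data={
    "url": "https://example.com",
    "instructions": "Focus on article text only",
})

# Download the results of crawl 1 as JSON.
result = requests.get(f"{BASE}/export/1/json")
print(result.json())
```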
- Be specific with your URLs - include the full path for more targeted results
- Use custom instructions to tailor the crawl to your needs
- For large websites, limit the depth and number of pages
- Some websites may block or limit crawling - respect their terms of service
- For better text extraction, include specific instructions
The crawler implements several features to ensure responsible web crawling:
- Respects robots.txt rules
- Implements rate limiting to avoid overloading target websites
- Keeps track of visited URLs to avoid duplicate requests
- Properly identifies itself with user-agent headers
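In outline, these safeguards combine a robots.txt check, a fixed delay between requests, a visited-URL set, and an explicit user-agent. A simplified sketch (not the project's actual `crawler.py`; the user-agent string and delay are illustrative values):

```python
import time
import urllib.robotparser
from urllib.parse import urljoin, urlparse

import requests

USER_AGENT = "WebCrawlerBot/1.0"   # illustrative identifier sent with every request
REQUEST_DELAY = 1.0                # seconds between requests (rate limiting)

visited = set()                    # tracks visited URLs to avoid duplicate requests

def allowed_by_robots(url: str) -> bool:
    """Check the target site's robots.txt before fetching a URL."""
    parts = urlparse(url)
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(urljoin(f"{parts.scheme}://{parts.netloc}", "/robots.txt"))
    parser.read()
    return parser.can_fetch(USER_AGENT, url)

def polite_fetch(url: str):
    """Fetch a URL only if it is new and allowed, pausing between requests."""
    if url in visited or not allowed_by_robots(url):
        return None
    visited.add(url)
    time.sleep(REQUEST_DELAY)
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
```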
The application uses a PostgreSQL database with the following main table:
- `crawl_history`: Stores information about crawls with the following columns:
  - `id`: Unique identifier for each crawl
  - `url`: The URL that was crawled
  - `instructions`: Custom instructions provided for the crawl
  - `result_summary`: Brief summary of crawl results (number of links, word count)
  - `result_data`: Complete crawl results stored as JSON
  - `timestamp`: When the crawl was performed
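A corresponding SQLAlchemy model might look like the following; the column types and import path are assumptions, so consult `models.py` for the actual definitions:

```python
from datetime import datetime

from app import db  # assumed: the Flask-SQLAlchemy instance created in app.py


class CrawlHistory(db.Model):
    __tablename__ = "crawl_history"

    id = db.Column(db.Integer, primary_key=True)           # unique identifier for each crawl
    url = db.Column(db.String(2048), nullable=False)       # the URL that was crawled
    instructions = db.Column(db.Text)                       # custom crawl instructions
    result_summary = db.Column(db.String(512))              # brief summary (links, word count)
    result_data = db.Column(db.JSON)                        # complete crawl results as JSON
    timestamp = db.Column(db.DateTime, default=datetime.utcnow)
```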
The application uses database storage instead of session cookies for crawl results, which allows for:
- Persistence of large crawl results
- History tracking
- Sharing of results via URLs
- Improved performance with large datasets
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.