apache-superset/stats-scraper

Stats Scraper

This project automatically collects and stores statistics about GitHub repositories, Slack workspaces, and Matomo analytics data.

Features

  • Scrapes GitHub data:
    • Star ranking data
    • Repository visitor statistics
    • First-time PR contributors
    • First-time issue creators
    • Repository activity metrics (open PRs, open issues, open discussions)
    • Repository events (all GitHub events for the repository)
    • Repository releases (release information including tags, assets, and SHA)
    • Open issues analysis (detailed statistics about open issues)
    • Open PRs analysis (detailed statistics about open pull requests)
  • Slack workspace statistics:
    • Member count tracking
    • Active channel count
    • Channel-specific statistics (member counts per channel)
  • Matomo analytics data:
    • Visits and unique visitors
    • Visitor map data by country/region
    • Top pages by visits and engagement metrics
  • Supports multiple database backends:
    • MotherDuck (default)
    • PostgreSQL
    • SQLite
    • Amazon RDS
  • Configurable via YAML configuration file or environment variables
  • Runs automatically via GitHub Actions on a daily schedule

Requirements

  • Python 3.10+
  • GitHub API token
  • Database credentials (MotherDuck token by default)
  • Slack API token (for Slack workspace statistics) with the following scopes:
    • users:read (for member count)
    • channels:read (for channel list and member counts)
  • Matomo API token (optional, for Matomo analytics)

Setup

  1. Clone this repository

  2. Initialize git repository (if not already done):

    git init
    
  3. Install dependencies:

    pip install -r requirements.txt
    
  4. Create a configuration file:

    cp config.yaml.example config.yaml
    

    Edit the config.yaml file to configure your target repository and database settings.

  5. Set up environment variables for API tokens:

    export GITHUB_TOKEN=your_github_token
    export MOTHERDUCK_TOKEN=your_motherduck_token
    export SLACK_API_TOKEN=your_slack_api_token
    export MATOMO_KEY=your_matomo_key
    

    Alternatively, you can create a .env file by copying the template:

    cp .env.example .env
    

    Then edit the .env file with your actual tokens.

    Important Note about MotherDuck Token: The MotherDuck token must be in JWT format, which contains two dots separating three sections (Header.Payload.Signature). If you're getting authentication errors, check that your token is in the correct format.
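
The format rule above can be checked before any database call is made. The following is a minimal sketch, not the project's own validation code; the function name looks_like_jwt is hypothetical. It only checks the shape of the token (three non-empty sections separated by two dots) and does not verify the signature or the token's validity.

```python
def looks_like_jwt(token: str) -> bool:
    """Return True if the token has three non-empty, dot-separated sections.

    This is a shape check only (Header.Payload.Signature); it does not
    decode the sections or verify the signature.
    """
    parts = token.strip().split(".")
    return len(parts) == 3 and all(parts)
```

A token that fails this check will produce the authentication error described in the Troubleshooting section, so running it first can give a clearer message.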

Configuration

The project can be configured using a config.yaml file or environment variables. The configuration file allows you to specify:

  • Target GitHub repository (owner and repo name)
  • Logging level and format
  • Database type and connection details
  • Table names
  • Matomo site ID and URL
  • Slack workspace ID

Example configuration:

# Logging configuration
logging:
  level: "ERROR"  # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# GitHub configuration
github:
  owner: "apache"
  repo: "superset"
  api_url: "https://api.github.com"

# Database configuration
database:
  type: "motherduck"  # Options: motherduck, postgresql, sqlite, rds
  connection:
    motherduck:
      database: "superset_stats"
    postgresql:
      host: "localhost"
      port: 5432
      database: "superset_stats"
      username: "postgres"
      password: ""
    sqlite:
      path: "superset_stats.db"
    rds:
      host: "your-rds-instance.amazonaws.com"
      port: 5432
      database: "superset_stats"
      username: "admin"
      password: ""

Environment variables can be used to override configuration values:

  • LOG_LEVEL: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
  • GITHUB_OWNER: GitHub repository owner
  • GITHUB_REPO: GitHub repository name
  • DATABASE_TYPE: Database type (motherduck, postgresql, sqlite, rds)
  • MOTHERDUCK_DATABASE: MotherDuck database name
  • MATOMO_BASE_URL: Matomo API base URL
  • MATOMO_SITE_ID: Matomo site ID
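
As an illustration of how such overrides can work, the sketch below applies environment variables on top of an already loaded configuration dictionary. It is not the project's config.py; the function name apply_env_overrides and the mapping are assumptions. The key paths follow the example configuration above. The two Matomo variables are left out because the example does not show where Matomo settings live in the file.

```python
import os
from typing import Mapping, Optional

# Environment variable -> path into the config dictionary.
# The paths mirror the example config.yaml; they are assumptions about
# how the real loader maps variables to keys.
ENV_OVERRIDES = {
    "LOG_LEVEL": ("logging", "level"),
    "GITHUB_OWNER": ("github", "owner"),
    "GITHUB_REPO": ("github", "repo"),
    "DATABASE_TYPE": ("database", "type"),
    "MOTHERDUCK_DATABASE": ("database", "connection", "motherduck", "database"),
}


def apply_env_overrides(
    config: dict, environ: Optional[Mapping[str, str]] = None
) -> dict:
    """Overwrite config values in place with any non-empty environment values."""
    environ = os.environ if environ is None else environ
    for name, path in ENV_OVERRIDES.items():
        value = environ.get(name)
        if not value:
            continue  # variable unset or empty: keep the file's value
        node = config
        for key in path[:-1]:
            node = node.setdefault(key, {})  # create missing sections
        node[path[-1]] = value
    return config
```

Passing the environment as a parameter keeps the function easy to test; calling it with no second argument reads from os.environ.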

Running Locally

You can run each scraper individually:

# Star ranking scraper
python scripts/star_ranking.py

# Repository visitors scraper
python scripts/repo_visitors.py

# New contributors scraper
python scripts/new_contributors.py

# New issue creators scraper
python scripts/new_issue_creators.py

# Repository activity scraper
python scripts/repo_activity.py

# Repository events scraper
python scripts/repo_events.py

# Repository releases scraper
python scripts/repo_releases.py

# Open issues analysis scraper
python scripts/open_issues.py

# Open PRs analysis scraper
python scripts/open_prs.py

# Slack workspace stats scraper
python scripts/slack_workspace_stats.py

# Slack channel stats scraper
python scripts/slack_channel_stats.py

# Matomo analytics scraper
python scripts/matomo_analytics.py

# Matomo visitor map scraper
python scripts/matomo_visitor_map.py

# Matomo top pages scraper
python scripts/matomo_top_pages.py

# Community calendar scraper
python scripts/community_calendar.py

# Kapa activity scraper
python scripts/kapa_activity.py

Each script will:

  1. Load configuration from config.yaml and environment variables
  2. Connect to the configured database
  3. Fetch data from the appropriate API
  4. Process and analyze the data
  5. Store the results in the database
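
The five steps above can be sketched as a single function. This is an illustrative skeleton only: load_config and fetch_activity are hypothetical stand-ins (the real scripts use the project's configuration system, database layer, and API clients), the table name is made up, the returned numbers are placeholders rather than real data, and an in-memory SQLite database is used so the example runs without credentials.

```python
import sqlite3
from datetime import date


def load_config() -> dict:
    # Stand-in for reading config.yaml plus environment variables.
    return {
        "github": {"owner": "apache", "repo": "superset"},
        "database": {"type": "sqlite", "connection": {"sqlite": {"path": ":memory:"}}},
    }


def fetch_activity(owner: str, repo: str) -> dict:
    # Stand-in for an API call; the values are placeholders, not real data.
    return {"open_prs": 3, "open_issues": 5}


def run(fetch=fetch_activity) -> list:
    config = load_config()                                   # 1. load configuration
    path = config["database"]["connection"]["sqlite"]["path"]
    conn = sqlite3.connect(path)                             # 2. connect to the database
    gh = config["github"]
    raw = fetch(gh["owner"], gh["repo"])                     # 3. fetch data
    row = (date.today().isoformat(), raw["open_prs"], raw["open_issues"])  # 4. process
    conn.execute(
        "CREATE TABLE IF NOT EXISTS example_activity "
        "(day TEXT, open_prs INTEGER, open_issues INTEGER)"
    )
    conn.execute("INSERT INTO example_activity VALUES (?, ?, ?)", row)     # 5. store
    conn.commit()
    rows = conn.execute("SELECT open_prs, open_issues FROM example_activity").fetchall()
    conn.close()
    return rows
```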

GitHub Actions

The GitHub Actions workflow runs automatically on a daily schedule and can also be triggered manually.

Setting up GitHub Secrets

For the GitHub Actions workflow to run successfully, you need to set up the following secrets in your GitHub repository:

  1. Go to your repository on GitHub
  2. Navigate to Settings > Secrets and variables > Actions
  3. Add the following secrets:
    • GITHUB_TOKEN: Your GitHub personal access token
    • MOTHERDUCK_TOKEN: Your MotherDuck token (in JWT format)
    • SLACK_API_TOKEN: Your Slack API token with the required scopes:
      • users:read (for member count)
      • channels:read (for channel list and member counts)
    • MATOMO_KEY: Your Matomo API token (optional)

Local Testing with Act

You can test the GitHub Actions workflow locally using act and the provided helper script:

  1. Install act following the instructions in their repository
  2. Run the helper script:
    ./test_workflow.sh
    

The script runs in quiet mode by default, suppressing irrelevant warnings and logs. If you want to see all output, use:

./test_workflow.sh --verbose

or

./test_workflow.sh -v

The script will:

  • Check for tokens in your shell environment
  • Fall back to tokens in your .env file if they exist
  • Validate that tokens are not placeholder values
  • Check if the MotherDuck token is in the correct JWT format
  • Create a .secrets file for act to use (simulating GitHub secrets)
  • Run the workflow using act with proper secret handling

If you don't have the tokens set up, the script will prompt you to add your actual tokens.
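
The token lookup order described above (shell environment first, then the .env file, then a placeholder check) can be sketched in Python as follows. This is not a port of test_workflow.sh. The function names are hypothetical, the .env file is assumed to contain simple KEY=value lines, and a value is treated as a placeholder when it starts with your_, following the template values shown in the Setup section.

```python
from typing import Mapping, Optional

# Assumption: placeholder values look like the Setup examples (your_github_token, ...).
PLACEHOLDER_PREFIX = "your_"


def parse_env_file(text: str) -> dict:
    """Parse simple KEY=value lines, ignoring blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")  # drop surrounding quotes
    return values


def resolve_token(
    name: str, environ: Mapping[str, str], env_file_values: Mapping[str, str]
) -> Optional[str]:
    """Prefer the shell environment, fall back to .env, reject placeholders."""
    value = environ.get(name) or env_file_values.get(name)
    if not value or value.startswith(PLACEHOLDER_PREFIX):
        return None
    return value
```

A caller would pass os.environ and the parsed contents of .env, then prompt the user for any token that resolves to None.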

Security Best Practices

Handling API Tokens

  • Never commit tokens to version control: The .env and .secrets files are included in .gitignore to prevent accidental commits
  • Use environment variables: Set tokens as environment variables rather than hardcoding them in files
  • Use GitHub Secrets: For GitHub Actions, always use repository secrets
  • Rotate tokens regularly: Replace tokens on a regular schedule, and rotate immediately if you suspect a token has been exposed
  • Limit token permissions: Use tokens with the minimum required permissions

Troubleshooting

Database Connection Errors

If you're having trouble connecting to the database, check:

  1. That you've set the correct database type in your configuration
  2. That all required connection parameters are provided
  3. That your database credentials are correct
  4. That your database is accessible from your current network
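
The first two checks can be automated. The sketch below reports which connection parameters are absent or blank for the configured database type. The function name and the REQUIRED_PARAMS table are assumptions derived from the example configuration in this README, not from the project's database.py.

```python
# Required keys per database type, taken from the example config.yaml.
# This table is an assumption; the real database layer may require
# different parameters.
REQUIRED_PARAMS = {
    "motherduck": ["database"],
    "postgresql": ["host", "port", "database", "username", "password"],
    "sqlite": ["path"],
    "rds": ["host", "port", "database", "username", "password"],
}


def missing_connection_params(config: dict) -> list:
    """Return the required parameters that are absent or blank."""
    db = config.get("database", {})
    db_type = db.get("type")
    if db_type not in REQUIRED_PARAMS:
        raise ValueError(f"Unsupported database type: {db_type!r}")
    settings = db.get("connection", {}).get(db_type) or {}
    return [key for key in REQUIRED_PARAMS[db_type] if settings.get(key) in (None, "")]
```

With the example configuration and type set to postgresql, the blank password would be reported, which is a reminder to fill it in before connecting.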

MotherDuck Authentication Errors

If you see an error like:

Error: Invalid Input Error: Initialization function "motherduck_init" ... Request failed: Your request is not authenticated. Please check your MotherDuck token. (Jwt is not in the form of Header.Payload.Signature with two dots and 3 sections...)

This means your MotherDuck token is not in the correct JWT format. Make sure:

  1. You're using the correct token from the MotherDuck dashboard
  2. The token contains two dots (e.g., eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c)
  3. The token is properly set in your environment or .env file

Adding New Scrapers

To add a new scraper:

  1. Create a new Python script in the scripts/ directory
  2. Add any new dependencies to requirements.txt
  3. Update the .github/workflows/stats_scraper.yml file to include your new script
  4. Follow the pattern of existing scrapers:
    • Use the configuration system for settings
    • Use the database abstraction layer for data storage
    • Use the appropriate API client for data fetching
    • Include proper error handling
    • Add an "updated_at" column to track when records are modified
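
As an illustration of the last point, the sketch below stores a record together with an updated_at timestamp and refreshes that timestamp when the record is written again. It uses Python's built-in sqlite3 module directly instead of the project's database abstraction layer, and the table name, columns, and save_metric function are hypothetical.

```python
import sqlite3
from datetime import datetime, timezone


def save_metric(conn, name, value, now=None):
    """Insert or replace one record, stamping it with the write time (UTC)."""
    now = now or datetime.now(timezone.utc)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS example_metrics "
        "(name TEXT PRIMARY KEY, value INTEGER, updated_at TEXT)"
    )
    # INSERT OR REPLACE overwrites the row that has the same primary key,
    # so updated_at always reflects the most recent write.
    conn.execute(
        "INSERT OR REPLACE INTO example_metrics (name, value, updated_at) "
        "VALUES (?, ?, ?)",
        (name, value, now.isoformat()),
    )
    conn.commit()
```

The optional now parameter exists only to make the timestamp deterministic in tests; a scraper would normally omit it.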

Project Structure

  • config.py: Central configuration system
  • database.py: Database abstraction layer
  • github_client.py: GitHub API client
  • slack_client.py: Slack API client
  • matomo_client.py: Matomo API client
  • utils.py: Utility functions
  • scripts/: Individual scraper scripts
  • config.yaml.example: Example configuration file
  • .env.example: Example environment variables file
  • .github/workflows/: GitHub Actions workflow definitions

Development

Code Style and Linting

This project uses the following tools to maintain code quality:

  • flake8: For code linting and style checking
  • black: For code formatting
  • isort: For import sorting

Setup

Install the development dependencies:

pip install -r requirements.txt

Running Linters

To check your code for style issues:

./lint.sh

To automatically format your code:

./format.sh

Configuration Files

  • .flake8: Configuration for flake8
  • pyproject.toml: Configuration for black and isort

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
