This project automatically collects and stores statistics about GitHub repositories, Slack workspaces, and Matomo analytics data.
- Scrapes GitHub data:
  - Star ranking data
  - Repository visitor statistics
  - First-time PR contributors
  - First-time issue creators
  - Repository activity metrics (open PRs, open issues, open discussions)
  - Repository events (all GitHub events for the repository)
  - Repository releases (release information including tags, assets, and SHA)
  - Open issues analysis (detailed statistics about open issues)
  - Open PRs analysis (detailed statistics about open pull requests)
- Slack workspace statistics:
  - Member count tracking
  - Active channel count
  - Channel-specific statistics (member counts per channel)
- Matomo analytics data:
  - Visits and unique visitors
  - Visitor map data by country/region
  - Top pages by visits and engagement metrics
- Supports multiple database backends:
  - MotherDuck (default)
  - PostgreSQL
  - SQLite
  - Amazon RDS
- Configurable via YAML configuration file or environment variables
- Runs automatically via GitHub Actions on a daily schedule
- Python 3.10+
- GitHub API token
- Database credentials (MotherDuck token by default)
- Slack API token (for Slack workspace statistics) with the following scopes:
  - `users:read` (for member count)
  - `channels:read` (for channel list and member counts)
- Matomo API token (optional, for Matomo analytics)
- Clone this repository
- Initialize git repository (if not already done):

  ```bash
  git init
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Create a configuration file:

  ```bash
  cp config.yaml.example config.yaml
  ```

  Edit the `config.yaml` file to configure your target repository and database settings.
- Set up environment variables for API tokens:

  ```bash
  export GITHUB_TOKEN=your_github_token
  export MOTHERDUCK_TOKEN=your_motherduck_token
  export SLACK_API_TOKEN=your_slack_api_token
  export MATOMO_KEY=your_matomo_key
  ```

  Alternatively, you can create a `.env` file by copying the template:

  ```bash
  cp .env.example .env
  ```

  Then edit the `.env` file with your actual tokens.

**Important note about the MotherDuck token:** The MotherDuck token must be in JWT format, which contains two dots separating three sections (Header.Payload.Signature). If you're getting authentication errors, check that your token is in the correct format.
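Before running any scraper, you could sanity-check the token's shape yourself. This is a minimal sketch (the function name is illustrative, not part of the project) that mirrors the Header.Payload.Signature rule described above:

```python
def looks_like_jwt(token: str) -> bool:
    """Rough structural check: a JWT has exactly three non-empty,
    dot-separated sections (Header.Payload.Signature)."""
    parts = token.split(".")
    return len(parts) == 3 and all(parts)

# A well-formed token passes; a plain API key or truncated token does not.
print(looks_like_jwt("aaa.bbb.ccc"))  # True
print(looks_like_jwt("not-a-jwt"))   # False
```

Note this only checks structure, not validity — a token can have the right shape and still be expired or revoked.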
The project can be configured using a `config.yaml` file or environment variables. The configuration file allows you to specify:
- Target GitHub repository (owner and repo name)
- Logging level and format
- Database type and connection details
- Table names
- Matomo site ID and URL
- Slack workspace ID
Example configuration:

```yaml
# Logging configuration
logging:
  level: "ERROR"  # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# GitHub configuration
github:
  owner: "apache"
  repo: "superset"
  api_url: "https://api.github.com"

# Database configuration
database:
  type: "motherduck"  # Options: motherduck, postgresql, sqlite, rds
  connection:
    motherduck:
      database: "superset_stats"
    postgresql:
      host: "localhost"
      port: 5432
      database: "superset_stats"
      username: "postgres"
      password: ""
    sqlite:
      path: "superset_stats.db"
    rds:
      host: "your-rds-instance.amazonaws.com"
      port: 5432
      database: "superset_stats"
      username: "admin"
      password: ""
```
Environment variables can be used to override configuration values:
- `LOG_LEVEL`: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
- `GITHUB_OWNER`: GitHub repository owner
- `GITHUB_REPO`: GitHub repository name
- `DATABASE_TYPE`: Database type (motherduck, postgresql, sqlite, rds)
- `MOTHERDUCK_DATABASE`: MotherDuck database name
- `MATOMO_BASE_URL`: Matomo API base URL
- `MATOMO_SITE_ID`: Matomo site ID
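The usual pattern for such overrides is "environment variable wins, config file is the fallback." The project's actual `config.py` may implement this differently; the sketch below (with a hypothetical `resolved_setting` helper) just illustrates the precedence:

```python
import os

def resolved_setting(env_name, config, *keys):
    """Return the environment variable if set; otherwise walk the
    nested config dict using the given keys (illustrative helper)."""
    value = os.environ.get(env_name)
    if value is not None:
        return value
    for key in keys:
        config = config[key]
    return config

config = {"github": {"owner": "apache", "repo": "superset"}}

# With GITHUB_OWNER set, the environment wins:
os.environ["GITHUB_OWNER"] = "example-org"
print(resolved_setting("GITHUB_OWNER", config, "github", "owner"))  # example-org

# With no GITHUB_REPO set, config.yaml's value is used:
print(resolved_setting("GITHUB_REPO", config, "github", "repo"))    # superset
```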
You can run each scraper individually:
```bash
# Star ranking scraper
python scripts/star_ranking.py

# Repository visitors scraper
python scripts/repo_visitors.py

# New contributors scraper
python scripts/new_contributors.py

# New issue creators scraper
python scripts/new_issue_creators.py

# Repository activity scraper
python scripts/repo_activity.py

# Repository events scraper
python scripts/repo_events.py

# Repository releases scraper
python scripts/repo_releases.py

# Open issues analysis scraper
python scripts/open_issues.py

# Open PRs analysis scraper
python scripts/open_prs.py

# Slack workspace stats scraper
python scripts/slack_workspace_stats.py

# Slack channel stats scraper
python scripts/slack_channel_stats.py

# Matomo analytics scraper
python scripts/matomo_analytics.py

# Matomo visitor map scraper
python scripts/matomo_visitor_map.py

# Matomo top pages scraper
python scripts/matomo_top_pages.py

# Community calendar scraper
python scripts/community_calendar.py

# Kapa activity scraper
python scripts/kapa_activity.py
```
Each script will:
- Load configuration from config.yaml and environment variables
- Connect to the configured database
- Fetch data from the appropriate API
- Process and analyze the data
- Store the results in the database
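The shared lifecycle above can be sketched as a single function. This is an illustrative outline, not the project's actual code — `run_scraper` and its parameters are hypothetical, and the real scrapers wire in the config system, API clients, and database layer instead of the toy stand-ins used here:

```python
from datetime import datetime, timezone

def run_scraper(fetch, process, store, config):
    """Sketch of the lifecycle every scraper shares:
    fetch -> process -> timestamp -> store."""
    raw = fetch(config)                 # fetch data from the appropriate API
    rows = process(raw)                 # process and analyze the data
    now = datetime.now(timezone.utc).isoformat()
    stamped = [dict(r, updated_at=now) for r in rows]
    store(stamped)                      # write results to the database
    return stamped

# Toy usage with in-memory stand-ins for the API client and database:
stored = []
run_scraper(
    fetch=lambda cfg: [{"stars": 60000}],
    process=lambda raw: raw,
    store=stored.extend,
    config={},
)
```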
The GitHub Actions workflow runs automatically on a daily schedule and can also be triggered manually.
For the GitHub Actions workflow to run successfully, you need to set up the following secrets in your GitHub repository:
- Go to your repository on GitHub
- Navigate to Settings > Secrets and variables > Actions
- Add the following secrets:
  - `GITHUB_TOKEN`: Your GitHub personal access token
  - `MOTHERDUCK_TOKEN`: Your MotherDuck token (in JWT format)
  - `SLACK_API_TOKEN`: Your Slack API token with the required scopes: `users:read` (for member count) and `channels:read` (for channel list)
  - `MATOMO_KEY`: Your Matomo API token (optional)
You can test the GitHub Actions workflow locally using act and the provided helper script:
- Install act following the instructions in their repository
- Run the helper script:

  ```bash
  ./test_workflow.sh
  ```

The script runs in quiet mode by default, suppressing irrelevant warnings and logs. If you want to see all output, use:

```bash
./test_workflow.sh --verbose
```

or

```bash
./test_workflow.sh -v
```
The script will:
- Check for tokens in your shell environment
- Fall back to tokens in your .env file if they exist
- Validate that tokens are not placeholder values
- Check if the MotherDuck token is in the correct JWT format
- Create a .secrets file for act to use (simulating GitHub secrets)
- Run the workflow using act with proper secret handling
If you don't have the tokens set up, the script will prompt you to add your actual tokens.
- Never commit tokens to version control: the `.env` and `.secrets` files are included in `.gitignore` to prevent accidental commits
- Use environment variables: set tokens as environment variables rather than hardcoding them in files
- Use GitHub Secrets: For GitHub Actions, always use repository secrets
- Rotate tokens regularly: If you suspect a token has been exposed, rotate it immediately
- Limit token permissions: Use tokens with the minimum required permissions
If you're having trouble connecting to the database, check:
- That you've set the correct database type in your configuration
- That all required connection parameters are provided
- That your database credentials are correct
- That your database is accessible from your current network
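For the SQLite backend (the simplest of the four), you can probe connectivity directly before blaming the scrapers. This is a minimal sketch; the function name is illustrative, and the other backends would need their own drivers:

```python
import sqlite3

def check_sqlite(path):
    """Return True if we can open the database and run a trivial query."""
    try:
        conn = sqlite3.connect(path)
        conn.execute("SELECT 1")
        conn.close()
        return True
    except sqlite3.Error:
        return False

print(check_sqlite(":memory:"))  # True when sqlite itself is usable
```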
If you see an error like:

```
Error: Invalid Input Error: Initialization function "motherduck_init" ... Request failed: Your request is not authenticated. Please check your MotherDuck token. (Jwt is not in the form of Header.Payload.Signature with two dots and 3 sections...)
```
This means your MotherDuck token is not in the correct JWT format. Make sure:
- You're using the correct token from the MotherDuck dashboard
- The token contains two dots (e.g., `eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c`)
- The token is properly set in your environment or `.env` file
To add a new scraper:
- Create a new Python script in the `scripts/` directory
- Add any new dependencies to `requirements.txt`
- Update the `.github/workflows/stats_scraper.yml` file to include your new script
- Follow the pattern of existing scrapers:
  - Use the configuration system for settings
  - Use the database abstraction layer for data storage
  - Use the appropriate API client for data fetching
  - Include proper error handling
  - Add an "updated_at" column to track when records are modified
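A new scraper following that pattern might look roughly like this. Everything here is a hypothetical skeleton — the module names mirror the project layout described below, but the exact APIs of `config.py`, the clients, and `database.py` are assumptions:

```python
"""Hypothetical skeleton for a new scripts/my_new_scraper.py."""
from datetime import datetime, timezone

def main():
    # 1. Load settings via the configuration system (config.py)
    config = {"github": {"owner": "apache", "repo": "superset"}}

    # 2. Fetch data via the appropriate API client
    #    (github_client.py / slack_client.py / matomo_client.py)
    records = [{"metric": "example", "value": 1}]

    # 3. Stamp each record so modifications can be tracked
    now = datetime.now(timezone.utc).isoformat()
    rows = [dict(r, updated_at=now) for r in records]

    # 4. Store the rows via the database abstraction layer (database.py),
    #    wrapped in proper error handling in a real scraper
    return rows

if __name__ == "__main__":
    main()
```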
- `config.py`: Central configuration system
- `database.py`: Database abstraction layer
- `github_client.py`: GitHub API client
- `slack_client.py`: Slack API client
- `matomo_client.py`: Matomo API client
- `utils.py`: Utility functions
- `scripts/`: Individual scraper scripts
- `config.yaml.example`: Example configuration file
- `.env.example`: Example environment variables file
- `.github/workflows/`: GitHub Actions workflow definitions
This project uses the following tools to maintain code quality:
- flake8: For code linting and style checking
- black: For code formatting
- isort: For import sorting
Install the development dependencies:
```bash
pip install -r requirements.txt
```
To check your code for style issues:

```bash
./lint.sh
```

To automatically format your code:

```bash
./format.sh
```
- `.flake8`: Configuration for flake8
- `pyproject.toml`: Configuration for black and isort
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.