Stats Scraper

This project automatically collects and stores statistics about GitHub repositories, Slack workspaces, and Matomo analytics data.

Features

Scrapes GitHub data:
- Star ranking data
- Repository visitor statistics
- First-time PR contributors
- First-time issue creators
- Repository activity metrics (open PRs, open issues, open discussions)
- Repository events (all GitHub events for the repository)
- Repository releases (release information including tags, assets, and SHA)
- Open issues analysis (detailed statistics about open issues)
- Open PRs analysis (detailed statistics about open pull requests)
Slack workspace statistics:
- Member count tracking
- Active channel count
- Channel-specific statistics (member counts per channel)
Matomo analytics data:
- Visits and unique visitors
- Visitor map data by country/region
- Top pages by visits and engagement metrics
Supports multiple database backends:
- MotherDuck (default)
- PostgreSQL
- SQLite
- Amazon RDS
Configurable via YAML configuration file or environment variables
Runs automatically via GitHub Actions on a daily schedule

Requirements

Python 3.10+
GitHub API token
Database credentials (MotherDuck token by default)
Slack API token (for Slack workspace statistics) with the following scopes:
- users:read (for member count)
- channels:read (for channel list and member counts)
Matomo API token (optional, for Matomo analytics)

Setup

Clone this repository
Initialize a git repository (skip this if you cloned the repository, since a clone is already initialized):
```
git init
```
Install dependencies:
```
pip install -r requirements.txt
```
Create a configuration file:
```
cp config.yaml.example config.yaml
```
Edit the config.yaml file to configure your target repository and database settings.
Set up environment variables for API tokens:
```
export GITHUB_TOKEN=your_github_token
export MOTHERDUCK_TOKEN=your_motherduck_token
export SLACK_API_TOKEN=your_slack_api_token
export MATOMO_KEY=your_matomo_key
```
Alternatively, you can create a .env file by copying the template:
```
cp .env.example .env
```
Then edit the .env file with your actual tokens.

Important Note about MotherDuck Token: The MotherDuck token must be in JWT format, which contains two dots separating three sections (Header.Payload.Signature). If you're getting authentication errors, check that your token is in the correct format.
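
The format rule above can be checked before a scraper ever contacts MotherDuck. The following is a minimal sketch, not part of this project: the function name `looks_like_jwt` is hypothetical, and it only checks the shape of the token (three non-empty sections separated by two dots), not whether the token is valid.

```python
import os


def looks_like_jwt(token: str) -> bool:
    """Return True if the token has three non-empty, dot-separated sections."""
    parts = token.strip().split(".")
    return len(parts) == 3 and all(parts)


if __name__ == "__main__":
    token = os.environ.get("MOTHERDUCK_TOKEN", "")
    # Report only the verdict; never print the token itself.
    print("JWT-shaped" if looks_like_jwt(token) else "not JWT-shaped")
```

A token that passes this check can still be rejected by MotherDuck (for example if it has expired), so treat a passing result as "the format is plausible" rather than "the token works".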

Configuration

The project can be configured using a config.yaml file or environment variables. The configuration file allows you to specify:

Target GitHub repository (owner and repo name)
Logging level and format
Database type and connection details
Table names
Matomo site ID and URL
Slack workspace ID

Example configuration:

# Logging configuration
logging:
  level: "ERROR"  # Options: DEBUG, INFO, WARNING, ERROR, CRITICAL
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# GitHub configuration
github:
  owner: "apache"
  repo: "superset"
  api_url: "https://api.github.com"

# Database configuration
database:
  type: "motherduck"  # Options: motherduck, postgresql, sqlite, rds
  connection:
    motherduck:
      database: "superset_stats"
    postgresql:
      host: "localhost"
      port: 5432
      database: "superset_stats"
      username: "postgres"
      password: ""
    sqlite:
      path: "superset_stats.db"
    rds:
      host: "your-rds-instance.amazonaws.com"
      port: 5432
      database: "superset_stats"
      username: "admin"
      password: ""
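
To illustrate how the `database.type` value selects one entry under `database.connection`, here is a minimal sketch that turns the example configuration into a connection string. It is not the project's `database.py`: the function name `connection_target` is hypothetical, and treating `rds` as a PostgreSQL-compatible instance is an assumption based only on the port 5432 in the example.

```python
from urllib.parse import quote


def connection_target(config: dict) -> str:
    """Build a connection string for the backend named in database.type."""
    db = config["database"]
    db_type = db["type"]
    settings = db["connection"][db_type]

    if db_type == "motherduck":
        # DuckDB reaches MotherDuck databases through the "md:" prefix.
        return f"md:{settings['database']}"
    if db_type == "sqlite":
        return settings["path"]
    if db_type in ("postgresql", "rds"):
        # Assumption: the RDS instance speaks the PostgreSQL protocol.
        user = quote(settings["username"], safe="")
        password = quote(settings["password"], safe="")
        return (
            f"postgresql://{user}:{password}@{settings['host']}"
            f":{settings['port']}/{settings['database']}"
        )
    raise ValueError(f"Unsupported database type: {db_type}")


example = {
    "database": {
        "type": "motherduck",
        "connection": {"motherduck": {"database": "superset_stats"}},
    }
}
print(connection_target(example))  # md:superset_stats
```

Only the entry matching `database.type` is read, so the other backends can stay in the file with empty passwords without affecting the active connection.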

Environment variables can be used to override configuration values:

LOG_LEVEL: Logging level (DEBUG, INFO, WARNING, ERROR, CRITICAL)
GITHUB_OWNER: GitHub repository owner
GITHUB_REPO: GitHub repository name
DATABASE_TYPE: Database type (motherduck, postgresql, sqlite, rds)
MOTHERDUCK_DATABASE: MotherDuck database name
MATOMO_BASE_URL: Matomo API base URL
MATOMO_SITE_ID: Matomo site ID
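
The override behaviour can be pictured as a small mapping from variable names to positions in the loaded configuration. The sketch below is an assumption about how such a layer might work, not the code in `config.py`: `ENV_OVERRIDES` and `apply_env_overrides` are hypothetical names, and the Matomo variables are left out because their location in `config.yaml` is not shown in the example above.

```python
import copy
import os

# Environment variable -> path of keys inside the loaded configuration.
ENV_OVERRIDES = {
    "LOG_LEVEL": ("logging", "level"),
    "GITHUB_OWNER": ("github", "owner"),
    "GITHUB_REPO": ("github", "repo"),
    "DATABASE_TYPE": ("database", "type"),
    "MOTHERDUCK_DATABASE": ("database", "connection", "motherduck", "database"),
}


def apply_env_overrides(config: dict, environ=None) -> dict:
    """Return a copy of config with any set environment variables applied."""
    environ = os.environ if environ is None else environ
    result = copy.deepcopy(config)
    for name, path in ENV_OVERRIDES.items():
        value = environ.get(name)
        if not value:
            continue  # unset or empty: keep the value from config.yaml
        node = result
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return result
```

Returning a copy keeps the values read from the file untouched, which makes it easy to log or compare what came from `config.yaml` and what came from the environment.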

Running Locally

You can run each scraper individually:

# Star ranking scraper
python scripts/star_ranking.py

# Repository visitors scraper
python scripts/repo_visitors.py

# New contributors scraper
python scripts/new_contributors.py

# New issue creators scraper
python scripts/new_issue_creators.py

# Repository activity scraper
python scripts/repo_activity.py

# Repository events scraper
python scripts/repo_events.py

# Repository releases scraper
python scripts/repo_releases.py

# Open issues analysis scraper
python scripts/open_issues.py

# Open PRs analysis scraper
python scripts/open_prs.py

# Slack workspace stats scraper
python scripts/slack_workspace_stats.py

# Slack channel stats scraper
python scripts/slack_channel_stats.py

# Matomo analytics scraper
python scripts/matomo_analytics.py

# Matomo visitor map scraper
python scripts/matomo_visitor_map.py

# Matomo top pages scraper
python scripts/matomo_top_pages.py

# Community calendar scraper
python scripts/community_calendar.py

# Kapa activity scraper
python scripts/kapa_activity.py

Each script will:

Load configuration from config.yaml and environment variables
Connect to the configured database
Fetch data from the appropriate API
Process and analyze the data
Store the results in the database
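
The five steps above can be sketched as a single function. This is an illustration only, not one of the scripts in `scripts/`: `fetch_star_count`, `run`, and the table `example_star_counts` are hypothetical names, the fetch step is a stub that returns a placeholder instead of calling the GitHub API, and SQLite stands in for whichever backend is configured.

```python
import sqlite3
from datetime import datetime, timezone


def fetch_star_count(owner: str, repo: str) -> int:
    """Stand-in for an API call; a real scraper would query the GitHub API."""
    return 0  # placeholder value, not real data


def run(config: dict, fetch=fetch_star_count) -> int:
    """Run one scrape and return the number of rows stored so far."""
    # 1. Read settings from the already loaded configuration.
    owner = config["github"]["owner"]
    repo = config["github"]["repo"]
    path = config["database"]["connection"]["sqlite"]["path"]

    # 2. Connect to the configured database.
    conn = sqlite3.connect(path)
    try:
        # 3. Fetch data from the API.
        stars = fetch(owner, repo)
        # 4. Process it into a row.
        row = (f"{owner}/{repo}", stars, datetime.now(timezone.utc).isoformat())
        # 5. Store the result.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS example_star_counts "
            "(repo TEXT, stars INTEGER, collected_at TEXT)"
        )
        conn.execute("INSERT INTO example_star_counts VALUES (?, ?, ?)", row)
        conn.commit()
        return conn.execute("SELECT COUNT(*) FROM example_star_counts").fetchone()[0]
    finally:
        conn.close()
```

Passing the fetch function as a parameter keeps the database steps testable without network access or an API token.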

GitHub Actions

The GitHub Actions workflow runs automatically on a daily schedule and can also be triggered manually.

Setting up GitHub Secrets

For the GitHub Actions workflow to run successfully, you need to set up the following secrets in your GitHub repository:

Go to your repository on GitHub
Navigate to Settings > Secrets and variables > Actions
Add the following secrets:
- GITHUB_TOKEN: Your GitHub personal access token
- MOTHERDUCK_TOKEN: Your MotherDuck token (in JWT format)
- SLACK_API_TOKEN: Your Slack API token with the required scopes:
  - users:read (for member count)
  - channels:read (for channel list and member counts)
- MATOMO_KEY: Your Matomo API token (optional)

Local Testing with Act

You can test the GitHub Actions workflow locally using act and the provided helper script:

Install act following the instructions in its repository
Run the helper script:
```
./test_workflow.sh
```

The script runs in quiet mode by default, suppressing irrelevant warnings and logs. If you want to see all output, use:

./test_workflow.sh --verbose

or

./test_workflow.sh -v

The script will:

Check for tokens in your shell environment
Fall back to tokens in your .env file if they exist
Validate that tokens are not placeholder values
Check if the MotherDuck token is in the correct JWT format
Create a .secrets file for act to use (simulating GitHub secrets)
Run the workflow using act with proper secret handling

If you don't have the tokens set up, the script will prompt you to add your actual tokens.
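
The lookup order described above (shell environment first, then `.env`, then a placeholder check) can be expressed in a few lines. The sketch below is a Python rendering of that logic for illustration, not a port of `test_workflow.sh`: `read_env_file` and `resolve_token` are hypothetical names, and treating any value that starts with `your_` as a placeholder is an assumption based on the example values in the Setup section.

```python
import os
from pathlib import Path

PLACEHOLDER_PREFIX = "your_"  # matches values such as "your_github_token"


def read_env_file(path) -> dict:
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    file = Path(path)
    if not file.exists():
        return values
    for line in file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_token(name, environ=None, env_file=".env"):
    """Prefer the shell environment, fall back to .env, reject placeholders."""
    environ = os.environ if environ is None else environ
    value = environ.get(name) or read_env_file(env_file).get(name, "")
    if not value or value.startswith(PLACEHOLDER_PREFIX):
        return None
    return value
```

A return value of `None` covers both a missing token and a placeholder that was never replaced, which is the point at which the helper script asks for real tokens.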

Security Best Practices

Handling API Tokens

Never commit tokens to version control: The .env and .secrets files are included in .gitignore to prevent accidental commits
Use environment variables: Set tokens as environment variables rather than hardcoding them in files
Use GitHub Secrets: For GitHub Actions, always use repository secrets
Rotate tokens regularly: Rotate on a schedule, and immediately if you suspect a token has been exposed
Limit token permissions: Use tokens with the minimum required permissions

Troubleshooting

Database Connection Errors

If you're having trouble connecting to the database, check:

That you've set the correct database type in your configuration
That all required connection parameters are provided
That your database credentials are correct
That your database is accessible from your current network
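
For the last check, a plain TCP connection attempt is often enough to tell a network problem apart from a credentials problem. This is a minimal sketch using only the standard library; the helper name `can_reach` is hypothetical, and it applies to host-based backends such as PostgreSQL or RDS, not to a local SQLite file.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False


if __name__ == "__main__":
    # Host and port taken from the example PostgreSQL configuration.
    print(can_reach("localhost", 5432))
```

If this returns `False`, look at firewalls, VPNs, or security groups before revisiting usernames and passwords. If it returns `True` and the connection still fails, the credentials or database name are the more likely cause.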

MotherDuck Authentication Errors

If you see an error like:

Error: Invalid Input Error: Initialization function "motherduck_init" ... Request failed: Your request is not authenticated. Please check your MotherDuck token. (Jwt is not in the form of Header.Payload.Signature with two dots and 3 sections...)

This means your MotherDuck token is not in the correct JWT format. Make sure:

You're using the correct token from the MotherDuck dashboard
The token contains two dots (e.g., eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c)
The token is properly set in your environment or .env file

Adding New Scrapers

To add a new scraper:

Create a new Python script in the scripts/ directory
Add any new dependencies to requirements.txt
Update the .github/workflows/stats_scraper.yml file to include your new script
Follow the pattern of existing scrapers:
- Use the configuration system for settings
- Use the database abstraction layer for data storage
- Use the appropriate API client for data fetching
- Include proper error handling
- Add an "updated_at" column to track when records are modified
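
The last two points of the pattern (error handling and an "updated_at" column) can be sketched together. The example below is an assumption about one reasonable shape, not the project's database abstraction layer: `upsert_metric` and the table `example_metrics` are hypothetical, SQLite is used so the snippet runs without credentials, and the `ON CONFLICT` clause needs SQLite 3.24 or newer.

```python
import sqlite3
from datetime import datetime, timezone


def upsert_metric(conn, name: str, value: int) -> None:
    """Insert or update one metric row and refresh its updated_at timestamp."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS example_metrics ("
        "name TEXT PRIMARY KEY, value INTEGER, updated_at TEXT)"
    )
    try:
        conn.execute(
            "INSERT INTO example_metrics (name, value, updated_at) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(name) DO UPDATE SET "
            "value = excluded.value, updated_at = excluded.updated_at",
            (name, value, now),
        )
        conn.commit()
    except sqlite3.Error:
        # Leave the table unchanged if the write fails, then surface the error.
        conn.rollback()
        raise
```

Writing the timestamp in the same statement as the value means a re-run of the scraper updates the existing row instead of creating a duplicate, and "updated_at" always reflects the most recent successful write.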

Project Structure

config.py: Central configuration system
database.py: Database abstraction layer
github_client.py: GitHub API client
slack_client.py: Slack API client
matomo_client.py: Matomo API client
utils.py: Utility functions
scripts/: Individual scraper scripts
config.yaml.example: Example configuration file
.env.example: Example environment variables file
.github/workflows/: GitHub Actions workflow definitions

Development

Code Style and Linting

This project uses the following tools to maintain code quality:

flake8: For code linting and style checking
black: For code formatting
isort: For import sorting

Setup

Install the development dependencies:

pip install -r requirements.txt

Running Linters

To check your code for style issues:

./lint.sh

To automatically format your code:

./format.sh

Configuration Files

.flake8: Configuration for flake8
pyproject.toml: Configuration for black and isort

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
