You will need Python 3.10 or later, MongoDB 6.0.x, and preferably a GitHub account[^1].
First install the required libraries via the Python package installer:
```bash
pip install -r requirements.txt
```
Next, make sure that MongoDB is up and running:
```bash
lsof | grep mongod
```
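If `mongod` does not show up, a quick connectivity check with the MongoDB shell can help (a minimal sketch, assuming MongoDB listens on the default `localhost:27017`):

```bash
# Ping the local MongoDB server; a reachable instance replies with { ok: 1 }
mongosh --eval 'db.runCommand({ ping: 1 })'
```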
With the packages installed and database running, you can start the mining:
```bash
python3 main.py --token {gh_pat} --target {target} {keyword}
```
Where:

- `gh_pat`: Your personal GitHub access token (`ghp_\w{36}`), with `repo` read privileges;
- `target`: The target endpoint (`commits`, `issues` or `pull-requests`);
- `keyword`: The term to search for.
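For example, to mine issues for a hypothetical keyword such as `copilot` (the token below is a dummy placeholder matching the `ghp_\w{36}` pattern):

```bash
python3 main.py --token ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx --target issues copilot
```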
Note that `--token` is an optional parameter; it can alternatively be supplied via the `GITHUB_TOKEN` environment variable.
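For instance, the same hypothetical run as above, but with the token taken from the environment instead of the command line:

```bash
export GITHUB_TOKEN=ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  # dummy placeholder token
python3 main.py --target commits copilot
```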
If you need to further configure the MongoDB host and port settings, you can use the `DATABASE_HOST` and `DATABASE_PORT` environment variables, respectively.
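For example, to point the crawler at a MongoDB instance on a non-default host and port (the values below are purely illustrative):

```bash
export DATABASE_HOST=10.0.0.5
export DATABASE_PORT=27018
python3 main.py --target pull-requests copilot
```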
Mined data is stored in a database whose name corresponds to the provided `keyword`, split across collections for each of the three target endpoints.
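As a quick sanity check on the mined data, you could list the per-endpoint collections and their document counts with `mongosh` (assuming the hypothetical keyword `copilot` and a local MongoDB on the default port):

```bash
# Print each collection of the "copilot" database together with its document count
mongosh copilot --quiet --eval 'db.getCollectionNames().forEach(c => print(c + ": " + db.getCollection(c).countDocuments()))'
```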
Assuming you have the latest versions of both Docker and Docker Compose, we provide a ready-made configuration to jump-start the mining:
```bash
docker-compose -f deployment/docker-compose.yml up gh-keyword-crawler-{target} -d
```
Substitute `{target}` with one of the aforementioned endpoints.
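For example, to bring up only the issues crawler:

```bash
docker-compose -f deployment/docker-compose.yml up gh-keyword-crawler-issues -d
```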
Note that there are two methods of configuring your own GitHub access token:
- Creating a `.env` file in `deployment` with a `GITHUB_TOKEN` entry:

  ```
  GITHUB_TOKEN=#your token goes here
  ```

- Creating a `docker-compose.override.yml` file in `deployment`, for example:

  ```yaml
  version: '3.9'

  services:
    gh-keyword-crawler-commits:
      environment:
        GITHUB_TOKEN: # A token for commits
    gh-keyword-crawler-issues:
      environment:
        GITHUB_TOKEN: # A token for issues
    gh-keyword-crawler-pull-requests:
      environment:
        GITHUB_TOKEN: # A token for pull-requests
  ```
If you employ the second approach, you must also provide the override file as an argument:
```bash
docker-compose -f deployment/docker-compose.yml -f deployment/docker-compose.override.yml up gh-keyword-crawler-{target} -d
```
The advantage of this approach is that it allows you to define separate access tokens for distinct endpoints and run their miners in parallel:
```bash
docker-compose -f deployment/docker-compose.yml -f deployment/docker-compose.override.yml up -d
```
Regardless of how you deploy, the database data will be kept in the `gh-keyword-crawler-data` volume.
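If you need to locate that data on the host, standard Docker tooling will do; note that Docker Compose may prefix the volume name with the project name:

```bash
# Resolve the (possibly prefixed) volume name and show its mountpoint on the host
docker volume inspect $(docker volume ls -q --filter name=gh-keyword-crawler-data)
```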
Logs from a running crawler are bound to the `deployment/logs/{target}` directory.
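For example, to peek at the output of the issues crawler (the exact file names inside the directory are project-specific, hence the plain listing):

```bash
ls -l deployment/logs/issues
```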
[^1]: While the account is not mandatory, the mining will be performed significantly faster if a personal access token (PAT) is provided.