GH Elephant is a tool to download GitHub activity data from the GitHub Archive and store it in a PostgreSQL database for further analysis. For the full reference and a concrete use case of GH Elephant, see the Master's thesis for which GH Elephant was created.
- run `pip3 install -r requirements.txt`
- complete `variables.py` with your database information and a path where to temporarily store the `json` and `csv` files
- make sure you have psql running with an empty database as specified in `variables.py`
- make sure you have about 100 GB of free storage for the temporary files; if that's out of reach, make the queues in `manager.py` smaller
- run `./ghelephant.py` with the required options `-s` and `-e` specifying the start and end date for the downloads in the format `YYYY-MM-DD`
- run `./ghelephant.py` with option `-i` to create indices for faster queries
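The `-s` and `-e` values must parse as `YYYY-MM-DD`. Before kicking off a multi-hour download, it can be worth sanity-checking the dates; a minimal sketch of such a check (the `validate_date` helper is illustrative, not part of GH Elephant):

```python
from datetime import datetime

def validate_date(value: str) -> str:
    """Return value unchanged if it is a valid YYYY-MM-DD date, else raise ValueError."""
    datetime.strptime(value, "%Y-%m-%d")  # raises ValueError on malformed input
    return value

start, end = validate_date("2023-01-01"), validate_date("2023-01-31")
assert start <= end, "start date must not be after end date"
```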
If you want to add additional information like user data or get commit details, you can use the GitHub API directly
through GH Elephant to enrich your tables.
To do so, you first need to export a table in csv format with header, e.g. `copy (select actor_login, repo_name, sha, created_at from archive join commit on payload_id=push_id where type = 'PushEvent' limit 10) to '/my_path/table.csv' (format csv, header);`.
Then, you can use the following two commands to extend your table with user data or commit information in JSON form.
You should also create a GitHub Personal Access Token and provide it to GH Elephant with the `-t` flag.
- run `./ghelephant.py -u /my_path/table.csv` to add user information into the `csv` (requires the presence of the `actor_login` column)
- run `./ghelephant.py -c /my_path/table.csv` to add commit information into the `csv` (requires the presence of the `repo_name` and `sha` columns)
- run `./ghelephant.py -l /my_path/table.csv` to convert the locations added with the `-u` option into uniform country codes
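Conceptually, these enrichment commands stream the `csv`, query the GitHub API once per row, and append the response as a JSON column. A minimal sketch of that pattern, with a stubbed `fetch_user` standing in for the real API call (all names here are illustrative, not GH Elephant's internals):

```python
import csv
import io
import json

def fetch_user(login: str) -> dict:
    # Stub: the real tool would query https://api.github.com/users/<login>
    # using the token passed via -t. Here we fabricate a tiny record.
    return {"login": login, "type": "User"}

def add_user_column(src, dst) -> None:
    """Copy the csv from src to dst, appending a user_json column per row."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["user_json"])
    writer.writeheader()
    for row in reader:
        row["user_json"] = json.dumps(fetch_user(row["actor_login"]))
        writer.writerow(row)

# Usage on an in-memory csv with the required actor_login column:
src = io.StringIO("actor_login,repo_name\noctocat,octocat/Hello-World\n")
dst = io.StringIO()
add_user_column(src, dst)
print(dst.getvalue())
```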
If you want to clone some of the repos in your database, export them to a csv file with header (see the example above).
- run `./ghelephant.py -r /my_path/table.csv -o /path/to/folder` (requires the presence of the `repo_name` column)
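In essence, the `-r` option builds one clone target per `repo_name` row and fetches it into the output folder. A rough sketch of that idea (the URL scheme and naming are assumptions for illustration, not GH Elephant's actual code):

```python
import csv
import io

def clone_commands(src, dest_folder: str):
    """Yield one git clone command per repo_name row (sketch only, nothing is executed)."""
    for row in csv.DictReader(src):
        repo = row["repo_name"]
        # Flatten "owner/repo" into a unique folder name to avoid collisions.
        yield f"git clone https://github.com/{repo} {dest_folder}/{repo.replace('/', '_')}"

src = io.StringIO("repo_name\noctocat/Hello-World\n")
for cmd in clone_commands(src, "/path/to/folder"):
    print(cmd)
```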