
GitScraper

A "self-hosted", serverless web-scraper for GitHub profiles: a middleman between your site and GitHub's API.

[under construction]

Developers!

Running non-Firebase tasks

Simply run npm start in the functions directory. Of course, you'll have to run npm install once after cloning.
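For reference, a minimal sketch of what the scripts block of functions/package.json could look like (the script bodies here are assumptions based on the commands used in this README, not the project's actual configuration):

```json
{
  "scripts": {
    "build": "tsc",
    "start": "npm run build && node lib/index.js",
    "serve": "npm run build && firebase emulators:start --only functions"
  }
}
```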

Notes

  • We will use TypeScript.
  • Modularize your code; we will integrate it later.
  • All code is to be written within functions/src. Treat it as the root directory.
  • Ask if you're not sure.
  • DO NOT rename the functions directory.
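As an illustration of the "modularize" point, here is a hypothetical module under functions/src. The file path, the ProfileStats shape, and the follower pattern are all made up for this sketch; real profile markup will need a real selector:

```typescript
// functions/src/scrapers/profile.ts (hypothetical path)
// One scraper concern per module; integration into index.ts happens later.

export interface ProfileStats {
  username: string;
  followers: number;
}

// Pull a follower count out of scraped profile text. The pattern below is a
// stand-in; GitHub's actual markup will need a proper selector or parser.
export function parseFollowers(text: string): number {
  const match = text.match(/(\d+)\s*followers/i);
  return match ? parseInt(match[1], 10) : 0;
}
```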

This repository is "self-hosted": you will have to set up a Firebase account and deploy the functions on your own account. Here is how you can do so:

Create a Firebase project

First, you need to make a Firebase account and then create a new project. We do not need Google Analytics for this.

Then, click on the Functions tab and select "Get started".

Upgrade your account

To deploy and use functions, you need the Blaze plan. It's a pay-as-you-go plan, and the free quota is MASSIVE for this kind of small project, so it's effectively free, but you're free to scale however you want.

Firebase CLI

Install the Firebase CLI

npm i -g firebase-tools

Then, you must log in to the CLI

firebase login

Test locally

# > Fork the project

$ git clone <your fork url>
$ cd ./gitscraper
$ git remote add upstream https://github.com/HWTechClub/gitscraper.git

$ cd functions
$ firebase use --add
# > Select the project you created

# TO RUN EMULATOR
$ npm run serve

# TO RUN NODEJS MODE
$ npm start 

# TO CONTRIBUTE
$ git fetch upstream
$ git checkout -b <feature_branch> upstream/main
# > Make changes
# > Commit to feature branch
# > PR feature branch into upstream/main

Why?

The problem

The official GitHub API rate limits you to about 60 requests an hour for core and 20 for search. Furthermore, some data simply requires some API gymnastics to retrieve.

Generally, when you attempt real-time GitHub stats using the official API, you need more than one request to get all the information you would need for an appealing UI. For example, to display the latest repository, you first query the search API, take a URL from the response, then query that URL to get the languages used.
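To make that chain concrete, here is a sketch using the public REST endpoints (the user name is a placeholder and error handling is omitted):

```typescript
// Pure helper: build the search query for a user's most recently updated repo.
function buildSearchUrl(user: string): string {
  const q = encodeURIComponent(`user:${user}`);
  return `https://api.github.com/search/repositories?q=${q}&sort=updated&per_page=1`;
}

// Request 1 finds the latest repo; request 2 follows its languages_url.
// That is two rate-limited calls just to render one UI card.
async function latestRepoLanguages(user: string): Promise<Record<string, number>> {
  const search = await (await fetch(buildSearchUrl(user))).json();
  const repo = search.items[0]; // most recently updated repository
  const langsRes = await fetch(repo.languages_url);
  return (await langsRes.json()) as Record<string, number>;
}
```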

You'd also typically want to query information about more than one repo, so you can see how quickly the rate limit will be reached, especially if you refresh a couple of times or while developing your site. Once it is reached, the error response will crash your app or make your UI look ugly unless you provide fallback data.

Yeah, you can increase your limit by providing an API key, but you can't really hide a key on a static site.

Yes, the GraphQL API does exist and is better but do you really want to set up GraphQL for static sites? I don't. Besides, it's a cool little side project to spend a week on.

The Solution

This will use Firebase Cloud Functions to run a function every couple of hours (or whatever interval you choose) and scrape the contents of a GitHub profile, via either good ol' web scraping or the GitHub API itself. It will then store all the data as one or two documents in the Firebase Realtime Database.
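A minimal sketch of the write path, assuming the firebase-functions v1 scheduler. The schedule interval, the database path, and every field name below are assumptions, not the project's actual schema:

```typescript
// Shape of the document a scheduled scrape could store in Realtime Database.
interface ScrapedProfile {
  username: string;
  repoCount: number;
  topLanguages: string[];
  scrapedAt: number; // epoch millis of the scheduled run
}

// Pure step: fold raw scrape output into one storable document.
function toDocument(
  username: string,
  repoCount: number,
  topLanguages: string[],
  now: number
): ScrapedProfile {
  return { username, repoCount, topLanguages, scrapedAt: now };
}

// In functions/src/index.ts, a scheduled function would wrap this, e.g.:
//   export const scrape = functions.pubsub.schedule("every 2 hours").onRun(async () => {
//     const doc = toDocument(/* scraped values */ "octocat", 8, ["TypeScript"], Date.now());
//     await admin.database().ref(`profiles/${doc.username}`).set(doc);
//   });
```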

The user can then run another Cloud Function to fetch the data from the database.
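A sketch of that read path, again with hypothetical names (getProfile, the profiles/ path): an HTTPS function reads the stored document and returns it as JSON. The pure URL-parsing step is shown in full; the Firebase wiring is the commented part:

```typescript
// Map an incoming request URL to the Realtime Database path to read.
// e.g. "/getProfile?user=octocat" -> "profiles/octocat"
function profilePathFromUrl(url: string): string {
  const user = new URL(url, "http://localhost").searchParams.get("user") ?? "unknown";
  return `profiles/${user}`;
}

// In functions/src/index.ts:
//   export const getProfile = functions.https.onRequest(async (req, res) => {
//     const snap = await admin.database().ref(profilePathFromUrl(req.url)).once("value");
//     res.json(snap.val());
//   });
```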

