Web Scraping

Web scraping in Wikipedia Graph Mapper is handled by the script process_link.py, which is written in Python and powered by the Requests and Beautiful Soup libraries.

Table of Contents

- Receiving the Command
- Scraping Data off Wikipedia
- Processing all those Data

Receiving the Command

Upon receiving a user request, the function generate_lists is first called to handle and process user input.

import requests_toolbelt.adapters.appengine

def generate_lists(self_url, max_level, isMobileBrowser, search_mode):
    # patches Requests as it has compatibility issues with Google App Engine;
    # comment this out when testing on a development server
    requests_toolbelt.adapters.appengine.monkeypatch()

    # maximum search depth requested by the user
    global MAX_LEVEL
    MAX_LEVEL = int(max_level)

    # one of the three search settings (hyperlink selection criteria)
    global SEARCH_MODE
    SEARCH_MODE = int(search_mode)

    # clears the list of scraped entries for each new request made
    del entryList[:]

    # scrape() crawls the pages recursively; proc_data() turns the scraped
    # data into the node and link lists that are returned to the caller
    nodeList, linkList = proc_data(scrape(self_url, currentLevel = 0), isMobileBrowser)
    return nodeList, linkList

The line calling monkeypatch() should only be used when deploying the application on Google App Engine. Comment it out when testing on your development server, as it may otherwise cause errors at runtime.
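
If you would rather not comment that line in and out by hand, one possible alternative (a sketch, not part of the current script) is to detect the App Engine environment at runtime: on the App Engine standard environment the SERVER_SOFTWARE environment variable starts with "Google App Engine" in production, while the local development server reports a "Development/..." value.

import os
import requests_toolbelt.adapters.appengine

# only patch Requests when actually running on Google App Engine;
# the local development server reports SERVER_SOFTWARE as "Development/x.y"
if os.environ.get("SERVER_SOFTWARE", "").startswith("Google App Engine"):
    requests_toolbelt.adapters.appengine.monkeypatch()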

Besides calling the dependent functions scrape() and proc_data(), which carry out the web scraping and the data processing respectively, generate_lists() also stores the user-supplied maximum search depth max_level in a global variable so that the recursive function scrape() returns prematurely once it reaches the maximum search depth. Similarly, search_mode, which indicates one of the three search settings*, is stored as a global variable so that scrape() applies the same hyperlink selection criteria to every page it scrapes.
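
The scrape() function itself is covered in the next section, but purely to illustrate how the MAX_LEVEL guard terminates the recursion, a minimal sketch might look like the following. The link filter used here (internal article links inside paragraphs, capped at the first five) is an assumption made for this example only and stands in for the actual SEARCH_MODE-dependent selection criteria, and the (parent, child) pair output is likewise only an illustration.

import requests
from bs4 import BeautifulSoup

MAX_LEVEL = 2    # set from user input by generate_lists() in the real script
SEARCH_MODE = 0  # likewise; ignored in this simplified sketch

def scrape(self_url, currentLevel):
    # return prematurely once the maximum search depth is reached
    if currentLevel >= MAX_LEVEL:
        return []

    soup = BeautifulSoup(requests.get(self_url).text, "html.parser")

    pairs = []
    # stand-in for the SEARCH_MODE-dependent hyperlink selection criteria:
    # internal article links found inside paragraphs, first five only
    for a in soup.select("p > a[href^='/wiki/']")[:5]:
        child_url = "https://en.wikipedia.org" + a["href"]
        pairs.append((self_url, child_url))
        # recurse one level deeper into each selected article
        pairs.extend(scrape(child_url, currentLevel + 1))
    return pairs

Calling scrape("https://en.wikipedia.org/wiki/Web_scraping", currentLevel = 0) with these settings returns a list of (parent, child) URL pairs two levels deep, which is the kind of intermediate structure proc_data() could then turn into node and link lists.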

Scraping Data off Wikipedia

The recursive function scrape()

Processing all those Data
