Information retrieval system that crawls, preprocesses, and indexes articles from the Towards Data Science website, featuring TF-IDF scoring, cosine similarity search, and a Django web interface for querying documents. The system implements both Boolean and vector space models, with support for stemming, stopword removal, and an inverted index.

danschnurp/IRS-Towards-Data-Science

Information Retrieval System of Towards Data Science

Components

Web Application

  • before the first start, install the dependencies: pip install -r requirements.txt

  • usage: python ./web_app/manage.py runserver

  • usage with docker:

    • download the nltk_data, preprocessed_data and indexed data folders from my OneDrive (here) and extract them into the project root directory
    • then run:
    docker-compose up
    

Simple Crawler

  • crawled website: Towards Data Science. Post (article) URLs are read from sitemap.xml; for each post, the title and the content of its <p>...</p> elements are saved using simple XPath expressions

  • usage: python main_crawler.py

  • or with custom parameters:

usage: main.py [-h] [-u MAIN_SITE_URL] [-o OUTPUT_DIR] [-p PREPARED_URLS]

SImple Crawler.

options:
  -h, --help            show this help message and exit
  -u MAIN_SITE_URL, --main_site_url MAIN_SITE_URL
                        main site that contains file robots.txt...
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        path to output dir where crawled_data directory is
                        created...
  -p PREPARED_URLS, --prepared_urls PREPARED_URLS
                        crawl prepared urls? True/False
  • prefetched data from this app are available on my OneDrive: here

  • extract to "./crawled_data"

  • if needed, dataset can be easily extended

  • parallelization could be added as well, but it is not implemented in order to keep the crawler polite
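
The sitemap-driven, sequential crawl described above can be illustrated with a minimal sketch. It is not the repository's actual code: the function names `extract_post_urls` and `crawl_politely`, the `fetch` callable, and the one-second delay are all assumptions made for illustration. Only the standard library is used, and the `.//sm:loc` lookup stands in for the "simple xpath expressions" mentioned above.

```python
import time
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_post_urls(sitemap_xml):
    """Return every <loc> URL listed in a sitemap document (hypothetical helper)."""
    root = ET.fromstring(sitemap_xml)
    # ElementTree supports a small XPath subset; './/sm:loc' matches all <loc> nodes.
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def crawl_politely(urls, fetch, delay_seconds=1.0):
    """Fetch URLs one at a time, sleeping between consecutive requests.

    `fetch` is any callable mapping a URL to page content. It is an assumption
    here; the real crawler's download routine is not shown in this README.
    """
    pages = {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # politeness pause, no parallel requests
        pages[url] = fetch(url)
    return pages
```

Passing the download function in as `fetch` keeps the sketch testable without network access; a real run would supply a function that performs the HTTP request and extracts the title and paragraph text.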

NLTK preprocessor

  • usage: python main_preprocessor.py
usage: main_preprocessor.py [-h] -i INPUT_FILE_PATH [-o MAKE_CSV_ONLY]

preprocessor using NLTK lib

options:
  -h, --help            show this help message and exit
  -i INPUT_FILE_PATH, --input_file_path INPUT_FILE_PATH
  -o MAKE_CSV_ONLY, --make_csv_only MAKE_CSV_ONLY
                        reformat to csv only? True/False

Indexer (inverted index creator)

  • usage: python main_indexer.py
usage: main_indexer.py [-h] -i INPUT_FILE_PATH [-t INDEX_TITLES] [-c INDEX_CONTENTS]

Simple indexer

options:
  -h, --help            show this help message and exit
  -i INPUT_FILE_PATH, --input_file_path INPUT_FILE_PATH
  -t INDEX_TITLES, --index_titles INDEX_TITLES True/False
  -c INDEX_CONTENTS, --index_contents INDEX_CONTENTS True/False
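
To make the role of the inverted index concrete, here is a minimal sketch of how such an index can serve both a Boolean query and a TF-IDF cosine-similarity ranking. It is an illustration under stated assumptions, not the code in main_indexer.py: all function names are hypothetical, documents are assumed to be already preprocessed into token lists, and the weighting `tf * log(N / df)` is one common TF-IDF variant that the project may or may not use.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}; `docs` maps doc_id -> token list."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index


def boolean_and(index, terms):
    """Boolean model: ids of the documents that contain every query term."""
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings) if postings else set()


def tfidf_vector(term_counts, index, n_docs):
    """Weight each term by tf * log(N / df); terms missing from the index are skipped."""
    return {
        term: tf * math.log(n_docs / len(index[term]))
        for term, tf in term_counts.items()
        if term in index
    }


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_tokens, docs, index):
    """Vector space model: documents sorted by cosine similarity to the query."""
    n_docs = len(docs)
    query_vec = tfidf_vector(Counter(query_tokens), index, n_docs)
    scores = {
        doc_id: cosine(query_vec, tfidf_vector(Counter(tokens), index, n_docs))
        for doc_id, tokens in docs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

For clarity the sketch recomputes each document vector at query time; a production indexer would typically precompute the document weights and norms when the index is built.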
