Information Retrieval System of Towards Data Science

Components

Web Application

before starting: pip install -r requirements.txt
usage: python ./web_app/manage.py runserver
usage with docker:
- download the nltk_data, preprocessed_data and indexed data folders from my onedrive: here, and extract them into the root project directory
- then run:
```
docker-compose up
```

Simple Crawler

crawled website: Towards Data Science. Posts (articles) are read from sitemap.xml; for each post, the title and the content inside <p>...</p> are saved using simple XPath expressions
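
The two steps described above (listing post URLs from sitemap.xml, then pulling the title and <p> text out of each page) can be sketched roughly as follows. This is a minimal illustration and not the project's actual crawler code: it uses only the standard library's `xml.etree.ElementTree`, whose XPath support is limited and which only parses well-formed markup, so real pages would likely need an HTML-aware parser. The function names and the sample documents are hypothetical.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample inputs, used only to demonstrate the functions.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/post-1</loc></url>
  <url><loc>https://example.com/post-2</loc></url>
</urlset>"""

SAMPLE_PAGE = """<html><head><title>Post 1</title></head>
<body><p>First <b>bold</b> paragraph.</p><p>Second paragraph.</p></body></html>"""


def urls_from_sitemap(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def title_and_paragraphs(page_markup: str) -> tuple[str, list[str]]:
    """Extract the <title> text and the text of every <p> element.

    Assumes well-formed markup without an XML namespace.
    """
    root = ET.fromstring(page_markup)
    title = root.findtext(".//title", default="").strip()
    # itertext() also collects text nested in inline tags such as <b>.
    paragraphs = ["".join(p.itertext()).strip() for p in root.iter("p")]
    return title, paragraphs


if __name__ == "__main__":
    print(urls_from_sitemap(SAMPLE_SITEMAP))
    print(title_and_paragraphs(SAMPLE_PAGE))
```
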
usage: python main_crawler.py
or with custom parameters:

usage: main.py [-h] [-u MAIN_SITE_URL] [-o OUTPUT_DIR] [-p PREPARED_URLS]

Simple Crawler.

options:
  -h, --help            show this help message and exit
  -u MAIN_SITE_URL, --main_site_url MAIN_SITE_URL
                        main site that contains file robots.txt...
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        path to output dir where crawled_data directory is
                        created...
  -p PREPARED_URLS, --prepared_urls PREPARED_URLS
                        crawl prepared urls? True/False
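
The options above take True/False as text on the command line. How the project converts that text is not shown here; one common approach with `argparse` is an explicit converter, because `type=bool` would treat any non-empty string (including "False") as true. The sketch below is a reconstruction from the help output only: the `str_to_bool` helper and all default values are assumptions.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert 'True'/'False' (case-insensitive) to a bool, rejecting anything else."""
    lowered = value.strip().lower()
    if lowered == "true":
        return True
    if lowered == "false":
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the same option names as the help text above."""
    parser = argparse.ArgumentParser(description="Simple Crawler.")
    # Defaults below are placeholders, not the project's real defaults.
    parser.add_argument("-u", "--main_site_url", default=None,
                        help="main site that contains file robots.txt")
    parser.add_argument("-o", "--output_dir", default=None,
                        help="path to output dir where crawled_data directory is created")
    parser.add_argument("-p", "--prepared_urls", type=str_to_bool, default=False,
                        help="crawl prepared urls? True/False")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```
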

data already crawled by this app is available on my onedrive: here
extract it to "./crawled_data"
the dataset can easily be extended if needed
parallelization could be added as well, but it is not implemented in order to keep the crawler polite
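
As an illustration of the politeness trade-off mentioned above, a sequential crawl loop with a fixed pause between requests might look like the sketch below. It is not taken from the project: `crawl_politely`, its parameters and the one-second default delay are all hypothetical, and the `fetch` callable stands in for whatever function actually downloads a page.

```python
from __future__ import annotations

import time
from typing import Callable, Iterable


def crawl_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Fetch URLs one at a time, pausing between consecutive requests."""
    pages: dict[str, str] = {}
    for position, url in enumerate(urls):
        if position > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        pages[url] = fetch(url)
    return pages


if __name__ == "__main__":
    # A stand-in fetch function so the sketch runs without network access.
    result = crawl_politely(["a", "b"], fetch=lambda u: f"<html>{u}</html>", delay_seconds=0.1)
    print(result)
```

Fetching pages in parallel would remove these pauses between requests to the same host, which is the politeness concern noted above.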

NLTK preprocessor

usage: python main_preprocessor.py -i INPUT_FILE_PATH

usage: main_preprocessor.py [-h] -i INPUT_FILE_PATH [-o MAKE_CSV_ONLY]

preprocessor using NLTK lib

options:
  -h, --help            show this help message and exit
  -i INPUT_FILE_PATH, --input_file_path INPUT_FILE_PATH
  -o MAKE_CSV_ONLY, --make_csv_only MAKE_CSV_ONLY
                        reformat to csv only? True/False
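
The preprocessing steps themselves are not listed here. As a rough idea of what such a stage typically does, the sketch below lowercases text, splits it into tokens, drops stop words and writes the result as CSV. It deliberately uses only the standard library as a stand-in for NLTK, so it is not the project's implementation: the tiny stop-word set, the `title`/`content` column names and both function names are assumptions.

```python
from __future__ import annotations

import csv
import io
import re

# Illustrative stop-word list; NLTK ships a much larger one.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is"})
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def preprocess(text: str) -> list[str]:
    """Lowercase the text, split it into alphanumeric tokens and drop stop words."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def to_csv(documents: list[tuple[str, str]]) -> str:
    """Write preprocessed (title, content) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "content"])
    for title, content in documents:
        writer.writerow([" ".join(preprocess(title)), " ".join(preprocess(content))])
    return buffer.getvalue()


if __name__ == "__main__":
    print(preprocess("The Art of Indexing: an Introduction!"))
    print(to_csv([("The Title", "A body of text.")]))
```
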

Indexer (inverted index creator)

usage: python main_indexer.py -i INPUT_FILE_PATH

usage: main_indexer.py [-h] -i INPUT_FILE_PATH [-t INDEX_TITLES] [-c INDEX_CONTENTS]

Simple indexer

options:
  -h, --help            show this help message and exit
  -i INPUT_FILE_PATH, --input_file_path INPUT_FILE_PATH
  -t INDEX_TITLES, --index_titles INDEX_TITLES True/False
  -c INDEX_CONTENTS, --index_contents INDEX_CONTENTS True/False
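
For readers unfamiliar with the term, an inverted index maps each term to the documents that contain it. The sketch below shows the general idea on already tokenized documents; it is a generic illustration and not the project's indexer, and the function names and the choice to store term frequencies are assumptions.

```python
from __future__ import annotations

from collections import defaultdict


def build_inverted_index(documents: dict[int, list[str]]) -> dict[str, dict[int, int]]:
    """Map each term to {document id: number of occurrences in that document}."""
    index: defaultdict[str, dict[int, int]] = defaultdict(dict)
    for doc_id, tokens in documents.items():
        for token in tokens:
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return dict(index)


def search_all_terms(index: dict[str, dict[int, int]], terms: list[str]) -> set[int]:
    """Return ids of documents that contain every query term (boolean AND)."""
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings) if postings else set()


if __name__ == "__main__":
    docs = {1: ["data", "science", "data"], 2: ["science", "news"]}
    inverted = build_inverted_index(docs)
    print(inverted)
    print(search_all_terms(inverted, ["data", "science"]))
```

With options like -t and -c, one such index could be built for titles and another for contents, though whether the project keeps them separate is not stated here.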

