SLUB Document Classification and Similarity Analysis

This project provides a library for bibliographic document classification and similarity analysis.

It contains a selection of methods that support:

pre-processing of bibliographic metadata and full-text documents,
training of multi-label multi-class classification models,
integrating and using hierarchical subject classifications (pruning methods, performance scores),
similarity analysis and clustering.

A detailed description including tutorials and examples can be found in the API documentation, which needs to be generated as described below.

Installation

This project requires Python v3.8 or above and uses pip for dependency management. It also uses PyTorch to train artificial neural networks on GPUs. Make sure to install the latest Nvidia graphics drivers and check further requirements.
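
The prerequisites above can be checked with a short script. This is a minimal sketch, not part of the slub_docsa package: the function names are hypothetical, and it assumes PyTorch is importable as torch and that torch.cuda.is_available() is an adequate indicator of working GPU support.

```python
import importlib.util
import sys


def python_version_ok(minimum=(3, 8)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= minimum


def cuda_available():
    """Return True only if PyTorch is installed and reports a usable GPU."""
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed in this environment
    import torch  # imported lazily so the check also works without PyTorch
    return torch.cuda.is_available()


if __name__ == "__main__":
    print("Python >= 3.8:", python_version_ok())
    print("CUDA available:", cuda_available())
```

If the second check reports False although a GPU is present, the graphics driver or the installed PyTorch build is a likely place to look first.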

Via Python Package Installer (not available yet)

Once published to PyPI (not available yet), install via:

python3 -m pip install slub_docsa

From Source

Download the source code by checking out the repository:

git clone https://git.slub-dresden.de/lod/maschinelle-klassifizierung/docsa.git

Use make to install Python dependencies by executing the following commands:

make install or make install-test
(installs slub_docsa package and downloads all required runtime / test dependencies via pip)
make test
(runs tests to verify correct installation, requires test dependencies)
make docs
(generates API documentation, requires test dependencies)

From Source using Ubuntu 20.04

Install essentials like python3, pip and make:

apt-get update
(update the Ubuntu package installer index)
apt-get install -y make python3 python3-pip
(install python3, pip and make)

Optionally, set up a Python virtual environment:

apt-get install -y python3-venv
python3 -m venv /path/to/venv
source /path/to/venv/bin/activate
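
To confirm that the virtual environment is active before running the make commands, the interpreter can be asked directly. This is a small sketch with a hypothetical helper name; it relies on the standard behaviour that sys.prefix and sys.base_prefix differ inside an environment created by python3 -m venv.

```python
import sys


def in_virtualenv():
    """Return True if the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points to the environment, while
    # sys.base_prefix still points to the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```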

Run make commands as provided above:

make install-test
make test

Documentation

Further documentation of this project can be found at the following locations:

API documentation needs to be generated via make docs and is then available at docs/python/slub_docsa.html.
Tutorials and examples are included in the API documentation.
Developer meeting notes can be found in a separate Gitlab Wiki.
Results of various experiments related to the Qucosa and k10plus datasets can be found in a separate Gitlab repository.

Development

Python Virtual Environment

Download all developer dependencies and install the slub_docsa package via pip in development mode:

make install-dev

This will link your local project such that changes to source files are immediately reflected, see pip install -e.
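
One way to see whether the development install points at your checkout is to ask Python where it would load the package from. The helper below is a hypothetical sketch, not part of slub_docsa; it assumes the package is importable under the name slub_docsa, as the pip command in the installation section suggests.

```python
import importlib.util


def module_origin(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# After a development-mode install, this path is expected to lie inside the
# cloned repository rather than in site-packages; None means not installed.
print(module_origin("slub_docsa"))
```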

Container Environment

This project also provides container images for development. You can use Docker or another container runtime such as podman.

Install a Container Runtime

Either install docker and docker-compose:
- Install docker, see https://docs.docker.com/get-docker/
- Install docker-compose, see https://docs.docker.com/compose/install/
Or set up podman in Fedora 34 including the Nvidia container runtime:
- Install the Nvidia graphics drivers and check that they are working by running nvidia-smi
- Install podman and podman-compose from repositories via dnf install podman podman-compose
- Install the nvidia container runtime using the centos8 repositories via dnf install nvidia-container-runtime, see installation instructions
- Set no-cgroups = true in /etc/nvidia-container-runtime/config.toml, which is required since Nvidia does not yet support cgroups v2
- Check your CUDA version with nvidia-smi, e.g., 11.4
- Identify the matching cuda docker image, e.g., nvidia/cuda:11.4.1-base-centos8
- Verify gpu support in podman via podman run --security-opt=label=disable --rm nvidia/cuda:11.4.1-base-centos8 nvidia-smi

Set Up the Development Environment

Docker images for development can be found in the code/docker/devel directory.
Run build.sh gpu to build these docker images with gpu support.
Run up.sh gpu and down.sh gpu to start and shut down the development container.
Run shell_python.sh gpu to enter the Python container with gpu support.
Run shell_annif.sh to enter the Annif container.

Set up Visual Studio Code, which supports many useful features during development:

Python Integration, including autocompletion, linting and debugging
Remote Container, which allows using the Python environment provided by a docker container

Continuous Integration

The CI pipeline can be triggered by running make coverage and make lint. Together, these commands run automated tests using pytest, enforce coding guidelines using pylint and flake8, and check for common security issues using bandit.
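
To run both checks in sequence and stop at the first failure, a small wrapper can be used. This is only a sketch: run_steps is a hypothetical helper, not part of the project, and it assumes that each make target signals failure through a non-zero exit code.

```python
import subprocess


def run_steps(commands):
    """Run each command in order; return the first one that fails, else None."""
    for command in commands:
        result = subprocess.run(command)
        if result.returncode != 0:
            return command  # stop here, later steps are skipped
    return None


# Example call, to be run from the repository root with make installed:
# failed = run_steps([["make", "coverage"], ["make", "lint"]])
```

A return value of None means every step exited successfully; otherwise the returned list names the command that failed.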
