This project provides a library for bibliographic document classification and similarity analysis.
It contains a selection of methods that support:
- pre-processing of bibliographic metadata and full-text documents,
- training of multi-label multi-class classification models,
- integrating and using hierarchical subject classifications (pruning methods, performance scores),
- similarity analysis and clustering.
A detailed description including tutorials and examples can be found in the API documentation, which needs to be generated as described below.
This project requires Python v3.8 or above and uses pip for dependency management. In addition, this package uses PyTorch to train artificial neural networks on GPUs. Make sure to install the latest Nvidia graphics drivers and check further requirements.
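The requirements above can be checked with a short script. This is only a minimal sketch: the function name check_requirements and the report keys are hypothetical and not part of this project, and the GPU check assumes that PyTorch is importable as torch.

```python
import importlib.util
import sys


def check_requirements(min_version=(3, 8)):
    """Collect basic facts about the interpreter and an optional PyTorch install.

    Hypothetical helper for illustration; it is not part of slub_docsa.
    """
    report = {
        # Tuple comparison, e.g. (3, 10, ...) >= (3, 8)
        "python_ok": sys.version_info >= min_version,
        # True if a module named "torch" can be found without importing it
        "torch_installed": importlib.util.find_spec("torch") is not None,
        # Stays None if PyTorch is missing or fails to import
        "cuda_available": None,
    }
    if report["torch_installed"]:
        try:
            import torch

            report["cuda_available"] = torch.cuda.is_available()
        except Exception:
            # A broken install should not crash the check itself
            report["cuda_available"] = None
    return report


if __name__ == "__main__":
    print(check_requirements())
```

If cuda_available is reported as False although a GPU is present, the graphics driver or the installed PyTorch build is the likely place to look.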
Once published to PyPI (not available yet), install via:
python3 -m pip install slub_docsa
Download the source code by checking out the repository:
git clone https://git.slub-dresden.de/lod/maschinelle-klassifizierung/docsa.git
Use make to install Python dependencies by executing the following commands:
- make install or make install-test (installs the slub_docsa package and downloads all required runtime / test dependencies via pip)
- make test (runs tests to verify correct installation, requires test dependencies)
- make docs (generates the API documentation, requires test dependencies)
Install essentials like python3, pip and make:
- apt-get update (updates the Ubuntu package installer index)
- apt-get install -y make python3 python3-pip (installs python3, pip and make)
Optionally, set up a python virtual environment:
apt-get install -y python3-venv
python3 -m venv /path/to/venv
source /path/to/venv/bin/activate
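To confirm that the virtual environment is actually active before running the make commands, the interpreter can be asked directly. A minimal sketch, assuming an environment created with python3 -m venv as above; the helper name in_virtualenv is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True if the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("packages will be installed under:", sys.prefix)
```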
Run make commands as provided above:
make install-test
make test
Further documentation of this project can be found at the following locations:
- API documentation needs to be generated via make docs and is then provided at docs/python/slub_docsa.html.
- Tutorials and examples are included in the API documentation.
- Developer meeting notes can be found in a separate Gitlab Wiki.
- Results of various experiments related to the Qucosa and k10plus datasets can be found in a separate Gitlab repository.
Download all developer dependencies and install the slub_docsa package via pip in development mode:
make install-dev
This will link your local project such that changes to source files are immediately reflected, see pip install -e.
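One way to see whether the development install took effect is to look up where Python would import the package from. This is a sketch only: the helper locate_package is hypothetical, and it assumes the import name matches the package name slub_docsa.

```python
import importlib.util


def locate_package(name):
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    # spec is None if no package of that name can be found
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # Assumption: the package is importable under the name "slub_docsa"
    print(locate_package("slub_docsa"))
```

If the printed path lies inside your checkout rather than in a site-packages directory, changes to the source files should be picked up without reinstalling.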
This project also provides container images for development. You can use docker, but also other container runtimes, e.g., podman.
Install a Container Runtime
- Either install docker and docker-compose:
  - Install docker, see https://docs.docker.com/get-docker/
  - Install docker-compose, see https://docs.docker.com/compose/install/
- Or set up podman in Fedora 34 including the Nvidia container runtime:
  - Install the Nvidia graphics driver and check that it is working by running nvidia-smi
  - Install podman and podman-compose from repositories via dnf install podman podman-compose
  - Install the nvidia container runtime using the centos8 repositories via dnf install nvidia-container-runtime, see installation instructions
  - Set no-cgroups = true in /etc/nvidia-container-runtime/config.toml, which is required since Nvidia does not yet support cgroups v2
  - Check your CUDA version with nvidia-smi, e.g., 11.4
  - Identify the matching cuda docker image, e.g., nvidia/cuda:11.4.1-base-centos8
  - Verify gpu support in podman via podman run --security-opt=label=disable --rm nvidia/cuda:11.4.1-base-centos8 nvidia-smi
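Before building the development images, it can help to check which of the tools mentioned above are on the PATH. A minimal sketch using only the standard library; the helper name find_container_tools is hypothetical, and the default tool list simply repeats the commands named in this section.

```python
import shutil


def find_container_tools(
    candidates=("docker", "docker-compose", "podman", "podman-compose", "nvidia-smi"),
):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in candidates}


if __name__ == "__main__":
    for name, path in find_container_tools().items():
        print(f"{name}: {path or 'not found'}")
```

This only confirms that the executables exist; it does not verify that the container runtime can access the GPU, which the podman run command above is meant to test.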
Setup the Development Environment
- Docker images for development can be found in the code/docker/devel directory.
- Run build.sh gpu to build these docker images with gpu support.
- Run up.sh gpu and down.sh gpu to start and shut down the development container.
- Run shell_python.sh gpu to enter the python container with gpu support.
- Run shell_annif.sh to enter the Annif container.
Set up Visual Studio Code, which supports many useful features during development:
- Python Integration, including auto-completion, linting and debugging
- Remote Container, which allows using the Python environment provided by a docker container
Continuous Integration