GitHub - jtom95/scientific-papers-ocr: Performs Optical Character Recognition on Scientific Articles and Publications.

This is a wrapper around facebook's NOUGAT model for scientific paper OCR.

Improvements on Regular Nougat:

For lower end GPUs, it is necessary to set lower precision and use the smaller model 0.1.0-small. As a consequence the model is prone to skip parts of the papers. In this implementation, tesseract is used to double check if some parts of the paper are skipped. It does this by chosing representative sequences of words sampled from the text and verifying that they appear in the nougat extraction.
By default nougat returns .mmd files corresponding to each page. This project returns large .json files that collect, not only the raw text, but also useful metadata of the pdf paper (such as references information, authors, year etc.). The .json files are named "rich_documents" in the code.
Functions have been developed to write the full markdown text (.md) or write on a Notion page from the rich document.

Setup

Download the nougat transformer model of your choice from ... [small model is recommended for lower end GPUs].
Install pytorch with access to cuda: https://pytorch.org/get-started/locally/
- verify cuda is installed an available on your pc by opening the CMD and typing: nvidia-smi
- verify that pytorch is intalled with cuda available by opening a CMD and typing:
```
python

import torch
torch.cuda.is_available()
```
  should return True
Install tesseract on your PC (for windows: https://github.com/UB-Mannheim/tesseract/wiki)
- verify it is installed by opening a CMD and typing: tesseract
- After you have installed tesseract, install pytesseract with pip.
install the requirements listed in the pyproject.toml file.

Setup for Notion

In order to use the integration with Notion you must:

setup a new integration on [notion/integrations]https://www.notion.com/my-integrations.
save the Notion API key: "secret...."
go to the page of interest. Allow the integration on this page by clicking on ... then on + Add Connections and selcting your integration.
save the page ID (you can obtain the page id by copying the page link).

Full tutorial at: https://developers.notion.com/docs/create-a-notion-integration

Suggestion

In order to run the provided high level functions you need to provide the:

path to the NOUGAT tranformer model
notion API key (and optionally notion version)

While it is possible to pass these in the functions themselves, the simpler solution is to set them as environment variables. If you're working from this directory you can achieve this simply by uncommenting and filling out the .env file.

High-Level Functions Description and Examples

1. `generate_rich_documents.py`

Description: This script processes PDF files using the NOUGAT OCR model to generate rich document databases. These databases contain structured data extracted from the PDFs, including raw text and metadata such as references, authors, and publication year. The script supports batch processing of multiple PDFs, allows specifying the model size (small or base), and can start processing from scratch or continue from a previous state.

Arguments:

pdf_paths: List of paths to PDF files to be processed.
--output_dir (-o): The directory where the output database will be stored. Defaults to the current directory.
--start_from_scratch: Flag to indicate whether to start processing from scratch. Defaults to True.
--model_directory: The directory where the NOUGAT model is stored. If not provided, it attempts to use the NOUGAT_MODEL_DIR environment variable.
--model_size: The size of the NOUGAT model to use (small or base). Defaults to small.
--batch_size: The number of PDFs to process in a batch. Defaults to 1.

Example:

python generate_rich_documents.py "path/to/pdf1.pdf" "path/to/pdf2.pdf" -o "path/to/output_dir" --model_directory "path/to/model_dir" --model_size small --batch_size 2

2. `rich_documents_to_markdown.py`

Description: This script converts rich documents (in JSON format) into Markdown files. It reads the structured data from the rich documents and generates Markdown files that can include references and other metadata. The script supports processing multiple rich documents in a batch.

Arguments:

rich_document_paths: List of paths to rich document JSON files to be converted.
--output_dir (-o): The directory where the Markdown files will be stored. Defaults to the current directory.

Example:

python rich_documents_to_markdown.py "path/to/rich_document1.json" "path/to/rich_document2.json" -o "path/to/output_dir"

3. `rich_documents_to_notion.py`

Description: This script uploads rich documents (in JSON format) to a specified Notion page. It uses the Notion API to create or update pages with the content from the rich documents, including text, references, and metadata. The script supports uploading multiple rich documents to the same or different Notion pages.

Arguments:

rich_document_paths: List of paths to rich document JSON files to be uploaded.
--page_id (-p): The ID of the Notion page where the documents will be uploaded. If not provided, the script will prompt for it.
--notion_api_key (-k): The Notion API key used for authentication. If not provided, the script attempts to use the NOTION_API_KEY environment variable.
--notion_version (-v): The version of the Notion API to use. Defaults to "2022-06-28".

Example:

python rich_documents_to_notion.py "path/to/rich_document1.json" "path/to/rich_document2.json" -p "notion_page_id" -k "your_notion_api_key" -v "2022-06-28"

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
.vscode		.vscode
extract_OCR		extract_OCR
helper_functions		helper_functions
notion_interface		notion_interface
rich_docs		rich_docs
.env		.env
.gitignore		.gitignore
LICENCE		LICENCE
NOTICE		NOTICE
README.md		README.md
generate_rich_documents.py		generate_rich_documents.py
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
rich_document_to_markdown.py		rich_document_to_markdown.py
rich_document_to_notion.py		rich_document_to_notion.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Improvements on Regular Nougat:

Setup

Setup for Notion

Suggestion

High-Level Functions Description and Examples

1. `generate_rich_documents.py`

2. `rich_documents_to_markdown.py`

3. `rich_documents_to_notion.py`

About

Uh oh!

Releases

Packages

Uh oh!

Languages

License

jtom95/scientific-papers-ocr

Folders and files

Latest commit

History

Repository files navigation

Improvements on Regular Nougat:

Setup

Setup for Notion

Suggestion

High-Level Functions Description and Examples

1. generate_rich_documents.py

2. rich_documents_to_markdown.py

3. rich_documents_to_notion.py

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

1. `generate_rich_documents.py`

2. `rich_documents_to_markdown.py`

3. `rich_documents_to_notion.py`

Packages