Add preprocessing module. #2

Closed
wants to merge 4 commits into from
2 changes: 1 addition & 1 deletion .github/workflows/package.yaml
@@ -25,7 +25,7 @@ jobs:
id: meta
uses: docker/metadata-action@v5
with:
images: mzdotai/blueprint
images: mzdotai/structured_qa
flavor: |
latest=auto

2 changes: 2 additions & 0 deletions .gitignore
@@ -162,3 +162,5 @@ cython_debug/

.idea/
.vscode/

example_outputs
1 change: 1 addition & 0 deletions .pre-commit-config.yaml
@@ -3,6 +3,7 @@ repos:
rev: v5.0.0
hooks:
- id: check-added-large-files
exclude: example_data
- id: check-case-conflict
- id: check-json
- id: check-merge-conflict
6 changes: 3 additions & 3 deletions Dockerfile
@@ -7,10 +7,10 @@ RUN apt-get update && apt-get install -y \
git


COPY . /home/appuser/blueprint
WORKDIR /home/appuser/blueprint
COPY . /home/appuser/structured_qa
WORKDIR /home/appuser/structured_qa

RUN pip3 install /home/appuser/blueprint
RUN pip3 install /home/appuser/structured_qa

RUN groupadd --gid 1000 appuser \
&& useradd --uid 1000 --gid 1000 -ms /bin/bash appuser
30 changes: 27 additions & 3 deletions demo/app.py
@@ -1,7 +1,31 @@
from io import BytesIO
from pathlib import Path

import streamlit as st
from docling_core.types.io import DocumentStream
from structured_qa.preprocessing import document_to_sections_dir

st.title("Structured Q&A")

st.header("Uploading Data")

uploaded_file = st.file_uploader(
    "Choose a file", type=["pdf", "html", "txt", "docx", "md"]
)

from blueprint.hello import hello
if uploaded_file is not None:
    st.divider()
    st.header("Loading and converting to sections")
    st.markdown("[Docs for this Step]()")
    st.divider()

st.title("Blueprint Demo")
    with st.spinner("Converting to sections..."):
        document_to_sections_dir(
            DocumentStream(
                name=uploaded_file.name, stream=BytesIO(uploaded_file.read())
            ),
            "output",
        )

st.write(hello())
    sections = [f.stem for f in Path("output").iterdir()]
    st.json(sections)
2 changes: 1 addition & 1 deletion docs/api.md
@@ -1,3 +1,3 @@
# API Reference

"::: blueprint.hello"
::: structured_qa.preprocessing
26 changes: 26 additions & 0 deletions docs/cli.md
@@ -0,0 +1,26 @@
# Command Line Interface

Once you have [installed the blueprint](./getting-started.md), you can use it from the CLI.

You can either provide the path to a configuration file:

```bash
structured-qa --from_config "example_data/config.yaml"
```

Or provide values to the arguments directly:


```bash
structured-qa \
  --input_file "example_data/EU_AI_ACT_CHAPTER_V.pdf" \
  --output_dir "example_outputs/EU_AI_ACT_CHAPTER_V"
```

---

::: structured_qa.cli.structured_qa

---

::: structured_qa.config.Config
8 changes: 4 additions & 4 deletions docs/future-features-contributions.md
@@ -7,18 +7,18 @@ This Blueprint is an evolving project designed to grow with the help of the open
## 🌟 **How You Can Contribute**

### 🛠️ **Enhance the Blueprint**
- Check the [Issues](https://github.com/mozilla-ai/blueprint-template/issues) page to see if there are feature requests you'd like to implement
- Refer to our [Contribution Guide](https://github.com/mozilla-ai/blueprint-template/blob/main/CONTRIBUTING.md) for more details on contributions
- Check the [Issues](https://github.com/mozilla-ai/structured-qa/issues) page to see if there are feature requests you'd like to implement
- Refer to our [Contribution Guide](https://github.com/mozilla-ai/structured-qa/blob/main/CONTRIBUTING.md) for more details on contributions

### 🎨 **Extensibility Ideas**

This Blueprint is designed to be a foundation you can build upon. By extending its capabilities, you can open the door to new applications, improve user experience, and adapt the Blueprint to address other use cases. Here are a few ideas for how you can expand its potential:


We’d love to see how you can enhance this Blueprint! If you create improvements or extend its capabilities, consider contributing them back to the project so others in the community can benefit from your work. Check out our [Contributions Guide](https://github.com/mozilla-ai/blueprint-template/blob/main/CONTRIBUTING.md) to get started!
We’d love to see how you can enhance this Blueprint! If you create improvements or extend its capabilities, consider contributing them back to the project so others in the community can benefit from your work. Check out our [Contributions Guide](https://github.com/mozilla-ai/structured-qa/blob/main/CONTRIBUTING.md) to get started!

### 💡 **Share Your Ideas**
Got an idea for how this Blueprint could be improved? You can share your suggestions through [GitHub Discussions](https://github.com/mozilla-ai/blueprint-template/discussions).
Got an idea for how this Blueprint could be improved? You can share your suggestions through [GitHub Discussions](https://github.com/mozilla-ai/structured-qa/discussions).

### 🌍 **Build New Blueprints**
This project is part of a larger initiative to create a collection of reusable starter code solutions that use open-source AI tools. If you’re inspired to create your own Blueprint, you can use the [Blueprint-template](https://github.com/new?template_name=Blueprint-template&template_owner=mozilla-ai) to get started.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -30,7 +30,7 @@ These docs are your companion to mastering this Blueprint.
- **[Future Features & Contributions](future-features-contributions.md):** Learn about exciting upcoming features and how to contribute to the project.


Have more questions? Reach out to us on [GitHub Discussions](https://github.com/mozilla-ai/blueprint-template/discussions).
Have more questions? Reach out to us on [GitHub Discussions](https://github.com/mozilla-ai/structured-qa/discussions).

---

Binary file added example_data/1706.03762v7.pdf
Binary file not shown.
2 changes: 2 additions & 0 deletions example_data/config.yaml
@@ -0,0 +1,2 @@
input_file: example_data/EU_AI_ACT_CHAPTER_V.pdf
output_dir: example_outputs/EU_AI_ACT_CHAPTER_V
5 changes: 3 additions & 2 deletions mkdocs.yml
@@ -1,12 +1,13 @@
site_name: Blueprints Docs
repo_url: https://github.com/mozilla-ai/blueprint-template
repo_name: blueprint-template
repo_url: https://github.com/mozilla-ai/structured-qa
repo_name: structured-qa

nav:
- Home: index.md
- Getting Started: getting-started.md
- Step-by-Step Guide: step-by-step-guide.md
- Customization Guide: customization.md
- Command Line Interface: cli.md
- API Reference: api.md
- Future Features & Contributions: future-features-contributions.md

Expand Down
15 changes: 11 additions & 4 deletions pyproject.toml
@@ -3,13 +3,17 @@ requires = ["setuptools>=48", "setuptools_scm[toml]>=6.3.1"]
build-backend = "setuptools.build_meta"

[project]
name = "blueprint"
name = "structured-qa"
readme = "README.md"
license = {text = "Apache-2.0"}
requires-python = ">=3.10"
dynamic = ["version"]
dependencies = [
"fire",
"loguru",
"langchain-text-splitters",
"pymupdf4llm",
"pyyaml",
"streamlit",
]

@@ -26,13 +30,16 @@ tests = [
]

[project.urls]
Documentation = "https://mozilla-ai.github.io/Blueprint-template/"
Issues = "https://github.com/mozilla-ai/Blueprint-template/issues"
Source = "https://github.com/mozilla-ai/Blueprint-template"
Documentation = "https://mozilla-ai.github.io/structured-qa/"
Issues = "https://github.com/mozilla-ai/structured-qa/issues"
Source = "https://github.com/mozilla-ai/structured-qa"

[tool.setuptools.packages.find]
exclude = ["tests", "tests.*"]
where = ["src"]
namespaces = false

[tool.setuptools_scm]

[project.scripts]
structured-qa = "structured_qa.cli:main"
8 changes: 0 additions & 8 deletions src/blueprint/hello.py

This file was deleted.

File renamed without changes.
47 changes: 47 additions & 0 deletions src/structured_qa/cli.py
@@ -0,0 +1,47 @@
from pathlib import Path

from fire import Fire
from loguru import logger

import yaml

from structured_qa.config import Config
from structured_qa.preprocessing import document_to_sections_dir


@logger.catch(reraise=True)
def structured_qa(
    input_file: str | None = None,
    output_dir: str | None = None,
    from_config: str | None = None,
):
    """Convert a document into a directory of sections.

    Args:
        input_file: Path to the input document.
        output_dir: Path to the output directory.
            Structure of the output directory:

            ```
            output_dir/
                section_1.txt
                section_2.txt
                ...
            ```
        from_config: The path to the config file.

            If provided, all other arguments will be ignored.
    """
    if from_config:
        config = Config.model_validate(yaml.safe_load(Path(from_config).read_text()))
    else:
        Path(output_dir).mkdir(exist_ok=True, parents=True)
        config = Config(input_file=input_file, output_dir=output_dir)

    logger.info("Loading and converting to sections")
    document_to_sections_dir(config.input_file, config.output_dir)
    logger.success("Done")


def main():
    Fire(structured_qa)
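
A note on the precedence rule stated in the docstring above: when `from_config` is given, the other arguments are ignored. The following is a minimal, stdlib-only sketch of that rule and not the PR's code. `resolve_settings` is a hypothetical helper, and JSON stands in for the YAML file that the PR reads with `yaml.safe_load`.

```python
import json
import tempfile
from pathlib import Path


def resolve_settings(input_file=None, output_dir=None, from_config=None):
    """Return the effective settings as a plain dict.

    Hypothetical helper: it mirrors the precedence rule only, not the real CLI.
    """
    if from_config:
        # The config file wins; explicit arguments are ignored.
        # JSON is an assumption made to keep the sketch stdlib-only.
        return json.loads(Path(from_config).read_text())
    return {"input_file": input_file, "output_dir": output_dir}


with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps({"input_file": "a.pdf", "output_dir": "out_a"}))

    # Both explicit arguments and a config file are passed: the file is used.
    from_file = resolve_settings("b.pdf", "out_b", from_config=str(config_path))
    # Only explicit arguments are passed: they are used as given.
    from_args = resolve_settings("b.pdf", "out_b")
```

Silently ignoring explicit arguments can surprise users, so a warning when both are supplied might be worth considering.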
6 changes: 6 additions & 0 deletions src/structured_qa/config.py
@@ -0,0 +1,6 @@
from pydantic import BaseModel, DirectoryPath, FilePath


class Config(BaseModel):
    input_file: FilePath
    output_dir: DirectoryPath
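
pydantic's `FilePath` and `DirectoryPath` only accept paths that already exist, which is why the `else` branch in `cli.py` creates `output_dir` before building `Config`. Below is a rough stdlib sketch of the same check. `SketchConfig` is a hypothetical dataclass standing in for the pydantic model, and its error type and messages are assumptions.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SketchConfig:
    """Hypothetical stdlib stand-in for the PR's pydantic Config."""

    input_file: Path
    output_dir: Path

    def __post_init__(self):
        self.input_file = Path(self.input_file)
        self.output_dir = Path(self.output_dir)
        # Both paths must already exist, as with FilePath / DirectoryPath.
        if not self.input_file.is_file():
            raise ValueError(f"{self.input_file} is not an existing file")
        if not self.output_dir.is_dir():
            raise ValueError(f"{self.output_dir} is not an existing directory")


with tempfile.TemporaryDirectory() as tmp:
    document = Path(tmp) / "document.txt"
    document.write_text("# Title\nBody")
    sections_dir = Path(tmp) / "sections"  # does not exist yet

    try:
        SketchConfig(document, sections_dir)
        rejected_missing_dir = False
    except ValueError:
        rejected_missing_dir = True

    # Creating the directory first makes the same values validate.
    sections_dir.mkdir(parents=True, exist_ok=True)
    config = SketchConfig(document, sections_dir)
    accepted_after_mkdir = config.output_dir.is_dir()
```

The same ordering matters for the `from_config` path: an `output_dir` listed in the config file has to exist before validation.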
56 changes: 56 additions & 0 deletions src/structured_qa/preprocessing.py
@@ -0,0 +1,56 @@
from pathlib import Path

import pymupdf4llm
from langchain_text_splitters import MarkdownHeaderTextSplitter

from loguru import logger


@logger.catch(reraise=True)
def document_to_sections_dir(input_file: str, output_dir: str) -> list[str]:
    """
    Convert a document to a directory of sections.

    Uses [pymupdf4llm](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) to convert input_file to markdown.
    Then uses `MarkdownHeaderTextSplitter` from `langchain_text_splitters` to split the markdown into sections based on the headers.

    Args:
        input_file: Path to the input document.
        output_dir: Path to the output directory.
            Structure of the output directory:

            ```
            output_dir/
                section_1.txt
                section_2.txt
                ...
            ```

    Returns:
        List of section names.
    """

    logger.info(f"Converting {input_file}")
    md_text = pymupdf4llm.to_markdown(input_file)
    logger.success("Converted")

    logger.info("Extracting sections")
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
    )
    sections = splitter.split_text(md_text)
    logger.success(f"Found {len(sections)} sections")

    logger.info(f"Writing sections to {output_dir}")
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True, parents=True)
    section_names = []
    for section in sections:
        section_name = list(section.metadata.values())[-1]
        section_names.append(section_name)
        (output_dir / f"{section_name.replace('/', '_')}.txt").write_text(
            section.page_content
        )
    logger.success("Done")

    return section_names
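
To illustrate the split-and-write step above without the third-party dependencies, here is a simplified stdlib sketch. It is a stand-in for `MarkdownHeaderTextSplitter`, not langchain's implementation: `split_by_headers` and `write_sections` are hypothetical names, text before the first header is dropped, and code fences get no special handling.

```python
import re
import tempfile
from pathlib import Path

# Matches level 1 to 3 Markdown headers, the same levels the PR splits on.
HEADER = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(md_text):
    """Split Markdown into (section_name, body) pairs at #, ## and ### headers."""
    sections = []
    name, lines = None, []
    for line in md_text.splitlines():
        match = HEADER.match(line)
        if match:
            if name is not None:
                sections.append((name, "\n".join(lines).strip()))
            name, lines = match.group(2).strip(), []
        elif name is not None:
            lines.append(line)
    if name is not None:
        sections.append((name, "\n".join(lines).strip()))
    return sections


def write_sections(md_text, output_dir):
    """Write one .txt file per section and return the section names."""
    output_dir = Path(output_dir)
    output_dir.mkdir(exist_ok=True, parents=True)
    names = []
    for name, body in split_by_headers(md_text):
        names.append(name)
        # Same sanitisation as the PR: "/" cannot appear in a file name.
        (output_dir / f"{name.replace('/', '_')}.txt").write_text(body)
    return names


sample = "# Intro\nHello\n## Input/Output\nDetails\n"
with tempfile.TemporaryDirectory() as tmp:
    section_names = write_sections(sample, tmp)
    written = {p.name for p in Path(tmp).iterdir()}
```

Because the file name is derived from the header text alone, two sections with the same title would overwrite each other; that also holds for the naming scheme in the PR.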
8 changes: 8 additions & 0 deletions tests/conftest.py
@@ -0,0 +1,8 @@
from pathlib import Path

import pytest


@pytest.fixture(scope="session")
def example_data():
    return Path(__file__).parent.parent / "example_data"
5 changes: 0 additions & 5 deletions tests/unit/test_hello.py

This file was deleted.

9 changes: 9 additions & 0 deletions tests/unit/test_preprocessing.py
@@ -0,0 +1,9 @@
from structured_qa.preprocessing import document_to_sections_dir


def test_document_to_sections_dir(tmp_path, example_data):
    output_dir = tmp_path / "output"
    document_to_sections_dir(example_data / "1706.03762v7.pdf", output_dir)
    sections = list(output_dir.iterdir())
    assert all(section.is_file() and section.suffix == ".txt" for section in sections)
    assert len(sections) == 10