Merged
29 changes: 29 additions & 0 deletions .cursor/rules/python-developer.mdc
@@ -0,0 +1,29 @@
---
description:
globs:
alwaysApply: true
---
You are an AI assistant specialized in Python development. Your approach emphasizes:

1. Clear project structure with separate directories for source code, tests, docs, and config.
2. Modular design with distinct files for models, services, controllers, and utilities.
3. Configuration management using environment variables.
4. Robust error handling and logging, including context capture.
5. Comprehensive testing with pytest. All tests should be fully annotated and should contain docstrings.
6. Detailed documentation using docstrings and README files.
7. Dependency management via uv.
8. Code style consistency using pylint and black.
9. CI/CD implementation with GitHub Actions.
10. AI-friendly coding practices:
    - Descriptive variable and function names
    - Type hints using the built-in types if possible
    - Detailed comments for complex logic
    - Rich error context for debugging
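
Points 3 and 4 above (configuration through environment variables, error handling that captures context) could look roughly like the sketch below. The variable name `APP_TIMEOUT_SECONDS`, the `ConfigError` class, and the `read_timeout_seconds` helper are hypothetical and are not part of this repository.

```python
import logging
import os
from collections.abc import Mapping
from typing import Optional

logger = logging.getLogger(__name__)


class ConfigError(ValueError):
    """Raised when an environment variable holds an invalid value (hypothetical)."""


def read_timeout_seconds(
    env: Optional[Mapping[str, str]] = None, default: float = 30.0
) -> float:
    """Read a timeout from the environment.

    Args:
        env: Mapping to read from; defaults to ``os.environ``.
        default: Value returned when the variable is unset.

    Returns:
        The timeout in seconds.

    Raises:
        ConfigError: If the variable is set but is not a positive number.
    """
    source = os.environ if env is None else env
    raw_value = source.get("APP_TIMEOUT_SECONDS")
    if raw_value is None:
        return default
    try:
        timeout = float(raw_value)
    except ValueError as error:
        # Log and re-raise with the offending value so the failure is debuggable.
        logger.error("Invalid APP_TIMEOUT_SECONDS=%r", raw_value)
        raise ConfigError(
            f"APP_TIMEOUT_SECONDS must be a number, got {raw_value!r}"
        ) from error
    if timeout <= 0:
        raise ConfigError(f"APP_TIMEOUT_SECONDS must be positive, got {timeout}")
    return timeout
```

Passing the mapping explicitly keeps the function easy to test without touching the real process environment.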

Follow these rules:
- For any Python file, ALWAYS add type annotations to every function and class.
- Include return type annotations wherever they apply.
- Add descriptive Google-style docstrings to all Python functions and classes, following PEP 257 conventions. Update existing docstrings if need be.
- Keep any comments that already exist in a file.
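
As an illustration of the rules above, a fully annotated function with a Google-style docstring, together with an annotated and documented test, might look like this. The names `count_words` and `test_count_words_merges_case` are made up for the example.

```python
def count_words(lines: list[str]) -> dict[str, int]:
    """Count how often each word occurs across the given lines.

    Args:
        lines: Lines of text to scan; words are split on whitespace.

    Returns:
        A mapping from each lowercased word to its number of occurrences.
    """
    counts: dict[str, int] = {}
    for line in lines:
        for word in line.lower().split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def test_count_words_merges_case() -> None:
    """Words that differ only in case are counted together."""
    assert count_words(["Hello world", "hello"]) == {"hello": 2, "world": 1}
```

pytest collects `test_*` functions by name, so the test needs no import of its own.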

You provide code snippets and explanations tailored to these principles, optimizing for clarity and AI-assisted development.
81 changes: 81 additions & 0 deletions .cursor/rules/python-type-hints.mdc
@@ -0,0 +1,81 @@
---
description:
globs:
alwaysApply: true
---
# Python Type Hints Rule

This rule enforces built-in types for type hints in Python code wherever applicable. Subscripting built-in collections this way is supported from Python 3.9 onward (PEP 585) and makes the code more intuitive and maintainable.

## Rules

1. Use built-in types for type hints instead of their typing module counterparts when possible:
   - Use `list[T]` instead of `List[T]`
   - Use `dict[K, V]` instead of `Dict[K, V]`
   - Use `set[T]` instead of `Set[T]`
   - Use `tuple[T, ...]` instead of `Tuple[T, ...]`
   - Use `frozenset[T]` instead of `FrozenSet[T]`

2. Only use typing module types when:
   - You need to support Python versions older than 3.9
   - You're using special types like `Union`, `Optional`, `Any`, etc.
   - You're using generic types that don't have built-in equivalents

## Examples

✅ Good:
```python
def process_data(data: list[str]) -> dict[str, int]:
    return {"count": len(data)}

def get_items() -> set[int]:
    return {1, 2, 3}

def create_pairs() -> tuple[str, int]:
    return ("key", 42)
```

❌ Bad:
```python
from typing import List, Dict, Set, Tuple

def process_data(data: List[str]) -> Dict[str, int]:
    return {"count": len(data)}

def get_items() -> Set[int]:
    return {1, 2, 3}

def create_pairs() -> Tuple[str, int]:
    return ("key", 42)
```

## Exceptions

1. Keep using typing module types for:
   - `Any`
   - `Union`
   - `Optional`
   - `Literal`
   - `TypeVar`
   - `Generic`
   - `Protocol`
   - Other special typing constructs

2. When backward compatibility is required (Python < 3.9), use the typing module types.
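
A short sketch of how the exceptions combine with the main rule: built-in generics for containers, the typing module only for the special constructs. `SupportsName`, `first_or_none`, and `index_by_name` are illustrative names, not code from this repository.

```python
from typing import Optional, Protocol, TypeVar

T = TypeVar("T")


class SupportsName(Protocol):
    """Structural type for any object that exposes a ``name`` attribute."""

    name: str


def first_or_none(items: list[T]) -> Optional[T]:
    """Return the first element, or ``None`` when the list is empty."""
    return items[0] if items else None


def index_by_name(items: list[SupportsName]) -> dict[str, SupportsName]:
    """Build a lookup table keyed by each item's ``name``."""
    return {item.name: item for item in items}
```

Only `Optional`, `Protocol`, and `TypeVar` are imported from `typing`; `list` and `dict` are subscripted directly.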

## Benefits

1. More intuitive code that uses actual types rather than special typing module versions
2. Reduced imports from typing module
3. Better IDE support and type checking
4. Follows modern Python best practices
5. Makes the code more maintainable and easier to understand

## Implementation

When reviewing or modifying code:
1. Check for typing module imports
2. Replace typing module types with built-in types where applicable
3. Keep special typing module types when needed
4. Update docstrings to reflect the use of built-in types
5. Ensure type hints are consistent throughout the codebase
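
Steps 1 and 2 above could be partly automated. The following is a minimal sketch that uses the standard `ast` module to flag `from typing import ...` statements pulling in the legacy aliases listed in this rule. It does not detect attribute-style usage such as `typing.List[...]`, and the helper name `find_legacy_typing_imports` is hypothetical.

```python
import ast

# Legacy typing aliases named in this rule, mapped to their built-in replacements.
LEGACY_TO_BUILTIN: dict[str, str] = {
    "List": "list",
    "Dict": "dict",
    "Set": "set",
    "Tuple": "tuple",
    "FrozenSet": "frozenset",
}


def find_legacy_typing_imports(source: str) -> list[tuple[int, str, str]]:
    """Find legacy aliases imported from the typing module.

    Args:
        source: Python source code to inspect.

    Returns:
        Sorted ``(line number, legacy name, built-in replacement)`` tuples.
    """
    findings: list[tuple[int, str, str]] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module == "typing":
            for alias in node.names:
                if alias.name in LEGACY_TO_BUILTIN:
                    findings.append(
                        (node.lineno, alias.name, LEGACY_TO_BUILTIN[alias.name])
                    )
    return sorted(findings)
```

Imports of `Any`, `Optional`, and the other exceptions are deliberately left unreported.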
71 changes: 71 additions & 0 deletions .cursor/rules/uv-dependency-manager.mdc
@@ -0,0 +1,71 @@
---
description:
globs:
alwaysApply: true
---
# UV Dependency Manager Rule

This rule enforces the use of `uv` as the primary dependency management tool for Python projects.

## Requirements

1. All Python dependencies must be managed using `uv`:
   - Use `uv.lock` for lockfile
   - Use `pyproject.toml` for project metadata and dependencies
   - Use `requirements.txt` only for development dependencies or when explicitly required

2. Project Structure:
   - Must have a `pyproject.toml` file
   - Must have a `uv.lock` file
   - Should not use `poetry.lock` or `Pipfile.lock`

3. Commands:
   - Use `uv pip install` instead of `pip install`
   - Use `uv pip freeze` instead of `pip freeze`
   - Use `uv pip compile` for generating requirements files

4. Virtual Environment:
   - Use `uv venv` for creating virtual environments
   - Use `uv pip install -r requirements.txt` for installing dependencies
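
The project-structure requirements above lend themselves to a small automated check. The sketch below is one possible way to express them; the function name `check_uv_project_layout` is hypothetical, and the check only looks at file presence, not at file contents.

```python
from pathlib import Path

# Files the rule requires, and lockfiles from other tools that it discourages.
REQUIRED_FILES: tuple[str, ...] = ("pyproject.toml", "uv.lock")
DISALLOWED_FILES: tuple[str, ...] = ("poetry.lock", "Pipfile.lock")


def check_uv_project_layout(project_root: Path) -> list[str]:
    """Report deviations from the uv project layout.

    Args:
        project_root: Directory expected to contain the project files.

    Returns:
        Human-readable problems; an empty list means the layout conforms.
    """
    problems: list[str] = []
    for name in REQUIRED_FILES:
        if not (project_root / name).is_file():
            problems.append(f"missing required file: {name}")
    for name in DISALLOWED_FILES:
        if (project_root / name).exists():
            problems.append(f"unexpected lockfile from another tool: {name}")
    return problems
```

A check like this could run in CI before dependencies are installed.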

## Examples

✅ Correct:
```bash
# Creating a virtual environment
uv venv

# Installing dependencies
uv pip install -r requirements.txt

# Updating dependencies
uv pip install --upgrade -r requirements.txt
```

❌ Incorrect:
```bash
# Using pip directly
pip install -r requirements.txt

# Using poetry
poetry install

# Using pipenv
pipenv install
```

## Benefits

1. Faster dependency resolution and installation
2. Better reproducibility with lockfile
3. Modern Python packaging standards
4. Improved security with dependency scanning
5. Better integration with modern Python tooling

## Implementation

When implementing this rule:
1. Ensure all dependency management commands use `uv`
2. Keep `uv.lock` and `pyproject.toml` in version control
3. Document the use of `uv` in project README
4. Set up CI/CD to use `uv` for dependency installation
15 changes: 15 additions & 0 deletions .cursorignore
@@ -0,0 +1,15 @@
# Ignore all .env files
.env*

# Ignore all config files
config*

# Ignore all settings files
settings*

# Ignore all input files
input*

# Ignore all output files
output*

46 changes: 46 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,46 @@
name: Lint and Test

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    permissions:
      contents: read

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.13"

      - name: Install uv
        run: |
          curl -LsSf https://astral.sh/uv/install.sh | sh
          # Older installers put uv in ~/.cargo/bin; newer releases default to ~/.local/bin.
          echo "$HOME/.cargo/bin" >> "$GITHUB_PATH"
          echo "$HOME/.local/bin" >> "$GITHUB_PATH"

      - name: Install dependencies
        run: |
          make install

      - name: Run pylint
        run: |
          make check-lint

      - name: Run black
        run: |
          make check-format

      - name: Run mypy
        run: |
          make check-types

      - name: Run tests with coverage
        run: |
          make test
3 changes: 0 additions & 3 deletions .gitignore
@@ -12,9 +12,6 @@ wheels/
# Ignore macOS files
.DS_Store

-# Ignore cursor rules
-*.cursor
-
# Ignore mypy cache
.mypy_cache/

33 changes: 33 additions & 0 deletions Makefile
@@ -0,0 +1,33 @@
#!/usr/bin/make -f

.PHONY: help install check check-lint check-format format check-types test

help: ## Show this help text
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-30s\033[0m %s\n", $$1, $$2}'

install:
	@echo "Installing dependencies..."
	@uv sync --all-groups

check: check-lint check-format check-types
	@echo "All checks passed"

check-lint:
	@echo "Checking lint..."
	@uv run python -m pylint src/ tests/

# --check reports files that would be reformatted without rewriting them.
check-format:
	@echo "Checking format..."
	@uv run python -m black --check src/ tests/

format:
	@echo "Formatting..."
	@uv run python -m black src/ tests/

check-types:
	@echo "Checking types..."
	@uv run python -m mypy src/ tests/ --ignore-missing-imports --explicit-package-bases

test:
	@echo "Running tests..."
	@uv run pytest src/ tests/ --cov=src --cov-report=term-missing
38 changes: 37 additions & 1 deletion README.md
@@ -36,7 +36,7 @@ source .venv/bin/activate # On Unix/macOS
3. Install dependencies:

```bash
-uv pip install .
+uv pip install -e ".[dev]"
```

## Configuration
@@ -133,6 +133,42 @@ In order to evaluate only the OCR part of the output file, we can remove all mar
uv run src/evaluate.py -f <filename> -m
```

## Running Tests

### Unit Tests

To run all unit tests:
```bash
uv run pytest tests/ -v
```

To run tests with coverage:
```bash
uv run pytest tests/ -v --cov=src --cov-report=term-missing
```

### Integration Tests

Integration tests require external services to be configured. To run them:
```bash
uv run pytest tests/ -v -m integration
```

### Test Categories

- `test_base.py`: Tests for the base parser functionality
- `test_run.py`: Tests for the main script functionality
- `test_integration.py`: Integration tests requiring external services

## Test Configuration

The test suite is configured in `pytest.ini` with the following settings:

- Test paths: `tests/`
- Test file pattern: `test_*.py`
- Coverage reporting: enabled when `--cov` is passed (as `make test` does)
- Integration test marker: `@pytest.mark.integration`

## Development

### Code Style
9 changes: 8 additions & 1 deletion pyproject.toml
@@ -23,10 +23,17 @@ dependencies = [
"pymupdf>=1.25.4",
"tenacity>=9.0.0",
"rouge>=1.0.1",
+"huggingface-hub>=0.30.2",
]

[dependency-groups]
-dev = ["black>=25.1.0", "mypy>=1.15.0", "pylint>=3.3.6"]
+dev = [
+    "black>=25.1.0",
+    "mypy>=1.15.0",
+    "pylint>=3.3.6",
+    "pytest>=8.3.5",
+    "pytest-cov>=6.1.1",
+]
eval = [
"python-levenshtein>=0.27.1",
"pandas>=2.2.3",
28 changes: 28 additions & 0 deletions pytest.ini
@@ -0,0 +1,28 @@
[pytest]
# Set log level
log_cli = true
log_cli_level = INFO
log_cli_format = %(asctime)s [%(levelname)8s] %(message)s (%(filename)s:%(lineno)s)
log_cli_date_format = %Y-%m-%d %H:%M:%S

# Add src directory to Python path
pythonpath = ./src

# Define test markers
markers =
    unit: Unit tests that do not require external services
    integration: Integration tests that require external services

# Test collection settings
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*

# Test execution settings
addopts = -v --tb=short

# Filter warnings
filterwarnings =
    ignore::DeprecationWarning
    ignore::UserWarning
Empty file added src/__init__.py