8 changes: 8 additions & 0 deletions config.yaml
@@ -0,0 +1,8 @@
model: "gpt-4o-2024-11-20"
toc_check_page_num: 20
max_page_num_each_node: 10
max_token_num_each_node: 20000
if_add_node_id: true
if_add_node_summary: true
if_add_doc_description: false
if_add_node_text: false
147 changes: 147 additions & 0 deletions pageindex.egg-info/PKG-INFO
@@ -0,0 +1,147 @@
Metadata-Version: 2.4
Name: pageindex
Version: 0.1.0
Summary: Vectorless, reasoning-based RAG indexer
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai==1.101.0
Requires-Dist: pymupdf==1.26.4
Requires-Dist: PyPDF2==3.0.1
Requires-Dist: python-dotenv==1.1.0
Requires-Dist: tiktoken==0.11.0
Requires-Dist: pyyaml==6.0.2
Requires-Dist: pydantic>=2.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Dynamic: license-file

<div align="center">

<a href="https://vectify.ai/pageindex" target="_blank">
<img src="https://github.com/user-attachments/assets/46201e72-675b-43bc-bfbd-081cc6b65a1d" alt="PageIndex Banner" />
</a>

<br/>
<br/>

<p align="center">
<a href="https://trendshift.io/repositories/14736" target="_blank"><img src="https://trendshift.io/api/badge/repositories/14736" alt="VectifyAI%2FPageIndex | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
</p>

# PageIndex: Reasoning-Based Vectorless RAG

<p align="center"><b>Reasoning-native RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</b></p>

<h4 align="center">
<a href="https://vectify.ai">🏠 Homepage</a>&nbsp; • &nbsp;
<a href="https://chat.pageindex.ai">🖥️ Chat Platform</a>&nbsp; • &nbsp;
<a href="https://pageindex.ai/mcp">🔌 MCP</a>&nbsp; • &nbsp;
<a href="https://docs.pageindex.ai">📚 Documentation</a>&nbsp; • &nbsp;
<a href="https://discord.com/invite/VuXuf29EUj">💬 Discord</a>&nbsp; • &nbsp;
<a href="https://ii2abc2jejf.typeform.com/to/tK3AXl8T">✉️ Contact Us</a>&nbsp;
</h4>

</div>

<details open>
<summary><h3>📢 Latest Updates</h3></summary>

**🔥 Releases:**
- [**PageIndex Chat**](https://chat.pageindex.ai): The first human-like agentic platform for document analysis, built for professional long-context documents. Also available via [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart) (beta).

**📝 Articles:**
- [**PageIndex Framework**](https://pageindex.ai/blog/pageindex-intro): Introduces the PageIndex framework — an *agentic, in-context tree index* that empowers LLMs to perform *reasoning-based, human-like retrieval* over long documents without a Vector DB or chunking.

**🧪 Cookbooks:**
- [Vectorless RAG](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): A minimal, practical example of reasoning-based RAG using PageIndex. No vectors, no chunks, and human-like retrieval.
- [Vision-based Vectorless RAG](https://docs.pageindex.ai/cookbook/vision-rag-pageindex): Vision-only RAG without OCR; a reasoning-native approach that acts directly over PDF page images.
</details>

---

# 📑 Introduction to PageIndex

Tired of poor retrieval accuracy with Vector DBs on long, professional documents? Traditional vector RAG relies on semantic *similarity* rather than true *relevance*. But **similarity ≠ relevance** — what we need for retrieval is **relevance**, and relevance requires **reasoning**. When dealing with professional documents where domain knowledge and multi-step reasoning matter, similarity search often fails.

Inspired by AlphaGo, we propose **[PageIndex](https://vectify.ai/pageindex)** — a reasoning-based, **Vectorless RAG** framework that builds a **hierarchical tree index** from long documents and prompts the LLM to **reason over this index** for **agentic, context-aware retrieval**.

---

# ⚙️ Package Usage

### 1. Install Dependencies

```bash
pip3 install --upgrade -r requirements.txt
pip3 install -e .
```

### 2. Provide your OpenAI API Key

Create a `.env` file in the root directory and add your API key:

```bash
OPENAI_API_KEY=your_openai_key_here
```

### 3. Run PageIndex on your PDF

```bash
pageindex --pdf_path /path/to/your/document.pdf
```

---

# 💻 Developer Guide

This section is for developers contributing to `PageIndex` or integrating it as a library.

### Development Setup

1. **Clone the repository:**
```bash
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
```

2. **Install development dependencies:**
```bash
pip install -e ".[dev]"
# Or, if the package itself is already installed, add only the test tools:
pip install pytest pytest-asyncio
```

3. **Run Tests:**
We use `pytest` for unit and integration testing.
```bash
pytest
```

### Project Structure

The project has been refactored into a modular library structure under `pageindex`.

- `pageindex/core/`: Core logic modules.
  - `llm.py`: LLM interactions and token counting.
  - `pdf.py`: PDF text extraction and processing.
  - `tree.py`: Tree data structure manipulation and recursion.
  - `logging.py`: Custom logging utilities.
- `pageindex/config.py`: Configuration loading and validation (Pydantic).
- `pageindex/cli.py`: Command Line Interface entry point.
- `pageindex/utils.py`: Facade for backward compatibility.

### Configuration

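Configuration is handled by `pageindex/config.py`. Default settings live in `config.yaml`. To use a different file, set the `PAGEINDEX_CONFIG` environment variable to its path; to override individual settings, pass CLI arguments.
Every merged configuration is validated with Pydantic, so a value of the wrong type is rejected with an error instead of being silently accepted.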
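
The lookup order for the configuration file can be sketched as follows. This is a minimal, standard-library-only illustration, not the package's code: `resolve_config_path`, its `cwd`, `fallback`, and `env` parameters, and the example paths are hypothetical names chosen for the demonstration. Only the `PAGEINDEX_CONFIG` variable and the `config.yaml` file name come from the project.

```python
import os
import tempfile
from pathlib import Path
from typing import Mapping, Optional


def resolve_config_path(
    cwd: Path,
    fallback: Path,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Return the config file to use.

    Order: PAGEINDEX_CONFIG, then config.yaml in the working directory,
    then the fallback path (for example the copy shipped with the repo).
    """
    env = os.environ if env is None else env
    env_path = env.get("PAGEINDEX_CONFIG")
    if env_path:
        # An explicit environment variable always wins.
        return Path(env_path)
    cwd_path = cwd / "config.yaml"
    return cwd_path if cwd_path.exists() else fallback


# Demonstration in a throwaway directory (all paths are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    repo_default = workdir / "repo" / "config.yaml"
    # No local file yet, so the fallback is chosen.
    first = resolve_config_path(workdir, repo_default, env={})
    (workdir / "config.yaml").write_text("model: gpt-4o\n", encoding="utf-8")
    # Now the local file takes precedence over the fallback.
    second = resolve_config_path(workdir, repo_default, env={})
```

Passing `env` explicitly keeps the function easy to test without mutating the real process environment.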

For API Reference, please see [API_REFERENCE.md](docs/API_REFERENCE.md).

---

# ⭐ Support Us

Give us a star 🌟 if you like the project. Thank you!
28 changes: 28 additions & 0 deletions pageindex.egg-info/SOURCES.txt
@@ -0,0 +1,28 @@
LICENSE
README.md
pyproject.toml
pageindex/__init__.py
pageindex/cli.py
pageindex/config.py
pageindex/page_index.py
pageindex/page_index_md.py
pageindex/utils.py
pageindex.egg-info/PKG-INFO
pageindex.egg-info/SOURCES.txt
pageindex.egg-info/dependency_links.txt
pageindex.egg-info/entry_points.txt
pageindex.egg-info/requires.txt
pageindex.egg-info/top_level.txt
pageindex/core/__init__.py
pageindex/core/llm.py
pageindex/core/logging.py
pageindex/core/pdf.py
pageindex/core/tree.py
scripts/analyze_notebooks.py
scripts/local_client_adapter.py
scripts/refactor_notebooks_logic.py
scripts/verify_adapter.py
tests/conftest.py
tests/test_config.py
tests/test_llm.py
tests/test_tree.py
1 change: 1 addition & 0 deletions pageindex.egg-info/dependency_links.txt
@@ -0,0 +1 @@

2 changes: 2 additions & 0 deletions pageindex.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
[console_scripts]
pageindex = pageindex.cli:main
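
A `console_scripts` entry of this form makes the installer generate a `pageindex` executable that imports `pageindex.cli` and calls its `main` function. The actual `pageindex/cli.py` is not shown in this diff, so the following is only a hypothetical sketch of what such an entry point could look like. `build_parser` and the printed message are assumptions; the `--pdf_path` flag is the one used in the README.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real CLI may accept more options.
    parser = argparse.ArgumentParser(prog="pageindex")
    parser.add_argument("--pdf_path", required=True, help="Path to the PDF to index")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    """Entry point target: parse arguments and return a process exit code."""
    # argv=None makes argparse read sys.argv[1:], which is what the
    # generated console script relies on.
    args = build_parser().parse_args(argv)
    print(f"Would index: {args.pdf_path}")
    return 0
```

Accepting an optional `argv` lets tests call `main(["--pdf_path", "doc.pdf"])` directly instead of spawning a subprocess.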
11 changes: 11 additions & 0 deletions pageindex.egg-info/requires.txt
@@ -0,0 +1,11 @@
openai==1.101.0
pymupdf==1.26.4
PyPDF2==3.0.1
python-dotenv==1.1.0
tiktoken==0.11.0
pyyaml==6.0.2
pydantic>=2.0

[dev]
pytest>=7.4.0
pytest-asyncio>=0.21.0
6 changes: 6 additions & 0 deletions pageindex.egg-info/top_level.txt
@@ -0,0 +1,6 @@
data
docs
notebooks
pageindex
scripts
tests
90 changes: 90 additions & 0 deletions pageindex/config.py
@@ -0,0 +1,90 @@
import os
import yaml
from pathlib import Path
from typing import Any, Dict, Optional, Union
from pydantic import BaseModel, ConfigDict, Field, ValidationError


class PageIndexConfig(BaseModel):
    """
    Configuration schema for PageIndex.
    """
    model: str = Field(default="gpt-4o", description="LLM model to use")

    # PDF Processing
    toc_check_page_num: int = Field(default=3, description="Number of pages to check for TOC")
    max_page_num_each_node: int = Field(default=5, description="Maximum pages per leaf node")
    max_token_num_each_node: int = Field(default=4000, description="Max tokens per node")  # Approx

    # Enrichment
    if_add_node_id: bool = Field(default=True, description="Add unique ID to nodes")
    if_add_node_summary: bool = Field(default=True, description="Generate summary for nodes")
    if_add_doc_description: bool = Field(default=True, description="Generate doc-level description")
    if_add_node_text: bool = Field(default=True, description="Keep raw text in nodes")

    # Tree Optimization
    if_thinning: bool = Field(default=True, description="Merge small adjacent nodes")
    thinning_threshold: int = Field(default=500, description="Token threshold for merging")
    summary_token_threshold: int = Field(default=200, description="Min tokens required to trigger summary generation")

    # Additional
    api_key: Optional[str] = Field(default=None, description="OpenAI API Key (optional, prefers env var)")

    # Pydantic v2 style; the nested `class Config` form is deprecated in v2.
    model_config = ConfigDict(arbitrary_types_allowed=True)


class ConfigLoader:
    def __init__(self, default_path: Optional[Union[str, Path]] = None):
        if default_path is None:
            env_path = os.getenv("PAGEINDEX_CONFIG")
            if env_path:
                default_path = Path(env_path)
            else:
                cwd_path = Path.cwd() / "config.yaml"
                # parents[0] is the package directory, parents[1] the repository root
                repo_path = Path(__file__).resolve().parents[1] / "config.yaml"
                default_path = cwd_path if cwd_path.exists() else repo_path
        else:
            # Accept plain strings as well as Path objects
            default_path = Path(default_path)

        self.default_path = default_path
        self._default_dict = self._load_yaml(default_path) if default_path else {}

    @staticmethod
    def _load_yaml(path: Optional[Path]) -> Dict[str, Any]:
        # A missing file is not an error: fall back to the schema defaults
        if not path or not path.exists():
            return {}
        try:
            with open(path, "r", encoding="utf-8") as f:
                return yaml.safe_load(f) or {}
        except Exception as e:
            print(f"Warning: Failed to load config from {path}: {e}")
            return {}

    def load(self, user_opt: Optional[Union[Dict[str, Any], Any]] = None) -> PageIndexConfig:
        """
        Load configuration, merging defaults with user overrides and validating via Pydantic.

        Args:
            user_opt: Dictionary or object with overrides.

        Returns:
            PageIndexConfig: Validated configuration object.
        """
        user_dict: Dict[str, Any] = {}
        if user_opt is None:
            pass
        elif isinstance(user_opt, dict):
            # Check dicts first: dict subclasses also expose __dict__
            user_dict = {k: v for k, v in user_opt.items() if v is not None}
        elif hasattr(user_opt, '__dict__'):
            # Handle SimpleNamespace or other objects
            user_dict = {k: v for k, v in vars(user_opt).items() if v is not None}
        else:
            raise TypeError(f"user_opt must be dict or object, got {type(user_opt)}")

        # Merge file defaults and user overrides; later keys win
        merged_data = {**self._default_dict, **user_dict}

        try:
            return PageIndexConfig(**merged_data)
        except ValidationError as e:
            # Chain the original error so the Pydantic details stay in the traceback
            raise ValueError(f"Configuration validation failed: {e}") from e
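
The merge rule used by `load` (schema defaults, then values from the YAML file, then user overrides, with `None` overrides ignored) can be illustrated with a standard-library-only sketch. `DemoConfig` and `merge_config` are hypothetical stand-ins invented for this example; a plain dataclass replaces the Pydantic model, so no type validation happens here.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Dict, Optional


@dataclass
class DemoConfig:
    # Hypothetical stand-in for PageIndexConfig with three of its fields.
    model: str = "gpt-4o"
    toc_check_page_num: int = 3
    if_add_node_text: bool = True


def merge_config(file_values: Dict[str, Any], user_opt: Optional[Any] = None) -> DemoConfig:
    """Layer user overrides on top of file values, skipping None overrides."""
    if user_opt is None:
        overrides: Dict[str, Any] = {}
    elif isinstance(user_opt, dict):
        overrides = dict(user_opt)
    else:
        # Namespace-like objects, e.g. parsed CLI arguments
        overrides = dict(vars(user_opt))
    # None means "not provided", so it must not clobber a file value.
    overrides = {k: v for k, v in overrides.items() if v is not None}
    return DemoConfig(**{**file_values, **overrides})


file_values = {"model": "gpt-4o-2024-11-20", "toc_check_page_num": 20}
cli_args = SimpleNamespace(model=None, toc_check_page_num=5)
cfg = merge_config(file_values, cli_args)
# cfg.model comes from the file, cfg.toc_check_page_num from the CLI,
# and cfg.if_add_node_text from the schema default.
```

Filtering on `is not None` rather than on truthiness matters: an explicit `False` or `0` override is still applied, while an unset CLI flag is not.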
Empty file added pageindex/core/__init__.py
Empty file.