VectifyAI · denis-samatov · Mar 26, 2026
diff --git a/config.yaml b/config.yaml
@@ -0,0 +1,8 @@
+model: "gpt-4o-2024-11-20"
+toc_check_page_num: 20
+max_page_num_each_node: 10
+max_token_num_each_node: 20000
+if_add_node_id: true
+if_add_node_summary: true
+if_add_doc_description: false
+if_add_node_text: false
diff --git a/pageindex.egg-info/PKG-INFO b/pageindex.egg-info/PKG-INFO
@@ -0,0 +1,147 @@
+Metadata-Version: 2.4
+Name: pageindex
+Version: 0.1.0
+Summary: Vectorless, reasoning-based RAG indexer
+License: MIT
+Requires-Python: >=3.9
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: openai==1.101.0
+Requires-Dist: pymupdf==1.26.4
+Requires-Dist: PyPDF2==3.0.1
+Requires-Dist: python-dotenv==1.1.0
+Requires-Dist: tiktoken==0.11.0
+Requires-Dist: pyyaml==6.0.2
+Requires-Dist: pydantic>=2.0
+Provides-Extra: dev
+Requires-Dist: pytest>=7.4.0; extra == "dev"
+Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
+Dynamic: license-file
+
+<div align="center">
+
+<a href="https://vectify.ai/pageindex" target="_blank">
+  <img src="https://github.com/user-attachments/assets/46201e72-675b-43bc-bfbd-081cc6b65a1d" alt="PageIndex Banner" />
+</a>
+
+<br/>
+<br/>
+
+<p align="center">
+  <a href="https://trendshift.io/repositories/14736" target="_blank"><img src="https://trendshift.io/api/badge/repositories/14736" alt="VectifyAI%2FPageIndex | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
+</p>
+
+# PageIndex: Reasoning-Based Vectorless RAG
+
+<p align="center"><b>Reasoning-native RAG&nbsp; ◦ &nbsp;No Vector DB&nbsp; ◦ &nbsp;No Chunking&nbsp; ◦ &nbsp;Human-like Retrieval</b></p>
+
+<h4 align="center">
+  <a href="https://vectify.ai">🏠 Homepage</a>&nbsp; • &nbsp;
+  <a href="https://chat.pageindex.ai">🖥️ Chat Platform</a>&nbsp; • &nbsp;
+  <a href="https://pageindex.ai/mcp">🔌 MCP</a>&nbsp; • &nbsp;
+  <a href="https://docs.pageindex.ai">📚 Documentation</a>&nbsp; • &nbsp;
+  <a href="https://discord.com/invite/VuXuf29EUj">💬 Discord</a>&nbsp; • &nbsp;
+  <a href="https://ii2abc2jejf.typeform.com/to/tK3AXl8T">✉️ Contact Us</a>&nbsp;
+</h4>
+
+</div>
+
+<details open>
+<summary><h3>📢 Latest Updates</h3></summary>
+
+ **🔥 Releases:**
+- [**PageIndex Chat**](https://chat.pageindex.ai): The first human-like agentic platform for document analysis, built for professional long-context documents. Also available via [MCP](https://pageindex.ai/mcp) or [API](https://docs.pageindex.ai/quickstart) (beta).
+
+ **📝 Articles:**
+- [**PageIndex Framework**](https://pageindex.ai/blog/pageindex-intro): Introduces the PageIndex framework — an *agentic, in-context tree index* that empowers LLMs to perform *reasoning-based, human-like retrieval* over long documents without a Vector DB or chunking.
+
+ **🧪 Cookbooks:**
+- [Vectorless RAG](https://docs.pageindex.ai/cookbook/vectorless-rag-pageindex): A minimal, practical example of reasoning-based RAG using PageIndex. No vectors, no chunks, and human-like retrieval.
+- [Vision-based Vectorless RAG](https://docs.pageindex.ai/cookbook/vision-rag-pageindex): Vision-only RAG without OCR; a reasoning-native approach that acts directly over PDF page images.
+</details>
+
+---
+
+# 📑 Introduction to PageIndex
+
+Tired of poor retrieval accuracy with Vector DBs on long, professional documents? Traditional vector RAG relies on semantic *similarity* rather than true *relevance*. But **similarity ≠ relevance** — what we need for retrieval is **relevance**, and relevance requires **reasoning**. When dealing with professional documents where domain knowledge and multi-step reasoning matter, similarity search often fails.
+
+Inspired by AlphaGo, we propose **[PageIndex](https://vectify.ai/pageindex)** — a reasoning-based, **Vectorless RAG** framework that builds a **hierarchical tree index** from long documents and prompts the LLM to **reason over this index** for **agentic, context-aware retrieval**.
+
+---
+
+# ⚙️ Package Usage
+
+### 1. Install Dependencies
+
+```bash
+pip3 install --upgrade -r requirements.txt
+pip3 install -e .
+```
+
+### 2. Provide your OpenAI API Key
+
+Create a `.env` file in the root directory and add your API key:
+
+```bash
+OPENAI_API_KEY=your_openai_key_here
+```
+
+### 3. Run PageIndex on your PDF
+
+```bash
+pageindex --pdf_path /path/to/your/document.pdf
+```
+
+---
+
+# 💻 Developer Guide
+
+This section is for developers contributing to `PageIndex` or integrating it as a library.
+
+### Development Setup
+
+1.  **Clone the repository:**
+    ```bash
+    git clone https://github.com/VectifyAI/PageIndex.git
+    cd PageIndex
+    ```
+
+2.  **Install development dependencies:**
+    ```bash
+    pip install -e ".[dev]"
+    # Or, if the package itself is already installed, add only the test tools:
+    pip install pytest pytest-asyncio
+    ```
+
+3.  **Run Tests:**
+    We use `pytest` for unit and integration testing.
+    ```bash
+    pytest
+    ```
+
+### Project Structure
+
+The project has been refactored into a modular library structure under `pageindex`.
+
+-   `pageindex/core/`: Core logic modules.
+    -   `llm.py`: LLM interactions and token counting.
+    -   `pdf.py`: PDF text extraction and processing.
+    -   `tree.py`: Tree data structure manipulation and recursion.
+    -   `logging.py`: Custom logging utilities.
+-   `pageindex/config.py`: Configuration loading and validation (Pydantic).
+-   `pageindex/cli.py`: Command Line Interface entry point.
+-   `pageindex/utils.py`: Facade for backward compatibility.
+
+### Configuration
+
+Configuration is handled by `pageindex/config.py`. Default settings live in `config.yaml`; set the `PAGEINDEX_CONFIG` environment variable to load a different YAML file, or override individual settings via CLI arguments.
+All settings are validated against a Pydantic schema, so invalid values fail at load time with a validation error.
+
+For the API reference, see [API_REFERENCE.md](docs/API_REFERENCE.md).
+
+---
+
+# ⭐ Support Us
+
+Give us a star 🌟 if you like the project. Thank you!
diff --git a/pageindex.egg-info/SOURCES.txt b/pageindex.egg-info/SOURCES.txt
@@ -0,0 +1,28 @@
+LICENSE
+README.md
+pyproject.toml
+pageindex/__init__.py
+pageindex/cli.py
+pageindex/config.py
+pageindex/page_index.py
+pageindex/page_index_md.py
+pageindex/utils.py
+pageindex.egg-info/PKG-INFO
+pageindex.egg-info/SOURCES.txt
+pageindex.egg-info/dependency_links.txt
+pageindex.egg-info/entry_points.txt
+pageindex.egg-info/requires.txt
+pageindex.egg-info/top_level.txt
+pageindex/core/__init__.py
+pageindex/core/llm.py
+pageindex/core/logging.py
+pageindex/core/pdf.py
+pageindex/core/tree.py
+scripts/analyze_notebooks.py
+scripts/local_client_adapter.py
+scripts/refactor_notebooks_logic.py
+scripts/verify_adapter.py
+tests/conftest.py
+tests/test_config.py
+tests/test_llm.py
+tests/test_tree.py
diff --git a/pageindex.egg-info/dependency_links.txt b/pageindex.egg-info/dependency_links.txt
@@ -0,0 +1 @@
+
diff --git a/pageindex.egg-info/entry_points.txt b/pageindex.egg-info/entry_points.txt
@@ -0,0 +1,2 @@
+[console_scripts]
+pageindex = pageindex.cli:main
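The `console_scripts` entry above maps the `pageindex` command to a `main` callable in `pageindex/cli.py`, a file this diff does not show. As a rough sketch only: the `--pdf_path` flag comes from the README usage example, while `build_parser`, the `--model` flag, and the body of `main` are assumptions made for illustration, not the project's actual CLI.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build the argument parser for the hypothetical CLI."""
    parser = argparse.ArgumentParser(prog="pageindex")
    # --pdf_path matches the flag shown in the README usage example.
    parser.add_argument("--pdf_path", required=True, help="Path to the input PDF")
    # --model is an assumed override flag; None means "use the config default".
    parser.add_argument("--model", default=None, help="LLM model name (assumed flag)")
    return parser


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Entry-point target; the installed `pageindex` script calls main() with no arguments."""
    args = build_parser().parse_args(argv)  # argv=None falls back to sys.argv[1:]
    print(f"Would index {args.pdf_path} (model override: {args.model})")
    return 0  # the generated console-script wrapper passes this value to sys.exit()
```

Accepting an optional `argv` keeps the function callable from tests without touching `sys.argv`, while still working when invoked through the installed script.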
diff --git a/pageindex.egg-info/requires.txt b/pageindex.egg-info/requires.txt
@@ -0,0 +1,11 @@
+openai==1.101.0
+pymupdf==1.26.4
+PyPDF2==3.0.1
+python-dotenv==1.1.0
+tiktoken==0.11.0
+pyyaml==6.0.2
+pydantic>=2.0
+
+[dev]
+pytest>=7.4.0
+pytest-asyncio>=0.21.0
diff --git a/pageindex.egg-info/top_level.txt b/pageindex.egg-info/top_level.txt
@@ -0,0 +1,6 @@
+data
+docs
+notebooks
+pageindex
+scripts
+tests
diff --git a/pageindex/config.py b/pageindex/config.py
@@ -0,0 +1,90 @@
+import os
+import yaml
+from pathlib import Path
+from typing import Any, Dict, Optional, Union
+from pydantic import BaseModel, Field, ValidationError
+
+class PageIndexConfig(BaseModel):
+    """
+    Configuration schema for PageIndex.
+    """
+    model: str = Field(default="gpt-4o", description="LLM model to use")
+
+    # PDF Processing
+    toc_check_page_num: int = Field(default=3, description="Number of pages to check for TOC")
+    max_page_num_each_node: int = Field(default=5, description="Maximum pages per leaf node")
+    max_token_num_each_node: int = Field(default=4000, description="Max tokens per node") # Approx
+
+    # Enrichment
+    if_add_node_id: bool = Field(default=True, description="Add unique ID to nodes")
+    if_add_node_summary: bool = Field(default=True, description="Generate summary for nodes")
+    if_add_doc_description: bool = Field(default=True, description="Generate doc-level description")
+    if_add_node_text: bool = Field(default=True, description="Keep raw text in nodes")
+
+    # Tree Optimization
+    if_thinning: bool = Field(default=True, description="Merge small adjacent nodes")
+    thinning_threshold: int = Field(default=500, description="Token threshold for merging")
+    summary_token_threshold: int = Field(default=200, description="Min tokens required to trigger summary generation")
+
+    # Additional
+    api_key: Optional[str] = Field(default=None, description="OpenAI API Key (optional, prefers env var)")
+
+    # Pydantic v2 style; replaces the deprecated inner `class Config`
+    model_config = {"arbitrary_types_allowed": True}
+
+
+class ConfigLoader:
+    def __init__(self, default_path: Optional[Union[str, Path]] = None):
+        if default_path is None:
+            env_path = os.getenv("PAGEINDEX_CONFIG")
+            if env_path:
+                default_path = Path(env_path)
+            else:
+                cwd_path = Path.cwd() / "config.yaml"
+                repo_path = Path(__file__).resolve().parents[1] / "config.yaml"  # repo root
+                default_path = cwd_path if cwd_path.exists() else repo_path
+
+        self.default_path = Path(default_path)  # accept str as well as Path
+        self._default_dict = self._load_yaml(self.default_path)
+
+    @staticmethod
+    def _load_yaml(path: Optional[Path]) -> Dict[str, Any]:
+        if not path or not path.exists():
+            return {}
+        try:
+            with open(path, "r", encoding="utf-8") as f:
+                return yaml.safe_load(f) or {}
+        except Exception as e:
+            print(f"Warning: Failed to load config from {path}: {e}")
+            return {}
+
+    def load(self, user_opt: Optional[Union[Dict[str, Any], Any]] = None) -> PageIndexConfig:
+        """
+        Load configuration, merging defaults with user overrides and validating via Pydantic.
+
+        Args:
+            user_opt: Dictionary or object with overrides.
+
+        Returns:
+            PageIndexConfig: Validated configuration object.
+        """
+        user_dict: Dict[str, Any] = {}
+        if user_opt is None:
+            pass
+        elif isinstance(user_opt, dict):  # check dict first: dict subclasses also have __dict__
+            user_dict = {k: v for k, v in user_opt.items() if v is not None}
+        elif hasattr(user_opt, '__dict__'):
+            # Handle SimpleNamespace or other objects
+            user_dict = {k: v for k, v in vars(user_opt).items() if v is not None}
+        else:
+            raise TypeError(f"user_opt must be dict or object, got {type(user_opt)}")
+
+        # Merge YAML defaults with user overrides; on key conflicts the
+        # user-supplied value wins because it is unpacked last.
+        merged_data = {**self._default_dict, **user_dict}
+
+        try:
+            return PageIndexConfig(**merged_data)
+        except ValidationError as e:
+            # Wrap in ValueError with context, keeping the original traceback
+            raise ValueError(f"Configuration validation failed: {e}") from e
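The precedence that `ConfigLoader.load` implements can be sketched without Pydantic or PyYAML: schema defaults are overridden by `config.yaml`, which is in turn overridden by any user option that is not `None`. In the sketch below, `merge_config` is a hypothetical helper, and the schema defaults are modeled as a plain dict (an assumption; in the diff, Pydantic supplies them for missing keys). The sample values are copied from `PageIndexConfig` and `config.yaml` above.

```python
from types import SimpleNamespace
from typing import Any, Dict, Optional


def merge_config(
    schema_defaults: Dict[str, Any],
    yaml_values: Dict[str, Any],
    user_opt: Optional[Any] = None,
) -> Dict[str, Any]:
    """Merge three layers; later layers win, and None overrides are ignored."""
    if user_opt is None:
        overrides: Dict[str, Any] = {}
    elif isinstance(user_opt, dict):
        overrides = dict(user_opt)
    else:
        overrides = dict(vars(user_opt))  # e.g. SimpleNamespace or argparse.Namespace
    # Drop None so an unset CLI flag does not clobber a YAML or schema value.
    overrides = {k: v for k, v in overrides.items() if v is not None}
    return {**schema_defaults, **yaml_values, **overrides}


# Schema defaults from PageIndexConfig, then the values from config.yaml.
schema_defaults = {"model": "gpt-4o", "toc_check_page_num": 3, "if_add_node_text": True}
yaml_values = {"model": "gpt-4o-2024-11-20", "toc_check_page_num": 20, "if_add_node_text": False}
cli_args = SimpleNamespace(model=None, toc_check_page_num=5)

merged = merge_config(schema_defaults, yaml_values, cli_args)
print(merged)
```

With these inputs, `model` and `if_add_node_text` keep their YAML values because the CLI left them unset, while `toc_check_page_num` takes the explicit override of 5.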
diff --git a/pageindex/core/__init__.py b/pageindex/core/__init__.py