diff --git a/your-code/main.ipynb b/your-code/main.ipynb deleted file mode 100644 index e3a225a..0000000 --- a/your-code/main.ipynb +++ /dev/null @@ -1,709 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "RsnCPbdkxYZd" - }, - "source": [ - "
\n", - "

Self-Guided Lab: Retrieval-Augmented Generation (RAGs)

\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tZp4BQAVxYZj" - }, - "source": [ - "
\n", - " \"NLP\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "gizk6HCYxYZo" - }, - "source": [ - "

Data Storage & Retrieval

\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QW5UOI8ZxYZp" - }, - "source": [ - "

PyPDFLoader

\n", - "\n", - "`PyPDFLoader` is a lightweight Python library designed to streamline the process of loading and parsing PDF documents for text processing tasks. It is particularly useful in Retrieval-Augmented Generation workflows where text extraction from PDFs is required.\n", - "\n", - "- **What Does PyPDFLoader Do?**\n", - " - Extracts text from PDF files, retaining formatting and layout.\n", - " - Simplifies the preprocessing of document-based datasets.\n", - " - Supports efficient and scalable loading of large PDF collections.\n", - "\n", - "- **Key Features:**\n", - " - Compatible with popular NLP libraries and frameworks.\n", - " - Handles multi-page PDFs and embedded images (e.g., OCR-compatible setups).\n", - " - Provides flexible configurations for structured text extraction.\n", - "\n", - "- **Use Cases:**\n", - " - Preparing PDF documents for retrieval-based systems in RAGs.\n", - " - Automating the text extraction pipeline for document analysis.\n", - " - Creating datasets from academic papers, technical manuals, and reports.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install langchain langchain_community pypdf\n", - "%pip install termcolor langchain_openai langchain-huggingface sentence-transformers chromadb langchain_chroma tiktoken openai python-dotenv\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "6heKZkQUxYZr" - }, - "outputs": [], - "source": [ - "import os\n", - "from langchain.document_loaders import PyPDFLoader\n", - "from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter\n", - "import warnings\n", - "warnings.filterwarnings('ignore')\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sRS44B2XxYZs", - "vscode": { - "languageId": "plaintext" - } - }, - "source": [ - "

Loading the Documents

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "cuREtJRixYZt" - }, - "outputs": [], - "source": [ - "# File path for the document\n", - "\n", - "file_path = \"LAB/ai-for-everyone.pdf\"" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pz_8SOLxxYZt" - }, - "source": [ - "

Documents into Pages

\n", - "\n", - "The `PyPDFLoader` library allows efficient loading and splitting of PDF documents into smaller, manageable parts for NLP tasks.\n", - "\n", - "This functionality is particularly useful in workflows requiring granular text processing, such as Retrieval-Augmented Generation (RAG).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "_b5Z_45UxYZu", - "outputId": "a600d69f-14fe-4492-f236-97261d6ff36c" - }, - "outputs": [], - "source": [ - "# Load and split the document\n", - "loader = PyPDFLoader(file_path)\n", - "pages = loader.load_and_split()\n", - "len(pages)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wt50NRQaxYZv" - }, - "source": [ - "

Pages into Chunks

\n", - "\n", - "\n", - "#### RecursiveCharacterTextSplitter in LangChain\n", - "\n", - "The `RecursiveCharacterTextSplitter` is the **recommended splitter** in LangChain when you want to break down long documents into smaller, semantically meaningful chunks — especially useful in **RAG pipelines**, where clean context chunks lead to better LLM responses.\n", - "\n", - "#### Parameters\n", - "\n", - "| Parameter | Description |\n", - "|-----------------|-----------------------------------------------------------------------------|\n", - "| `chunk_size` | The **maximum number of characters** allowed in a chunk (e.g., `1000`). |\n", - "| `chunk_overlap` | The number of **overlapping characters** between consecutive chunks (e.g., `200`). This helps preserve context continuity. |\n", - "\n", - "#### How it works\n", - "`RecursiveCharacterTextSplitter` attempts to split the text **intelligently**, trying the following separators in order:\n", - "1. Paragraphs (`\"\\n\\n\"`)\n", - "2. Lines (`\"\\n\"`)\n", - "3. Sentences or words (`\" \"`)\n", - "4. 
Individual characters (as a last resort)\n", - "\n", - "This makes it ideal for handling **natural language documents**, such as PDFs, articles, or long reports, without breaking sentences or paragraphs in awkward ways.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "text_splitter = RecursiveCharacterTextSplitter(\n", - " chunk_size=1000,\n", - " chunk_overlap=200\n", - ")\n", - "chunks = text_splitter.split_documents(pages)\n", - "\n", - "len(chunks)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Alternative: CharacterTextSplitter\n", - "\n", - "`CharacterTextSplitter` is a simpler splitter that breaks text into chunks using a **single separator and character count**, without recursively trying to preserve natural language structure.\n", - "\n", - "##### Example:\n", - "```python\n", - "from langchain_text_splitters import CharacterTextSplitter\n", - "\n", - "text_splitter = CharacterTextSplitter(\n", - " chunk_size=1000,\n", - " chunk_overlap=200\n", - ")\n", - "```\n", - "\n", - "This method is faster and more predictable but may split text in the middle of a sentence or paragraph, which can hurt performance in downstream tasks like retrieval or QA.\n", - "\n", - "---\n", - "\n", - "#### Comparison Table\n", - "\n", - "| Feature | RecursiveCharacterTextSplitter | CharacterTextSplitter |\n", - "| ------------------------------ | ------------------------------ | ------------------------- |\n", - "| Structure-aware splitting | Yes | No |\n", - "| Preserves sentences/paragraphs | Yes | No |\n", - "| Risk of splitting mid-sentence | Minimal | High |\n", - "| Ideal for RAG/document QA | Highly recommended | Only if structured text |\n", - "| Performance speed | Slightly slower | Faster |\n", - "\n", - "---\n", - "\n", - "#### Recommendation\n", - "\n", - "Use `RecursiveCharacterTextSplitter` for most real-world document processing tasks, especially when building RAG pipelines or 
working with structured natural language content like PDFs or articles." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Best Practices for Choosing Chunk Size in RAG\n", - "\n", - "| Factor | Recommendation |\n", - "| ---------------------------| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n", - "| **LLM context limit** | Choose a chunk size that lets you retrieve multiple chunks **without exceeding the model’s token limit**. For example, GPT-4o supports 128k tokens, but with GPT-3.5 (16k) or GPT-4 (32k), keep it modest. |\n", - "| **Chunk size (in characters)** | Typically: **500–1,000 characters** per chunk → ~125–250 tokens. This fits well for retrieval + prompt without context overflow. |\n", - "| **Chunk size (in tokens)** | If using a token-based splitter (e.g. `TokenTextSplitter`): aim for **100–300 tokens** per chunk. |\n", - "| **Chunk overlap** | Use **overlap of 10–30%** (e.g., 100–300 characters or ~50 tokens) to preserve context across chunk boundaries and avoid cutting off important ideas mid-sentence. |\n", - "| **Document structure** | Use **`RecursiveCharacterTextSplitter`** to preserve semantic boundaries (paragraphs, sentences) instead of arbitrary cuts. |\n", - "| **Task type** | For **question answering**, smaller chunks (~500–800 chars) reduce noise.
For **summarization**, slightly larger chunks (~1000–1500) are OK. |\n", - "| **Embedding model** | Some models (e.g., `text-embedding-3-large`) can handle long input. But still, smaller chunks give **finer-grained retrieval**, which improves relevance. |\n", - "| **Query type** | If users ask **very specific questions**, small focused chunks are better. For broader queries, bigger chunks might help. |\n", - "\n", - "\n", - "### Rule of Thumb\n", - "\n", - "| Use Case | Chunk Size | Overlap |\n", - "| ------------------------| --------------- | ------- |\n", - "| Factual Q&A | 500–800 chars | 100–200 |\n", - "| Summarization | 1000–1500 chars | 200–300 |\n", - "| Technical documents | 400–700 chars | 100–200 |\n", - "| Long reports/books | 800–1200 chars | 200–300 |\n", - "| Small LLMs (≤16k tokens) | ≤800 chars | 100–200 |\n", - "\n", - "\n", - "### Avoid\n", - "\n", - "- Chunks >2000 characters: risks context overflow.\n", - "- No overlap: may lose key information between chunks.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Mg15RjVPxYZw" - }, - "source": [ - "
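The chunking guidance above can be sanity-checked with a little arithmetic: with chunk size C and overlap O, each chunk after the first adds C - O fresh characters, so an N-character document splits into roughly ceil((N - O) / (C - O)) chunks. A minimal pure-Python sketch (the 120,000-character document length is a made-up example; real splitters that respect paragraph boundaries will produce somewhat more chunks):

```python
import math

def estimate_chunk_count(n_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Rough lower bound on the chunks a character-based splitter produces."""
    step = chunk_size - chunk_overlap  # fresh characters contributed per chunk
    return max(1, math.ceil((n_chars - chunk_overlap) / step))

# A hypothetical 120,000-character book with the lab's settings (1000/200):
print(estimate_chunk_count(120_000, chunk_size=1000, chunk_overlap=200))  # 150
```

Comparing this estimate against `len(chunks)` is a quick way to spot a misconfigured splitter.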

Embeddings

\n", - "\n", - "Embeddings transform text into dense vector representations, capturing semantic meaning and contextual relationships. They are essential for efficient document retrieval and similarity analysis.\n", - "\n", - "- **What are OpenAI Embeddings?**\n", - " - Pre-trained embeddings like `text-embedding-3-large` generate high-quality vector representations for text.\n", - " - Encapsulate semantic relationships in the text, enabling robust NLP applications.\n", - "\n", - "- **Key Features of `text-embedding-3-large`:**\n", - " - Large-scale embedding model optimized for accuracy and versatility.\n", - " - Handles diverse NLP tasks, including retrieval, classification, and clustering.\n", - " - Ideal for applications with high-performance requirements.\n", - "\n", - "- **Benefits:**\n", - " - Reduces the need for extensive custom training.\n", - " - Provides state-of-the-art performance in retrieval-augmented systems.\n", - " - Compatible with RAGs to create powerful context-aware models.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "L0xDxElwxYZw" - }, - "outputs": [], - "source": [ - "from langchain.embeddings import OpenAIEmbeddings\n", - "from dotenv import load_dotenv" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "_WRIo3_0xYZx", - "outputId": "78bfbbf3-9d25-4e31-bdbc-3e932e6bbfec" - }, - "outputs": [], - "source": [ - "load_dotenv()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "MNZfTng5xYZz", - "outputId": "db1a7c85-ef9f-447e-92cd-9d097e959847" - }, - "outputs": [], - "source": [ - "api_key = os.getenv(\"OPENAI_API_KEY\")\n", - "embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "EsSA7RKvxYZz" - }, - "source": [ - "

ChromaDB

\n", - "\n", - "ChromaDB is a versatile vector database designed for efficiently storing and retrieving embeddings. It integrates seamlessly with embedding models to enable high-performance similarity search and context-based retrieval.\n", - "\n", - "### Workflow Overview:\n", - "- **Step 1:** Generate embeddings using a pre-trained model (e.g., OpenAI's `text-embedding-3-large`).\n", - "- **Step 2:** Store the embeddings in ChromaDB for efficient retrieval and similarity calculations.\n", - "- **Step 3:** Use the stored embeddings to perform searches, matching, or context-based retrieval.\n", - "\n", - "### Key Features of ChromaDB:\n", - "- **Scalability:** Handles large-scale datasets with optimized indexing and search capabilities.\n", - "- **Speed:** Provides fast and accurate retrieval of embeddings for real-time applications.\n", - "- **Integration:** Supports integration with popular frameworks and libraries for embedding generation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "brKe6wUgxYZ0" - }, - "outputs": [], - "source": [ - "from langchain.vectorstores import Chroma" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "VkjHR-RkxYZ0", - "outputId": "bc11bda9-f283-457a-f584-5a06b95c4dd9" - }, - "outputs": [], - "source": [ - "db = Chroma.from_documents(chunks, embeddings, persist_directory=\"./chroma_db_LAB\")\n", - "print(\"ChromaDB created with document embeddings.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "27OdN1IVxYZ1" - }, - "source": [ - "

Retrieving Documents

\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercice1: Write a user question that someone might ask about your book’s topic or content." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "XiLv-TfrxYZ1" - }, - "outputs": [], - "source": [ - "user_question = \"\" # User question\n", - "retrieved_docs = db.similarity_search(user_question, k=10) # k is the number of documents to retrieve" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "qgWsh50JxYZ1", - "outputId": "c8640c5d-5955-471f-fdd2-37096f5f68c7" - }, - "outputs": [], - "source": [ - "# Display top results\n", - "for i, doc in enumerate(retrieved_docs[:3]): # Display top 3 results\n", - " print(f\"Document {i+1}:\\n{doc.page_content[36:1000]}\") # Display content" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "XuGK8gL6xYZ1" - }, - "source": [ - "

Preparing Content for GenAI

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2iB3lZqHxYZ2" - }, - "outputs": [], - "source": [ - "def _get_document_prompt(docs):\n", - " prompt = \"\\n\"\n", - " for doc in docs:\n", - " prompt += \"\\nContent:\\n\"\n", - " prompt += doc.page_content + \"\\n\\n\"\n", - " return prompt" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2okzmuADxYZ2", - "outputId": "0aa6cdca-188d-40e0-f5b4-8888d3549ea4" - }, - "outputs": [], - "source": [ - "# Generate a formatted context from the retrieved documents\n", - "formatted_context = _get_document_prompt(retrieved_docs)\n", - "print(\"Context formatted for GPT model.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qzIczQNTxYZ2" - }, - "source": [ - "

ChatBot Architecture

" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercice2: Write a prompt that is relevant and tailored to the content and style of your book." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "tqxVh9s3xYZ3", - "outputId": "97cca95d-4ab3-44d8-a76c-5713aad387d8" - }, - "outputs": [], - "source": [ - "prompt = f\"\"\"\n", - "\n", - "\n", - "\"\"\"\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "0mjkQJ_ZxYZ3" - }, - "outputs": [], - "source": [ - "import openai" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercice3: Tune parameters like temperature, and penalties to control how creative, focused, or varied the model's responses are." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ylypRWRlxYZ4" - }, - "outputs": [], - "source": [ - "# Set up GPT client and parameters\n", - "client = openai.OpenAI()\n", - "model_params = {\n", - " 'model': 'gpt-4o',\n", - " 'temperature': , # Increase creativity\n", - " 'max_tokens': , # Allow for longer responses\n", - " 'top_p': , # Use nucleus sampling\n", - " 'frequency_penalty': , # Reduce repetition\n", - " 'presence_penalty': # Encourage new topics\n", - "}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "C8e942xDxYZ4" - }, - "source": [ - "

Response

\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "4eXZO4pIxYZ4" - }, - "outputs": [], - "source": [ - "messages = [{'role': 'user', 'content': prompt}]\n", - "completion = client.chat.completions.create(messages=messages, **model_params, timeout=120)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "wLPAcchBxYZ5", - "outputId": "976c7800-16ed-41fe-c4cf-58f60d3230d2" - }, - "outputs": [], - "source": [ - "answer = completion.choices[0].message.content\n", - "print(answer)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "VXVNXPwLxYaT" - }, - "source": [ - "\"NLP" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ldybhlqKxYaT" - }, - "source": [ - "

Cosine Similarity

\n", - "\n", - "**Cosine similarity** is a metric used to measure the alignment or similarity between two vectors, calculated as the cosine of the angle between them. It is the **most common metric used in RAG pipelines** for vector retrieval.. It provides a scale from -1 to 1:\n", - "\n", - "- **-1**: Vectors are completely opposite.\n", - "- **0**: Vectors are orthogonal (uncorrelated or unrelated).\n", - "- **1**: Vectors are identical.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1c1I1TNhxYaT" - }, - "source": [ - "\"NLP" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "EoEMdNgQxYaU" - }, - "source": [ - "

Keyword Highlighting

\n", - "\n", - "Highlighting important keywords helps users quickly understand the relevance of the retrieved text to their query." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "nCXL9Cz1xYaV" - }, - "outputs": [], - "source": [ - "from termcolor import colored" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xwDyofY0xYaV" - }, - "source": [ - "The `highlight_keywords` function is designed to highlight specific keywords within a given text. It replaces each keyword in the text with a highlighted version using the `colored` function from the `termcolor` library.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "9y3E0YWExYaV" - }, - "outputs": [], - "source": [ - "def highlight_keywords(text, keywords):\n", - " for keyword in keywords:\n", - " text = text.replace(keyword, colored(keyword, 'green', attrs=['bold']))\n", - " return text" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercice4: add your keywords" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "i7SkWPpnxYaW", - "outputId": "28e82563-edba-4b41-acad-ec27e5ba134f" - }, - "outputs": [], - "source": [ - "query_keywords = [] # add your keywords\n", - "for i, doc in enumerate(retrieved_docs[:1]):\n", - " snippet = doc.page_content[:200]\n", - " highlighted = highlight_keywords(snippet, query_keywords)\n", - " print(f\"Snippet {i+1}:\\n{highlighted}\\n{'-'*80}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "AhV_Jf_LxYaX" - }, - "source": [ - "1. `query_keywords` is a list of keywords to be highlighted.\n", - "2. The loop iterates over the first document in retrieved_docs.\n", - "3. For each document, a snippet of the first 200 characters is extracted.\n", - "4. The highlight_keywords function is called to highlight the keywords in the snippet.\n", - "5. The highlighted snippet is printed along with a separator line." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pBRKysAvxYaX" - }, - "source": [ - "

Bonus

" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Qj25lCybxYaX" - }, - "source": [ - "**Try loading one of your own PDF books and go through the steps again to explore how the pipeline works with your content**:\n" - ] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "llm", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.10" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/your-code/rag lab.ipynb b/your-code/rag lab.ipynb new file mode 100644 index 0000000..bfa9db3 --- /dev/null +++ b/your-code/rag lab.ipynb @@ -0,0 +1,1026 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "RsnCPbdkxYZd" + }, + "source": [ + "
\n", + "

Self-Guided Lab: Retrieval-Augmented Generation (RAGs)

\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tZp4BQAVxYZj" + }, + "source": [ + "
\n", + " \"NLP\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gizk6HCYxYZo" + }, + "source": [ + "

Data Storage & Retrieval

\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QW5UOI8ZxYZp" + }, + "source": [ + "

PyPDFLoader

\n", + "\n", + "`PyPDFLoader` is a lightweight Python library designed to streamline the process of loading and parsing PDF documents for text processing tasks. It is particularly useful in Retrieval-Augmented Generation workflows where text extraction from PDFs is required.\n", + "\n", + "- **What Does PyPDFLoader Do?**\n", + " - Extracts text from PDF files, retaining formatting and layout.\n", + " - Simplifies the preprocessing of document-based datasets.\n", + " - Supports efficient and scalable loading of large PDF collections.\n", + "\n", + "- **Key Features:**\n", + " - Compatible with popular NLP libraries and frameworks.\n", + " - Handles multi-page PDFs and embedded images (e.g., OCR-compatible setups).\n", + " - Provides flexible configurations for structured text extraction.\n", + "\n", + "- **Use Cases:**\n", + " - Preparing PDF documents for retrieval-based systems in RAGs.\n", + " - Automating the text extraction pipeline for document analysis.\n", + " - Creating datasets from academic papers, technical manuals, and reports.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: langchain in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (1.2.7)\n", + "Requirement already satisfied: langchain_community in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (0.4.1)\n", + "Requirement already satisfied: pypdf in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (6.6.0)\n", + "Requirement already satisfied: langchain-core<2.0.0,>=1.2.7 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain) (1.2.7)\n", + "Requirement already satisfied: langgraph<1.1.0,>=1.0.7 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain) (1.0.7)\n", + 
"Requirement already satisfied: pydantic<3.0.0,>=2.7.4 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain) (2.12.5)\n", + "Requirement already satisfied: langchain-classic<2.0.0,>=1.0.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (1.0.1)\n", + "Requirement already satisfied: SQLAlchemy<3.0.0,>=1.4.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (2.0.46)\n", + "Requirement already satisfied: requests<3.0.0,>=2.32.5 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (2.32.5)\n", + "Requirement already satisfied: PyYAML<7.0.0,>=5.3.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (6.0.3)\n", + "Requirement already satisfied: aiohttp<4.0.0,>=3.8.3 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (3.13.3)\n", + "Requirement already satisfied: tenacity!=8.4.0,<10.0.0,>=8.1.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (9.1.2)\n", + "Requirement already satisfied: dataclasses-json<0.7.0,>=0.6.7 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (0.6.7)\n", + "Requirement already satisfied: pydantic-settings<3.0.0,>=2.10.1 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (2.12.0)\n", + "Requirement already satisfied: langsmith<1.0.0,>=0.1.125 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (0.6.4)\n", + "Requirement already satisfied: httpx-sse<1.0.0,>=0.4.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from 
langchain_community) (0.4.3)\n", + "Requirement already satisfied: numpy>=1.26.2 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (2.3.3)\n", + "Requirement already satisfied: aiohappyeyeballs>=2.5.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (2.6.1)\n", + "Requirement already satisfied: aiosignal>=1.4.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (1.4.0)\n", + "Requirement already satisfied: attrs>=17.3.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (25.4.0)\n", + "Requirement already satisfied: frozenlist>=1.1.1 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (1.8.0)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (6.7.0)\n", + "Requirement already satisfied: propcache>=0.2.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (0.4.1)\n", + "Requirement already satisfied: yarl<2.0,>=1.17.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (1.22.0)\n", + "Requirement already satisfied: marshmallow<4.0.0,>=3.18.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from dataclasses-json<0.7.0,>=0.6.7->langchain_community) (3.26.2)\n", + "Requirement already satisfied: typing-inspect<1,>=0.4.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from 
dataclasses-json<0.7.0,>=0.6.7->langchain_community) (0.9.0)\n", + "Requirement already satisfied: langchain-text-splitters<2.0.0,>=1.1.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain-classic<2.0.0,>=1.0.0->langchain_community) (1.1.0)\n", + "Requirement already satisfied: jsonpatch<2.0.0,>=1.33.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain-core<2.0.0,>=1.2.7->langchain) (1.33)\n", + "Requirement already satisfied: packaging<26.0.0,>=23.2.0 in c:\\users\\parte\\appdata\\roaming\\python\\python311\\site-packages (from langchain-core<2.0.0,>=1.2.7->langchain) (25.0)\n", + "Requirement already satisfied: typing-extensions<5.0.0,>=4.7.0 in c:\\users\\parte\\appdata\\roaming\\python\\python311\\site-packages (from langchain-core<2.0.0,>=1.2.7->langchain) (4.15.0)\n", + "Requirement already satisfied: uuid-utils<1.0,>=0.12.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain-core<2.0.0,>=1.2.7->langchain) (0.14.0)\n", + "Requirement already satisfied: langgraph-checkpoint<5.0.0,>=2.1.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langgraph<1.1.0,>=1.0.7->langchain) (4.0.0)\n", + "Requirement already satisfied: langgraph-prebuilt<1.1.0,>=1.0.7 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langgraph<1.1.0,>=1.0.7->langchain) (1.0.7)\n", + "Requirement already satisfied: langgraph-sdk<0.4.0,>=0.3.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langgraph<1.1.0,>=1.0.7->langchain) (0.3.3)\n", + "Requirement already satisfied: xxhash>=3.5.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langgraph<1.1.0,>=1.0.7->langchain) (3.6.0)\n", + "Requirement already satisfied: httpx<1,>=0.23.0 in 
c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langsmith<1.0.0,>=0.1.125->langchain_community) (0.28.1)\n", + "Requirement already satisfied: orjson>=3.9.14 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langsmith<1.0.0,>=0.1.125->langchain_community) (3.11.5)\n", + "Requirement already satisfied: requests-toolbelt>=1.0.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langsmith<1.0.0,>=0.1.125->langchain_community) (1.0.0)\n", + "Requirement already satisfied: zstandard>=0.23.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langsmith<1.0.0,>=0.1.125->langchain_community) (0.25.0)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from pydantic<3.0.0,>=2.7.4->langchain) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.41.5 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from pydantic<3.0.0,>=2.7.4->langchain) (2.41.5)\n", + "Requirement already satisfied: typing-inspection>=0.4.2 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from pydantic<3.0.0,>=2.7.4->langchain) (0.4.2)\n", + "Requirement already satisfied: python-dotenv>=0.21.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from pydantic-settings<3.0.0,>=2.10.1->langchain_community) (1.2.1)\n", + "Requirement already satisfied: charset_normalizer<4,>=2 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from requests<3.0.0,>=2.32.5->langchain_community) (3.4.4)\n", + "Requirement already satisfied: idna<4,>=2.5 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from requests<3.0.0,>=2.32.5->langchain_community) (3.11)\n", + "Requirement already 
satisfied: urllib3<3,>=1.21.1 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from requests<3.0.0,>=2.32.5->langchain_community) (2.5.0)\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n", + "[notice] A new release of pip available: 22.3.1 -> 25.3\n", + "[notice] To update, run: c:\\Users\\parte\\AppData\\Local\\Programs\\Python\\Python311\\python.exe -m pip install --upgrade pip\n" + ] + } + ], + "source": [ + "%pip install langchain langchain_community pypdf\n", + "%pip install termcolor langchain_openai langchain-huggingface sentence-transformers chromadb langchain_chroma tiktoken openai python-dotenv\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:From c:\\Users\\parte\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\tf_keras\\src\\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.\n", + "\n" + ] + } + ], + "source": [ + "import os\n", + "import warnings\n", + "warnings.filterwarnings(\"ignore\")\n", + "\n", + "from langchain_community.document_loaders import PyPDFLoader\n", + "from langchain_text_splitters import (\n", + "    CharacterTextSplitter,\n", + "    RecursiveCharacterTextSplitter\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sRS44B2XxYZs", + "vscode": { + "languageId": "plaintext" + } + }, + "source": [ + "

Loading the Documents

" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "cuREtJRixYZt" + }, + "outputs": [], + "source": [ + "# File path for the document\n", + "\n", + "file_path = r\"C:\\week18\\lab-intro-rag\\ai-for-everyone.pdf\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pz_8SOLxxYZt" + }, + "source": [ + "

Documents into pages

\n", + "\n", + "The `PyPDFLoader` library allows efficient loading and splitting of PDF documents into smaller, manageable parts for NLP tasks.\n", + "\n", + "This functionality is particularly useful in workflows requiring granular text processing, such as Retrieval-Augmented Generation (RAG).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "310" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_community.document_loaders import PyPDFLoader\n", + "\n", + "loader = PyPDFLoader(file_path)\n", + "pages = loader.load() # already split by page\n", + "\n", + "len(pages)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wt50NRQaxYZv" + }, + "source": [ + "
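To make the page objects concrete without the PDF at hand, here is a plain-Python sketch of the shape that `loader.load()` returns: one document per page, each carrying `page_content` plus metadata such as the source path and page number. `SimpleDoc` is a hypothetical stand-in; LangChain's real `Document` class has more fields.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDoc:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# What loader.load() conceptually produces: one document per PDF page.
pages = [
    SimpleDoc("AI is a general-purpose technology...",
              {"source": "ai-for-everyone.pdf", "page": 0}),
    SimpleDoc("What machine learning can and cannot do...",
              {"source": "ai-for-everyone.pdf", "page": 1}),
]

print(len(pages))                 # number of pages loaded
print(pages[0].metadata["page"])  # page numbers start at 0
```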

Pages into Chunks

\n", + "\n", + "\n", + "#### RecursiveCharacterTextSplitter in LangChain\n", + "\n", + "The `RecursiveCharacterTextSplitter` is the **recommended splitter** in LangChain when you want to break down long documents into smaller, semantically meaningful chunks — especially useful in **RAG pipelines**, where clean context chunks lead to better LLM responses.\n", + "\n", + "#### Parameters\n", + "\n", + "| Parameter | Description |\n", + "|-----------------|-----------------------------------------------------------------------------|\n", + "| `chunk_size` | The **maximum number of characters** allowed in a chunk (e.g., `1000`). |\n", + "| `chunk_overlap` | The number of **overlapping characters** between consecutive chunks (e.g., `200`). This helps preserve context continuity. |\n", + "\n", + "#### How it works\n", + "`RecursiveCharacterTextSplitter` attempts to split the text **intelligently**, trying the following separators in order:\n", + "1. Paragraphs (`\"\\n\\n\"`)\n", + "2. Lines (`\"\\n\"`)\n", + "3. Sentences or words (`\" \"`)\n", + "4. 
Individual characters (as a last resort)\n", + "\n", + "This makes it ideal for handling **natural language documents**, such as PDFs, articles, or long reports, without breaking sentences or paragraphs in awkward ways.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1096" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size=1000,\n", + " chunk_overlap=200\n", + ")\n", + "chunks = text_splitter.split_documents(pages)\n", + "\n", + "len(chunks)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Alternative: CharacterTextSplitter\n", + "\n", + "`CharacterTextSplitter` is a simpler splitter that breaks text into chunks based **purely on character count**, without trying to preserve any natural language structure.\n", + "\n", + "##### Example:\n", + "```python\n", + "from langchain.text_splitter import CharacterTextSplitter\n", + "\n", + "text_splitter = CharacterTextSplitter(\n", + " chunk_size=1000,\n", + " chunk_overlap=200\n", + ")\n", + "````\n", + "\n", + "This method is faster and more predictable but may split text in the middle of a sentence or paragraph, which can hurt performance in downstream tasks like retrieval or QA.\n", + "\n", + "---\n", + "\n", + "#### Comparison Table\n", + "\n", + "| Feature | RecursiveCharacterTextSplitter | CharacterTextSplitter |\n", + "| ------------------------------ | ------------------------------ | ------------------------- |\n", + "| Structure-aware splitting | Yes | No |\n", + "| Preserves sentence/paragraphs | Yes | No |\n", + "| Risk of splitting mid-sentence | Minimal | High |\n", + "| Ideal for RAG/document QA | Highly recommended | Only if structured text |\n", + "| Performance speed | Slightly slower | Faster |\n", + "\n", + "---\n", + "\n", + "#### Recommendation\n", + 
"\n", + "Use `RecursiveCharacterTextSplitter` for most real-world document processing tasks, especially when building RAG pipelines or working with structured natural language content like PDFs or articles." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Best Practices for Choosing Chunk Size in RAG\n", + "\n", + "### Best Practices for Chunk Size in RAG\n", + "\n", + "| Factor | Recommendation |\n", + "| ---------------------------| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n", + "| **LLM context limit** | Choose a chunk size that lets you retrieve multiple chunks **without exceeding the model’s token limit**. For example, GPT-4o supports 128k tokens, but with GPT-3.5 (16k) or GPT-4 (32k), keep it modest. |\n", + "| **Chunk size (in characters)** | Typically: **500–1,000 characters** per chunk → ~75–200 tokens. This fits well for retrieval + prompt without context overflow. |\n", + "| **Chunk size (in tokens)** | If using token-based splitter (e.g. `TokenTextSplitter`): aim for **100–300 tokens** per chunk. |\n", + "| **Chunk overlap** | Use **overlap of 10–30%** (e.g., 100–300 characters or ~50 tokens) to preserve context across chunk boundaries and avoid cutting off important ideas mid-sentence. |\n", + "| **Document structure** | Use **`RecursiveCharacterTextSplitter`** to preserve semantic boundaries (paragraphs, sentences) instead of arbitrary cuts. |\n", + "| **Task type** | For **question answering**, smaller chunks (~500–800 chars) reduce noise.
For **summarization**, slightly larger chunks (~1000–1500) are OK. |\n", + "| **Embedding model** | Some models (e.g., `text-embedding-3-large`) can handle long input. But still, smaller chunks give **finer-grained retrieval**, which improves relevance. |\n", + "| **Query type** | If users ask **very specific questions**, small focused chunks are better. For broader queries, bigger chunks might help. |\n", + "\n", + "\n", + "### Rule of Thumb\n", + "\n", + "| Use Case | Chunk Size | Overlap |\n", + "| ------------------------| --------------- | ------- |\n", + "| Factual Q&A | 500–800 chars | 100–200 |\n", + "| Summarization | 1000–1500 chars | 200–300 |\n", + "| Technical documents | 400–700 chars | 100–200 |\n", + "| Long reports/books | 800–1200 chars | 200–300 |\n", + "| Small LLMs (≤16k tokens) | ≤800 chars | 100–200 |\n", + "\n", + "\n", + "### Avoid\n", + "\n", + "- Chunks >2000 characters: risks context overflow.\n", + "- No overlap: may lose key information between chunks.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mg15RjVPxYZw" + }, + "source": [ + "
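The separator hierarchy described above can be sketched in a few lines of plain Python. This is a simplified illustration of the recursive idea only — it ignores `chunk_overlap` and is not LangChain's actual implementation:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch of recursive splitting: try coarse separators
    first, fall back to finer ones only for oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # This piece alone is too big: recurse with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks

text = (
    "Paragraph one is short.\n\n"
    "Paragraph two rambles on for quite a while longer than the limit allows.\n\n"
    "Paragraph three."
)
chunks = recursive_split(text, chunk_size=60)
print(chunks)
```

Note how the oversized middle paragraph is the only one broken at word boundaries; the short paragraphs survive intact, which is exactly why this strategy beats a blind character cut for retrieval.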

Embeddings

\n", + "\n", + "Embeddings transform text into dense vector representations, capturing semantic meaning and contextual relationships. They are essential for efficient document retrieval and similarity analysis.\n", + "\n", + "- **What are OpenAI Embeddings?**\n", + " - Pre-trained embeddings like `text-embedding-3-large` generate high-quality vector representations for text.\n", + " - Encapsulate semantic relationships in the text, enabling robust NLP applications.\n", + "\n", + "- **Key Features of `text-embedding-3-large`:**\n", + " - Large-scale embedding model optimized for accuracy and versatility.\n", + " - Handles diverse NLP tasks, including retrieval, classification, and clustering.\n", + " - Ideal for applications with high-performance requirements.\n", + "\n", + "- **Benefits:**\n", + " - Reduces the need for extensive custom training.\n", + " - Provides state-of-the-art performance in retrieval-augmented systems.\n", + " - Compatible with RAGs to create powerful context-aware models.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_openai import OpenAIEmbeddings\n", + "from dotenv import load_dotenv\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "id": "_WRIo3_0xYZx", + "outputId": "78bfbbf3-9d25-4e31-bdbc-3e932e6bbfec" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "load_dotenv()" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "MNZfTng5xYZz", + "outputId": "db1a7c85-ef9f-447e-92cd-9d097e959847" + }, + "outputs": [], + "source": [ + "api_key = os.getenv(\"OPENAI_API_KEY\")\n", + "embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EsSA7RKvxYZz" + }, + "source": [ + "
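A quick intuition for what these vectors buy you: retrieval compares embeddings with cosine similarity, so semantically related texts score close to 1 and unrelated ones near 0. A toy example with hand-made 3-dimensional vectors (real `text-embedding-3-large` vectors have 3072 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors, purely for illustration.
v_cat    = [0.9, 0.1, 0.0]
v_kitten = [0.8, 0.2, 0.1]
v_car    = [0.0, 0.1, 0.9]

print(cosine_similarity(v_cat, v_kitten))  # high: related meanings
print(cosine_similarity(v_cat, v_car))     # low: unrelated meanings
```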

ChromaDB

\n", + "\n", + "ChromaDB is a versatile vector database designed for efficiently storing and retrieving embeddings. It integrates seamlessly with embedding models to enable high-performance similarity search and context-based retrieval.\n", + "\n", + "### Workflow Overview:\n", + "- **Step 1:** Generate embeddings using a pre-trained model (e.g., OpenAI's `text-embedding-3-large`).\n", + "- **Step 2:** Store the embeddings in ChromaDB for efficient retrieval and similarity calculations.\n", + "- **Step 3:** Use the stored embeddings to perform searches, matching, or context-based retrieval.\n", + "\n", + "### Key Features of ChromaDB:\n", + "- **Scalability:** Handles large-scale datasets with optimized indexing and search capabilities.\n", + "- **Speed:** Provides fast and accurate retrieval of embeddings for real-time applications.\n", + "- **Integration:** Supports integration with popular frameworks and libraries for embedding generation." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_community.vectorstores import Chroma\n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "id": "VkjHR-RkxYZ0", + "outputId": "bc11bda9-f283-457a-f584-5a06b95c4dd9" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ChromaDB created with document embeddings.\n" + ] + } + ], + "source": [ + "db = Chroma.from_documents(chunks, embeddings, persist_directory=\"./chroma_db_LAB\")\n", + "print(\"ChromaDB created with document embeddings.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "27OdN1IVxYZ1" + }, + "source": [ + "
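Conceptually, `Chroma.from_documents` embeds every chunk and indexes the vectors, and `similarity_search` later embeds the query and returns the k nearest chunks. The miniature in-memory version below mirrors that flow; word-count vectors stand in for real embeddings, and `ToyVectorStore` is a hypothetical class, not part of ChromaDB:

```python
import math
from collections import Counter

class ToyVectorStore:
    """In-memory sketch of a vector store: embed on add, rank on search."""

    def __init__(self):
        self.docs = []  # list of (text, word-count vector)

    @staticmethod
    def _embed(text):
        # Stand-in embedding: bag-of-words counts.
        return Counter(text.lower().split())

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def add(self, texts):
        for t in texts:
            self.docs.append((t, self._embed(t)))

    def similarity_search(self, query, k=3):
        q = self._embed(query)
        ranked = sorted(self.docs, key=lambda d: self._cosine(q, d[1]), reverse=True)
        return [t for t, _ in ranked[:k]]

store = ToyVectorStore()
store.add([
    "Machine learning models learn patterns from data.",
    "The recipe calls for two cups of flour.",
    "Neural networks are a family of machine learning models.",
])
results = store.similarity_search("what is machine learning", k=2)
print(results)
```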

Retrieving Documents

\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 1: Write a user question that someone might ask about your book’s topic or content." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "id": "XiLv-TfrxYZ1" + }, + "outputs": [], + "source": [ + "user_question = \"\" # User question\n", + "retrieved_docs = db.similarity_search(user_question, k=10) # k is the number of documents to retrieve" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "id": "qgWsh50JxYZ1", + "outputId": "c8640c5d-5955-471f-fdd2-37096f5f68c7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document 1:\n", + "f Human Communication. Palo Alto, CA: \n", + "Science and Behavior Books.\n", + "Weizenbaum, J. 1976. Computer Power and Human Reason: From Judgment to \n", + "Calculation. San Francisco: W . H. Freeman.\n", + "Document 2:\n", + "– and when the difference between human \n", + "and machine is affirmed at the cost of their unity that is negated – done so by \n", + "disconnections. The way out is the establishment of a relation through affirm-\n", + "ing both the identity of, and the difference between, the two sides – as done by\n", + "Document 3:\n", + "ne, Not a Camera: How Financial Models Shape \n", + "Markets. (1st edn.). Cambridge, MA: The MIT Press.\n", + "Malik, M. M. 2020. A Hierarchy of Limitations in Machine Learning. \n", + "ArXiv:2002.05193 [Cs, Econ, Math, Stat] , February. http://arxiv.org \n", + "/abs/2002.05193.\n", + "Marcus, G. 2018. Deep Learning: A Critical Appraisal. ArXiv:1801.00631 [Cs, \n", + "Stat], January. http://arxiv.org/abs/1801.00631.\n", + "McQuillan, D. 2015. Algorithmic States of Exception. European Journal \n", + "of Cultural Studies 18 (4–5), 564–576. DOI: https://doi.org/10.1177 \n", + "/1367549415577389.\n", + "McQuillan, D. 2017. Data Science as Machinic Neoplatonism. Philosophy & \n", + "Technolog y, August, 1–20. 
DOI: https://doi.org/10.1007/s13347-017-0273-3.\n", + "McQuillan, D. 2018. People’s Councils for Ethical Machine Learning. Social \n", + "Media + Society 4 (2). DOI: https://doi.org/10.1177/2056305118768303.\n", + "Mitchell, A. 2015. Posthumanist Post-Colonialism? Worldly (blog). 26 Feb -\n" + ] + } + ], + "source": [ + "# Display top results\n", + "for i, doc in enumerate(retrieved_docs[:3]): # Display top 3 results\n", + "    print(f\"Document {i+1}:\\n{doc.page_content[:1000]}\") # Display the first 1,000 characters" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XuGK8gL6xYZ1" + }, + "source": [ + "

Preparing Content for GenAI

" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "id": "2iB3lZqHxYZ2" + }, + "outputs": [], + "source": [ + "def _get_document_prompt(docs):\n", + " prompt = \"\\n\"\n", + " for doc in docs:\n", + " prompt += \"\\nContent:\\n\"\n", + " prompt += doc.page_content + \"\\n\\n\"\n", + " return prompt" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "id": "2okzmuADxYZ2", + "outputId": "0aa6cdca-188d-40e0-f5b4-8888d3549ea4" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Context formatted for GPT model.\n" + ] + } + ], + "source": [ + "# Generate a formatted context from the retrieved documents\n", + "formatted_context = _get_document_prompt(retrieved_docs)\n", + "print(\"Context formatted for GPT model.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qzIczQNTxYZ2" + }, + "source": [ + "

ChatBot Architecture

" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercice2: Write a prompt that is relevant and tailored to the content and style of your book." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "prompt = f\"\"\"\n", + "You are an AI assistant designed to help users understand the content of this book.\n", + "\n", + "Use ONLY the information provided in the retrieved context to answer the question.\n", + "Do not use external knowledge or make assumptions.\n", + "\n", + "If the answer is not clearly stated in the context, say:\n", + "\"The information is not available in the provided document.\"\n", + "\n", + "Keep your answer:\n", + "- Clear\n", + "- Concise\n", + "- Technically accurate\n", + "- Easy to understand for students\n", + "\n", + "Context:\n", + "{{context}}\n", + "\n", + "Question:\n", + "{{question}}\n", + "\n", + "Answer:\n", + "\"\"\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "id": "0mjkQJ_ZxYZ3" + }, + "outputs": [], + "source": [ + "import openai" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercice3: Tune parameters like temperature, and penalties to control how creative, focused, or varied the model's responses are." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "# Set up GPT client and parameters\n", + "client = openai.OpenAI()\n", + "\n", + "model_params = {\n", + " \"model\": \"gpt-4o\",\n", + " \"temperature\": 0.2, # Low creativity → factual, consistent answers\n", + " \"max_tokens\": 800, # Enough for detailed but controlled responses\n", + " \"top_p\": 0.9, # Balanced nucleus sampling\n", + " \"frequency_penalty\": 0.1, # Slightly reduce repetition\n", + " \"presence_penalty\": 0.0 # Do NOT encourage new topics in RAG\n", + "}\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C8e942xDxYZ4" + }, + "source": [ + "

Response

\n" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "id": "4eXZO4pIxYZ4" + }, + "outputs": [], + "source": [ + "messages = [{'role': 'user', 'content': prompt}]\n", + "completion = client.chat.completions.create(messages=messages, **model_params, timeout=120)" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "id": "wLPAcchBxYZ5", + "outputId": "976c7800-16ed-41fe-c4cf-58f60d3230d2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "I'm sorry, but I need the context to answer your question. Please provide the relevant text or information from the book so I can assist you effectively.\n" + ] + } + ], + "source": [ + "answer = completion.choices[0].message.content\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VXVNXPwLxYaT" + }, + "source": [ + "\"NLP" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ldybhlqKxYaT" + }, + "source": [ + "

Cosine Similarity

\n", + "\n", + "**Cosine similarity** is a metric used to measure the alignment or similarity between two vectors, calculated as the cosine of the angle between them. It is the **most common metric used in RAG pipelines** for vector retrieval.. It provides a scale from -1 to 1:\n", + "\n", + "- **-1**: Vectors are completely opposite.\n", + "- **0**: Vectors are orthogonal (uncorrelated or unrelated).\n", + "- **1**: Vectors are identical.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1c1I1TNhxYaT" + }, + "source": [ + "\"NLP" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EoEMdNgQxYaU" + }, + "source": [ + "

Keyword Highlighting

\n", + "\n", + "Highlighting important keywords helps users quickly understand the relevance of the retrieved text to their query." + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": { + "id": "nCXL9Cz1xYaV" + }, + "outputs": [], + "source": [ + "from termcolor import colored" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xwDyofY0xYaV" + }, + "source": [ + "The `highlight_keywords` function is designed to highlight specific keywords within a given text. It replaces each keyword in the text with a highlighted version using the `colored` function from the `termcolor` library.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "id": "9y3E0YWExYaV" + }, + "outputs": [], + "source": [ + "def highlight_keywords(text, keywords):\n", + " for keyword in keywords:\n", + " text = text.replace(keyword, colored(keyword, 'green', attrs=['bold']))\n", + " return text" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercice4: add your keywords" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "id": "i7SkWPpnxYaW", + "outputId": "28e82563-edba-4b41-acad-ec27e5ba134f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Snippet 1:\n", + "Watzlawick, P . 1964. An Anthology of Human Communication. Palo Alto, CA: \n", + "Science and Behavior Books.\n", + "Weizenbaum, J. 1976. Computer Power and Human Reason: From Judgment to \n", + "Calculation. 
San Francisc\n", + "--------------------------------------------------------------------------------\n" + ] + } + ], + "source": [ + "query_keywords = [] # add your keywords\n", + "for i, doc in enumerate(retrieved_docs[:1]):\n", + " snippet = doc.page_content[:200]\n", + " highlighted = highlight_keywords(snippet, query_keywords)\n", + " print(f\"Snippet {i+1}:\\n{highlighted}\\n{'-'*80}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AhV_Jf_LxYaX" + }, + "source": [ + "1. `query_keywords` is a list of keywords to be highlighted.\n", + "2. The loop iterates over the first document in retrieved_docs.\n", + "3. For each document, a snippet of the first 200 characters is extracted.\n", + "4. The highlight_keywords function is called to highlight the keywords in the snippet.\n", + "5. The highlighted snippet is printed along with a separator line." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pBRKysAvxYaX" + }, + "source": [ + "

Bonus

" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qj25lCybxYaX" + }, + "source": [ + "**Try loading one of your own PDF books and go through the steps again to explore how the pipeline works with your content**:\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.2" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}