diff --git a/your-code/main.ipynb b/your-code/main.ipynb deleted file mode 100644 index e3a225a..0000000 --- a/your-code/main.ipynb +++ /dev/null @@ -1,709 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "RsnCPbdkxYZd" - }, - "source": [ - "
\n", - "

Self-Guided Lab: Retrieval-Augmented Generation (RAGs)

\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "tZp4BQAVxYZj" - }, - "source": [ - "
\n", - " \"NLP\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "gizk6HCYxYZo" - }, - "source": [ - "

Data Storage & Retrieval

\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "QW5UOI8ZxYZp" - }, - "source": [ - "

PyPDFLoader

\n", - "\n", - "`PyPDFLoader` is a lightweight Python library designed to streamline the process of loading and parsing PDF documents for text processing tasks. It is particularly useful in Retrieval-Augmented Generation workflows where text extraction from PDFs is required.\n", - "\n", - "- **What Does PyPDFLoader Do?**\n", - " - Extracts text from PDF files, retaining formatting and layout.\n", - " - Simplifies the preprocessing of document-based datasets.\n", - " - Supports efficient and scalable loading of large PDF collections.\n", - "\n", - "- **Key Features:**\n", - " - Compatible with popular NLP libraries and frameworks.\n", - " - Handles multi-page PDFs and embedded images (e.g., OCR-compatible setups).\n", - " - Provides flexible configurations for structured text extraction.\n", - "\n", - "- **Use Cases:**\n", - " - Preparing PDF documents for retrieval-based systems in RAGs.\n", - " - Automating the text extraction pipeline for document analysis.\n", - " - Creating datasets from academic papers, technical manuals, and reports.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install langchain langchain_community pypdf\n", - "%pip install termcolor langchain_openai langchain-huggingface sentence-transformers chromadb langchain_chroma tiktoken openai python-dotenv\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "6heKZkQUxYZr" - }, - "outputs": [], - "source": [ - "import os\n", - "from langchain.document_loaders import PyPDFLoader\n", - "from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter\n", - "import warnings\n", - "warnings.filterwarnings('ignore')\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "sRS44B2XxYZs", - "vscode": { - "languageId": "plaintext" - } - }, - "source": [ - "

Loading the Documents

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "cuREtJRixYZt" - }, - "outputs": [], - "source": [ - "# File path for the document\n", - "\n", - "file_path = \"LAB/ai-for-everyone.pdf\"" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pz_8SOLxxYZt" - }, - "source": [ - "

Documents into Pages

\n", - "\n", - "The `PyPDFLoader` library allows efficient loading and splitting of PDF documents into smaller, manageable parts for NLP tasks.\n", - "\n", - "This functionality is particularly useful in workflows requiring granular text processing, such as Retrieval-Augmented Generation (RAG).\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "_b5Z_45UxYZu", - "outputId": "a600d69f-14fe-4492-f236-97261d6ff36c" - }, - "outputs": [], - "source": [ - "# Load and split the document\n", - "loader = PyPDFLoader(file_path)\n", - "pages = loader.load_and_split()\n", - "len(pages)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "wt50NRQaxYZv" - }, - "source": [ - "

Pages into Chunks

\n", - "\n", - "\n", - "#### RecursiveCharacterTextSplitter in LangChain\n", - "\n", - "The `RecursiveCharacterTextSplitter` is the **recommended splitter** in LangChain when you want to break down long documents into smaller, semantically meaningful chunks — especially useful in **RAG pipelines**, where clean context chunks lead to better LLM responses.\n", - "\n", - "#### Parameters\n", - "\n", - "| Parameter | Description |\n", - "|-----------------|-----------------------------------------------------------------------------|\n", - "| `chunk_size` | The **maximum number of characters** allowed in a chunk (e.g., `1000`). |\n", - "| `chunk_overlap` | The number of **overlapping characters** between consecutive chunks (e.g., `200`). This helps preserve context continuity. |\n", - "\n", - "#### How it works\n", - "`RecursiveCharacterTextSplitter` attempts to split the text **intelligently**, trying the following separators in order:\n", - "1. Paragraphs (`\"\\n\\n\"`)\n", - "2. Lines (`\"\\n\"`)\n", - "3. Sentences or words (`\" \"`)\n", - "4. 
Individual characters (as a last resort)\n", - "\n", - "This makes it ideal for handling **natural language documents**, such as PDFs, articles, or long reports, without breaking sentences or paragraphs in awkward ways.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "text_splitter = RecursiveCharacterTextSplitter(\n", - " chunk_size=1000,\n", - " chunk_overlap=200\n", - ")\n", - "chunks = text_splitter.split_documents(pages)\n", - "\n", - "len(chunks)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Alternative: CharacterTextSplitter\n", - "\n", - "`CharacterTextSplitter` is a simpler splitter that breaks text into chunks using a **single separator and character count**, without recursively trying to preserve natural language structure.\n", - "\n", - "##### Example:\n", - "```python\n", - "from langchain_text_splitters import CharacterTextSplitter\n", - "\n", - "text_splitter = CharacterTextSplitter(\n", - " chunk_size=1000,\n", - " chunk_overlap=200\n", - ")\n", - "```\n", - "\n", - "This method is faster and more predictable but may split text in the middle of a sentence or paragraph, which can hurt performance in downstream tasks like retrieval or QA.\n", - "\n", - "---\n", - "\n", - "#### Comparison Table\n", - "\n", - "| Feature | RecursiveCharacterTextSplitter | CharacterTextSplitter |\n", - "| ------------------------------ | ------------------------------ | ------------------------- |\n", - "| Structure-aware splitting | Yes | No |\n", - "| Preserves sentences/paragraphs | Yes | No |\n", - "| Risk of splitting mid-sentence | Minimal | High |\n", - "| Ideal for RAG/document QA | Highly recommended | Only if structured text |\n", - "| Performance speed | Slightly slower | Faster |\n", - "\n", - "---\n", - "\n", - "#### Recommendation\n", - "\n", - "Use `RecursiveCharacterTextSplitter` for most real-world document processing tasks, especially when building RAG pipelines or 
working with structured natural language content like PDFs or articles." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Best Practices for Choosing Chunk Size in RAG\n", - "\n", - "| Factor | Recommendation |\n", - "| ---------------------------| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n", - "| **LLM context limit** | Choose a chunk size that lets you retrieve multiple chunks **without exceeding the model’s token limit**. For example, GPT-4o supports 128k tokens, but with GPT-3.5 (16k) or GPT-4 (32k), keep it modest. |\n", - "| **Chunk size (in characters)** | Typically: **500–1,000 characters** per chunk → ~125–250 tokens. This fits well for retrieval + prompt without context overflow. |\n", - "| **Chunk size (in tokens)** | If using a token-based splitter (e.g. `TokenTextSplitter`): aim for **100–300 tokens** per chunk. |\n", - "| **Chunk overlap** | Use **overlap of 10–30%** (e.g., 100–300 characters or ~50 tokens) to preserve context across chunk boundaries and avoid cutting off important ideas mid-sentence. |\n", - "| **Document structure** | Use **`RecursiveCharacterTextSplitter`** to preserve semantic boundaries (paragraphs, sentences) instead of arbitrary cuts. |\n", - "| **Task type** | For **question answering**, smaller chunks (~500–800 chars) reduce noise.
For **summarization**, slightly larger chunks (~1000–1500) are OK. |\n", - "| **Embedding model** | Some models (e.g., `text-embedding-3-large`) can handle long input. But still, smaller chunks give **finer-grained retrieval**, which improves relevance. |\n", - "| **Query type** | If users ask **very specific questions**, small focused chunks are better. For broader queries, bigger chunks might help. |\n", - "\n", - "\n", - "### Rule of Thumb\n", - "\n", - "| Use Case | Chunk Size | Overlap |\n", - "| ------------------------| --------------- | ------- |\n", - "| Factual Q&A | 500–800 chars | 100–200 |\n", - "| Summarization | 1000–1500 chars | 200–300 |\n", - "| Technical documents | 400–700 chars | 100–200 |\n", - "| Long reports/books | 800–1200 chars | 200–300 |\n", - "| Small LLMs (≤16k tokens) | ≤800 chars | 100–200 |\n", - "\n", - "\n", - "### Avoid\n", - "\n", - "- Chunks >2000 characters: risks context overflow.\n", - "- No overlap: may lose key information between chunks.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Mg15RjVPxYZw" - }, - "source": [ - "
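The chunking guidance above can be sanity-checked with a little arithmetic: with chunk size C and overlap O, each chunk after the first adds C - O fresh characters, so an N-character document splits into roughly ceil((N - O) / (C - O)) chunks. A minimal pure-Python sketch (the 120,000-character document length is a made-up example; real splitters that respect paragraph boundaries will produce somewhat more chunks):

```python
import math

def estimate_chunk_count(n_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Rough lower bound on the chunks a character-based splitter produces."""
    step = chunk_size - chunk_overlap  # fresh characters contributed per chunk
    return max(1, math.ceil((n_chars - chunk_overlap) / step))

# A hypothetical 120,000-character book with the lab's settings (1000/200):
print(estimate_chunk_count(120_000, chunk_size=1000, chunk_overlap=200))  # 150
```

Comparing this estimate against `len(chunks)` is a quick way to spot a misconfigured splitter.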

Embeddings

\n", - "\n", - "Embeddings transform text into dense vector representations, capturing semantic meaning and contextual relationships. They are essential for efficient document retrieval and similarity analysis.\n", - "\n", - "- **What are OpenAI Embeddings?**\n", - " - Pre-trained embeddings like `text-embedding-3-large` generate high-quality vector representations for text.\n", - " - Encapsulate semantic relationships in the text, enabling robust NLP applications.\n", - "\n", - "- **Key Features of `text-embedding-3-large`:**\n", - " - Large-scale embedding model optimized for accuracy and versatility.\n", - " - Handles diverse NLP tasks, including retrieval, classification, and clustering.\n", - " - Ideal for applications with high-performance requirements.\n", - "\n", - "- **Benefits:**\n", - " - Reduces the need for extensive custom training.\n", - " - Provides state-of-the-art performance in retrieval-augmented systems.\n", - " - Compatible with RAGs to create powerful context-aware models.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "L0xDxElwxYZw" - }, - "outputs": [], - "source": [ - "from langchain.embeddings import OpenAIEmbeddings\n", - "from dotenv import load_dotenv" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "_WRIo3_0xYZx", - "outputId": "78bfbbf3-9d25-4e31-bdbc-3e932e6bbfec" - }, - "outputs": [], - "source": [ - "load_dotenv()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "MNZfTng5xYZz", - "outputId": "db1a7c85-ef9f-447e-92cd-9d097e959847" - }, - "outputs": [], - "source": [ - "api_key = os.getenv(\"OPENAI_API_KEY\")\n", - "embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "EsSA7RKvxYZz" - }, - "source": [ - "

ChromaDB

\n", - "\n", - "ChromaDB is a versatile vector database designed for efficiently storing and retrieving embeddings. It integrates seamlessly with embedding models to enable high-performance similarity search and context-based retrieval.\n", - "\n", - "### Workflow Overview:\n", - "- **Step 1:** Generate embeddings using a pre-trained model (e.g., OpenAI's `text-embedding-3-large`).\n", - "- **Step 2:** Store the embeddings in ChromaDB for efficient retrieval and similarity calculations.\n", - "- **Step 3:** Use the stored embeddings to perform searches, matching, or context-based retrieval.\n", - "\n", - "### Key Features of ChromaDB:\n", - "- **Scalability:** Handles large-scale datasets with optimized indexing and search capabilities.\n", - "- **Speed:** Provides fast and accurate retrieval of embeddings for real-time applications.\n", - "- **Integration:** Supports integration with popular frameworks and libraries for embedding generation." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "brKe6wUgxYZ0" - }, - "outputs": [], - "source": [ - "from langchain.vectorstores import Chroma" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "VkjHR-RkxYZ0", - "outputId": "bc11bda9-f283-457a-f584-5a06b95c4dd9" - }, - "outputs": [], - "source": [ - "db = Chroma.from_documents(chunks, embeddings, persist_directory=\"./chroma_db_LAB\")\n", - "print(\"ChromaDB created with document embeddings.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "27OdN1IVxYZ1" - }, - "source": [ - "

Retrieving Documents

\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercice1: Write a user question that someone might ask about your book’s topic or content." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "XiLv-TfrxYZ1" - }, - "outputs": [], - "source": [ - "user_question = \"\" # User question\n", - "retrieved_docs = db.similarity_search(user_question, k=10) # k is the number of documents to retrieve" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "qgWsh50JxYZ1", - "outputId": "c8640c5d-5955-471f-fdd2-37096f5f68c7" - }, - "outputs": [], - "source": [ - "# Display top results\n", - "for i, doc in enumerate(retrieved_docs[:3]): # Display top 3 results\n", - " print(f\"Document {i+1}:\\n{doc.page_content[36:1000]}\") # Display content" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "XuGK8gL6xYZ1" - }, - "source": [ - "

Preparing Content for GenAI

" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2iB3lZqHxYZ2" - }, - "outputs": [], - "source": [ - "def _get_document_prompt(docs):\n", - " prompt = \"\\n\"\n", - " for doc in docs:\n", - " prompt += \"\\nContent:\\n\"\n", - " prompt += doc.page_content + \"\\n\\n\"\n", - " return prompt" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "2okzmuADxYZ2", - "outputId": "0aa6cdca-188d-40e0-f5b4-8888d3549ea4" - }, - "outputs": [], - "source": [ - "# Generate a formatted context from the retrieved documents\n", - "formatted_context = _get_document_prompt(retrieved_docs)\n", - "print(\"Context formatted for GPT model.\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "qzIczQNTxYZ2" - }, - "source": [ - "

ChatBot Architecture

" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercice2: Write a prompt that is relevant and tailored to the content and style of your book." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "tqxVh9s3xYZ3", - "outputId": "97cca95d-4ab3-44d8-a76c-5713aad387d8" - }, - "outputs": [], - "source": [ - "prompt = f\"\"\"\n", - "\n", - "\n", - "\"\"\"\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "0mjkQJ_ZxYZ3" - }, - "outputs": [], - "source": [ - "import openai" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercice3: Tune parameters like temperature, and penalties to control how creative, focused, or varied the model's responses are." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "ylypRWRlxYZ4" - }, - "outputs": [], - "source": [ - "# Set up GPT client and parameters\n", - "client = openai.OpenAI()\n", - "model_params = {\n", - " 'model': 'gpt-4o',\n", - " 'temperature': , # Increase creativity\n", - " 'max_tokens': , # Allow for longer responses\n", - " 'top_p': , # Use nucleus sampling\n", - " 'frequency_penalty': , # Reduce repetition\n", - " 'presence_penalty': # Encourage new topics\n", - "}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "C8e942xDxYZ4" - }, - "source": [ - "

Response

\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "4eXZO4pIxYZ4" - }, - "outputs": [], - "source": [ - "messages = [{'role': 'user', 'content': prompt}]\n", - "completion = client.chat.completions.create(messages=messages, **model_params, timeout=120)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "wLPAcchBxYZ5", - "outputId": "976c7800-16ed-41fe-c4cf-58f60d3230d2" - }, - "outputs": [], - "source": [ - "answer = completion.choices[0].message.content\n", - "print(answer)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "VXVNXPwLxYaT" - }, - "source": [ - "\"NLP" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "ldybhlqKxYaT" - }, - "source": [ - "

Cosine Similarity

\n", - "\n", - "**Cosine similarity** is a metric used to measure the alignment or similarity between two vectors, calculated as the cosine of the angle between them. It is the **most common metric used in RAG pipelines** for vector retrieval.. It provides a scale from -1 to 1:\n", - "\n", - "- **-1**: Vectors are completely opposite.\n", - "- **0**: Vectors are orthogonal (uncorrelated or unrelated).\n", - "- **1**: Vectors are identical.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "1c1I1TNhxYaT" - }, - "source": [ - "\"NLP" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "EoEMdNgQxYaU" - }, - "source": [ - "

Keyword Highlighting

\n", - "\n", - "Highlighting important keywords helps users quickly understand the relevance of the retrieved text to their query." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "nCXL9Cz1xYaV" - }, - "outputs": [], - "source": [ - "from termcolor import colored" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "xwDyofY0xYaV" - }, - "source": [ - "The `highlight_keywords` function is designed to highlight specific keywords within a given text. It replaces each keyword in the text with a highlighted version using the `colored` function from the `termcolor` library.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "9y3E0YWExYaV" - }, - "outputs": [], - "source": [ - "def highlight_keywords(text, keywords):\n", - " for keyword in keywords:\n", - " text = text.replace(keyword, colored(keyword, 'green', attrs=['bold']))\n", - " return text" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Exercice4: add your keywords" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "i7SkWPpnxYaW", - "outputId": "28e82563-edba-4b41-acad-ec27e5ba134f" - }, - "outputs": [], - "source": [ - "query_keywords = [] # add your keywords\n", - "for i, doc in enumerate(retrieved_docs[:1]):\n", - " snippet = doc.page_content[:200]\n", - " highlighted = highlight_keywords(snippet, query_keywords)\n", - " print(f\"Snippet {i+1}:\\n{highlighted}\\n{'-'*80}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "AhV_Jf_LxYaX" - }, - "source": [ - "1. `query_keywords` is a list of keywords to be highlighted.\n", - "2. The loop iterates over the first document in retrieved_docs.\n", - "3. For each document, a snippet of the first 200 characters is extracted.\n", - "4. The highlight_keywords function is called to highlight the keywords in the snippet.\n", - "5. The highlighted snippet is printed along with a separator line." 
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "pBRKysAvxYaX" - }, - "source": [ - "

Bonus

" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Qj25lCybxYaX" - }, - "source": [ - "**Try loading one of your own PDF books and go through the steps again to explore how the pipeline works with your content**:\n" - ] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "llm", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.12.10" - } - }, - "nbformat": 4, - "nbformat_minor": 0 -} diff --git a/your-code/rag lab.ipynb b/your-code/rag lab.ipynb new file mode 100644 index 0000000..bfa9db3 --- /dev/null +++ b/your-code/rag lab.ipynb @@ -0,0 +1,1026 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "RsnCPbdkxYZd" + }, + "source": [ + "
\n", + "

Self-Guided Lab: Retrieval-Augmented Generation (RAGs)

\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tZp4BQAVxYZj" + }, + "source": [ + "
\n", + " \"NLP\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gizk6HCYxYZo" + }, + "source": [ + "

Data Storage & Retrieval

\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QW5UOI8ZxYZp" + }, + "source": [ + "

PyPDFLoader

\n", + "\n", + "`PyPDFLoader` is a lightweight Python library designed to streamline the process of loading and parsing PDF documents for text processing tasks. It is particularly useful in Retrieval-Augmented Generation workflows where text extraction from PDFs is required.\n", + "\n", + "- **What Does PyPDFLoader Do?**\n", + " - Extracts text from PDF files, retaining formatting and layout.\n", + " - Simplifies the preprocessing of document-based datasets.\n", + " - Supports efficient and scalable loading of large PDF collections.\n", + "\n", + "- **Key Features:**\n", + " - Compatible with popular NLP libraries and frameworks.\n", + " - Handles multi-page PDFs and embedded images (e.g., OCR-compatible setups).\n", + " - Provides flexible configurations for structured text extraction.\n", + "\n", + "- **Use Cases:**\n", + " - Preparing PDF documents for retrieval-based systems in RAGs.\n", + " - Automating the text extraction pipeline for document analysis.\n", + " - Creating datasets from academic papers, technical manuals, and reports.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Requirement already satisfied: langchain in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (1.2.7)\n", + "Requirement already satisfied: langchain_community in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (0.4.1)\n", + "Requirement already satisfied: pypdf in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (6.6.0)\n", + "Requirement already satisfied: langchain-core<2.0.0,>=1.2.7 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain) (1.2.7)\n", + "Requirement already satisfied: langgraph<1.1.0,>=1.0.7 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain) (1.0.7)\n", + 
"Requirement already satisfied: pydantic<3.0.0,>=2.7.4 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain) (2.12.5)\n", + "Requirement already satisfied: langchain-classic<2.0.0,>=1.0.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (1.0.1)\n", + "Requirement already satisfied: SQLAlchemy<3.0.0,>=1.4.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (2.0.46)\n", + "Requirement already satisfied: requests<3.0.0,>=2.32.5 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (2.32.5)\n", + "Requirement already satisfied: PyYAML<7.0.0,>=5.3.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (6.0.3)\n", + "Requirement already satisfied: aiohttp<4.0.0,>=3.8.3 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (3.13.3)\n", + "Requirement already satisfied: tenacity!=8.4.0,<10.0.0,>=8.1.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (9.1.2)\n", + "Requirement already satisfied: dataclasses-json<0.7.0,>=0.6.7 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (0.6.7)\n", + "Requirement already satisfied: pydantic-settings<3.0.0,>=2.10.1 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (2.12.0)\n", + "Requirement already satisfied: langsmith<1.0.0,>=0.1.125 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (0.6.4)\n", + "Requirement already satisfied: httpx-sse<1.0.0,>=0.4.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from 
langchain_community) (0.4.3)\n", + "Requirement already satisfied: numpy>=1.26.2 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain_community) (2.3.3)\n", + "Requirement already satisfied: aiohappyeyeballs>=2.5.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (2.6.1)\n", + "Requirement already satisfied: aiosignal>=1.4.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (1.4.0)\n", + "Requirement already satisfied: attrs>=17.3.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (25.4.0)\n", + "Requirement already satisfied: frozenlist>=1.1.1 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (1.8.0)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (6.7.0)\n", + "Requirement already satisfied: propcache>=0.2.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (0.4.1)\n", + "Requirement already satisfied: yarl<2.0,>=1.17.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from aiohttp<4.0.0,>=3.8.3->langchain_community) (1.22.0)\n", + "Requirement already satisfied: marshmallow<4.0.0,>=3.18.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from dataclasses-json<0.7.0,>=0.6.7->langchain_community) (3.26.2)\n", + "Requirement already satisfied: typing-inspect<1,>=0.4.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from 
dataclasses-json<0.7.0,>=0.6.7->langchain_community) (0.9.0)\n", + "Requirement already satisfied: langchain-text-splitters<2.0.0,>=1.1.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain-classic<2.0.0,>=1.0.0->langchain_community) (1.1.0)\n", + "Requirement already satisfied: jsonpatch<2.0.0,>=1.33.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain-core<2.0.0,>=1.2.7->langchain) (1.33)\n", + "Requirement already satisfied: packaging<26.0.0,>=23.2.0 in c:\\users\\parte\\appdata\\roaming\\python\\python311\\site-packages (from langchain-core<2.0.0,>=1.2.7->langchain) (25.0)\n", + "Requirement already satisfied: typing-extensions<5.0.0,>=4.7.0 in c:\\users\\parte\\appdata\\roaming\\python\\python311\\site-packages (from langchain-core<2.0.0,>=1.2.7->langchain) (4.15.0)\n", + "Requirement already satisfied: uuid-utils<1.0,>=0.12.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langchain-core<2.0.0,>=1.2.7->langchain) (0.14.0)\n", + "Requirement already satisfied: langgraph-checkpoint<5.0.0,>=2.1.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langgraph<1.1.0,>=1.0.7->langchain) (4.0.0)\n", + "Requirement already satisfied: langgraph-prebuilt<1.1.0,>=1.0.7 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langgraph<1.1.0,>=1.0.7->langchain) (1.0.7)\n", + "Requirement already satisfied: langgraph-sdk<0.4.0,>=0.3.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langgraph<1.1.0,>=1.0.7->langchain) (0.3.3)\n", + "Requirement already satisfied: xxhash>=3.5.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langgraph<1.1.0,>=1.0.7->langchain) (3.6.0)\n", + "Requirement already satisfied: httpx<1,>=0.23.0 in 
c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langsmith<1.0.0,>=0.1.125->langchain_community) (0.28.1)\n", + "Requirement already satisfied: orjson>=3.9.14 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langsmith<1.0.0,>=0.1.125->langchain_community) (3.11.5)\n", + "Requirement already satisfied: requests-toolbelt>=1.0.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langsmith<1.0.0,>=0.1.125->langchain_community) (1.0.0)\n", + "Requirement already satisfied: zstandard>=0.23.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from langsmith<1.0.0,>=0.1.125->langchain_community) (0.25.0)\n", + "Requirement already satisfied: annotated-types>=0.6.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from pydantic<3.0.0,>=2.7.4->langchain) (0.7.0)\n", + "Requirement already satisfied: pydantic-core==2.41.5 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from pydantic<3.0.0,>=2.7.4->langchain) (2.41.5)\n", + "Requirement already satisfied: typing-inspection>=0.4.2 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from pydantic<3.0.0,>=2.7.4->langchain) (0.4.2)\n", + "Requirement already satisfied: python-dotenv>=0.21.0 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from pydantic-settings<3.0.0,>=2.10.1->langchain_community) (1.2.1)\n", + "Requirement already satisfied: charset_normalizer<4,>=2 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from requests<3.0.0,>=2.32.5->langchain_community) (3.4.4)\n", + "Requirement already satisfied: idna<4,>=2.5 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from requests<3.0.0,>=2.32.5->langchain_community) (3.11)\n", + "Requirement already 
satisfied: urllib3<3,>=1.21.1 in c:\\users\\parte\\appdata\\local\\programs\\python\\python311\\lib\\site-packages (from requests<3.0.0,>=2.32.5->langchain_community) (2.5.0)\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "\n", + "[notice] A new release of pip available: 22.3.1 -> 25.3\n", + "[notice] To update, run: c:\\Users\\parte\\AppData\\Local\\Programs\\Python\\Python311\\python.exe -m pip install --upgrade pip\n" + ] + } + ], + "source": [ + "%pip install langchain langchain_community pypdf\n", + "%pip install termcolor langchain_openai langchain-huggingface sentence-transformers chromadb langchain_chroma tiktoken openai python-dotenv\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "WARNING:tensorflow:From c:\\Users\\parte\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\tf_keras\\src\\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.\n", + "\n" + ] + } + ], + "source": [ + "import os\n", + "import warnings\n", + "warnings.filterwarnings(\"ignore\")\n", + "\n", + "from langchain_community.document_loaders import PyPDFLoader\n", + "from langchain_text_splitters import (\n", + "    CharacterTextSplitter,\n", + "    RecursiveCharacterTextSplitter\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sRS44B2XxYZs", + "vscode": { + "languageId": "plaintext" + } + }, + "source": [ + "

Loading the Documents

" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "id": "cuREtJRixYZt" + }, + "outputs": [], + "source": [ + "# File path for the document\n", + "\n", + "file_path = r\"C:\\week18\\lab-intro-rag\\ai-for-everyone.pdf\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pz_8SOLxxYZt" + }, + "source": [ + "

Documents into pages

\n", + "\n", + "The `PyPDFLoader` library allows efficient loading and splitting of PDF documents into smaller, manageable parts for NLP tasks.\n", + "\n", + "This functionality is particularly useful in workflows requiring granular text processing, such as Retrieval-Augmented Generation (RAG).\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "310" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from langchain_community.document_loaders import PyPDFLoader\n", + "\n", + "loader = PyPDFLoader(file_path)\n", + "pages = loader.load() # already split by page\n", + "\n", + "len(pages)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wt50NRQaxYZv" + }, + "source": [ + "
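To make the page objects concrete without the PDF at hand, here is a plain-Python sketch of the shape that `loader.load()` returns: one document per page, each carrying `page_content` plus metadata such as the source path and page number. `SimpleDoc` is a hypothetical stand-in; LangChain's real `Document` class has more fields.

```python
from dataclasses import dataclass, field

@dataclass
class SimpleDoc:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

# What loader.load() conceptually produces: one document per PDF page.
pages = [
    SimpleDoc("AI is a general-purpose technology...",
              {"source": "ai-for-everyone.pdf", "page": 0}),
    SimpleDoc("What machine learning can and cannot do...",
              {"source": "ai-for-everyone.pdf", "page": 1}),
]

print(len(pages))                 # number of pages loaded
print(pages[0].metadata["page"])  # page numbers start at 0
```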

Pages into Chunks

\n", + "\n", + "\n", + "#### RecursiveCharacterTextSplitter in LangChain\n", + "\n", + "The `RecursiveCharacterTextSplitter` is the **recommended splitter** in LangChain when you want to break down long documents into smaller, semantically meaningful chunks — especially useful in **RAG pipelines**, where clean context chunks lead to better LLM responses.\n", + "\n", + "#### Parameters\n", + "\n", + "| Parameter | Description |\n", + "|-----------------|-----------------------------------------------------------------------------|\n", + "| `chunk_size` | The **maximum number of characters** allowed in a chunk (e.g., `1000`). |\n", + "| `chunk_overlap` | The number of **overlapping characters** between consecutive chunks (e.g., `200`). This helps preserve context continuity. |\n", + "\n", + "#### How it works\n", + "`RecursiveCharacterTextSplitter` attempts to split the text **intelligently**, trying the following separators in order:\n", + "1. Paragraphs (`\"\\n\\n\"`)\n", + "2. Lines (`\"\\n\"`)\n", + "3. Sentences or words (`\" \"`)\n", + "4. 
Individual characters (as a last resort)\n", + "\n", + "This makes it ideal for handling **natural language documents**, such as PDFs, articles, or long reports, without breaking sentences or paragraphs in awkward ways.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1096" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "text_splitter = RecursiveCharacterTextSplitter(\n", + " chunk_size=1000,\n", + " chunk_overlap=200\n", + ")\n", + "chunks = text_splitter.split_documents(pages)\n", + "\n", + "len(chunks)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Alternative: CharacterTextSplitter\n", + "\n", + "`CharacterTextSplitter` is a simpler splitter that breaks text into chunks based **purely on character count**, without trying to preserve any natural language structure.\n", + "\n", + "##### Example:\n", + "```python\n", + "from langchain.text_splitter import CharacterTextSplitter\n", + "\n", + "text_splitter = CharacterTextSplitter(\n", + " chunk_size=1000,\n", + " chunk_overlap=200\n", + ")\n", + "````\n", + "\n", + "This method is faster and more predictable but may split text in the middle of a sentence or paragraph, which can hurt performance in downstream tasks like retrieval or QA.\n", + "\n", + "---\n", + "\n", + "#### Comparison Table\n", + "\n", + "| Feature | RecursiveCharacterTextSplitter | CharacterTextSplitter |\n", + "| ------------------------------ | ------------------------------ | ------------------------- |\n", + "| Structure-aware splitting | Yes | No |\n", + "| Preserves sentence/paragraphs | Yes | No |\n", + "| Risk of splitting mid-sentence | Minimal | High |\n", + "| Ideal for RAG/document QA | Highly recommended | Only if structured text |\n", + "| Performance speed | Slightly slower | Faster |\n", + "\n", + "---\n", + "\n", + "#### Recommendation\n", + 
"\n", + "Use `RecursiveCharacterTextSplitter` for most real-world document processing tasks, especially when building RAG pipelines or working with structured natural language content like PDFs or articles." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Best Practices for Choosing Chunk Size in RAG\n", + "\n", + "### Best Practices for Chunk Size in RAG\n", + "\n", + "| Factor | Recommendation |\n", + "| ---------------------------| ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n", + "| **LLM context limit** | Choose a chunk size that lets you retrieve multiple chunks **without exceeding the model’s token limit**. For example, GPT-4o supports 128k tokens, but with GPT-3.5 (16k) or GPT-4 (32k), keep it modest. |\n", + "| **Chunk size (in characters)** | Typically: **500–1,000 characters** per chunk → ~75–200 tokens. This fits well for retrieval + prompt without context overflow. |\n", + "| **Chunk size (in tokens)** | If using token-based splitter (e.g. `TokenTextSplitter`): aim for **100–300 tokens** per chunk. |\n", + "| **Chunk overlap** | Use **overlap of 10–30%** (e.g., 100–300 characters or ~50 tokens) to preserve context across chunk boundaries and avoid cutting off important ideas mid-sentence. |\n", + "| **Document structure** | Use **`RecursiveCharacterTextSplitter`** to preserve semantic boundaries (paragraphs, sentences) instead of arbitrary cuts. |\n", + "| **Task type** | For **question answering**, smaller chunks (~500–800 chars) reduce noise.
For **summarization**, slightly larger chunks (~1000–1500) are OK. |\n", + "| **Embedding model** | Some models (e.g., `text-embedding-3-large`) can handle long input. But still, smaller chunks give **finer-grained retrieval**, which improves relevance. |\n", + "| **Query type** | If users ask **very specific questions**, small focused chunks are better. For broader queries, bigger chunks might help. |\n", + "\n", + "\n", + "### Rule of Thumb\n", + "\n", + "| Use Case | Chunk Size | Overlap |\n", + "| ------------------------| --------------- | ------- |\n", + "| Factual Q&A | 500–800 chars | 100–200 |\n", + "| Summarization | 1000–1500 chars | 200–300 |\n", + "| Technical documents | 400–700 chars | 100–200 |\n", + "| Long reports/books | 800–1200 chars | 200–300 |\n", + "| Small LLMs (≤16k tokens) | ≤800 chars | 100–200 |\n", + "\n", + "\n", + "### Avoid\n", + "\n", + "- Chunks >2000 characters: risks context overflow.\n", + "- No overlap: may lose key information between chunks.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mg15RjVPxYZw" + }, + "source": [ + "
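The separator hierarchy described above can be sketched in a few lines of plain Python. This is a simplified illustration of the recursive idea only — it ignores `chunk_overlap` and is not LangChain's actual implementation:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch of recursive splitting: try coarse separators
    first, fall back to finer ones only for oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) > chunk_size:
            # This piece alone is too big: recurse with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
        else:
            current = part
    if current:
        chunks.append(current)
    return chunks

text = (
    "Paragraph one is short.\n\n"
    "Paragraph two rambles on for quite a while longer than the limit allows.\n\n"
    "Paragraph three."
)
chunks = recursive_split(text, chunk_size=60)
print(chunks)
```

Note how the oversized middle paragraph is the only one broken at word boundaries; the short paragraphs survive intact, which is exactly why this strategy beats a blind character cut for retrieval.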

Embeddings

\n", + "\n", + "Embeddings transform text into dense vector representations, capturing semantic meaning and contextual relationships. They are essential for efficient document retrieval and similarity analysis.\n", + "\n", + "- **What are OpenAI Embeddings?**\n", + " - Pre-trained embeddings like `text-embedding-3-large` generate high-quality vector representations for text.\n", + " - Encapsulate semantic relationships in the text, enabling robust NLP applications.\n", + "\n", + "- **Key Features of `text-embedding-3-large`:**\n", + " - Large-scale embedding model optimized for accuracy and versatility.\n", + " - Handles diverse NLP tasks, including retrieval, classification, and clustering.\n", + " - Ideal for applications with high-performance requirements.\n", + "\n", + "- **Benefits:**\n", + " - Reduces the need for extensive custom training.\n", + " - Provides state-of-the-art performance in retrieval-augmented systems.\n", + " - Compatible with RAGs to create powerful context-aware models.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_openai import OpenAIEmbeddings\n", + "from dotenv import load_dotenv\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "id": "_WRIo3_0xYZx", + "outputId": "78bfbbf3-9d25-4e31-bdbc-3e932e6bbfec" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "load_dotenv()" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "id": "MNZfTng5xYZz", + "outputId": "db1a7c85-ef9f-447e-92cd-9d097e959847" + }, + "outputs": [], + "source": [ + "api_key = os.getenv(\"OPENAI_API_KEY\")\n", + "embeddings = OpenAIEmbeddings(model=\"text-embedding-3-large\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EsSA7RKvxYZz" + }, + "source": [ + "
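A quick intuition for what these vectors buy you: retrieval compares embeddings with cosine similarity, so semantically related texts score close to 1 and unrelated ones near 0. A toy example with hand-made 3-dimensional vectors (real `text-embedding-3-large` vectors have 3072 dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors, purely for illustration.
v_cat    = [0.9, 0.1, 0.0]
v_kitten = [0.8, 0.2, 0.1]
v_car    = [0.0, 0.1, 0.9]

print(cosine_similarity(v_cat, v_kitten))  # high: related meanings
print(cosine_similarity(v_cat, v_car))     # low: unrelated meanings
```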

ChromaDB

\n", + "\n", + "ChromaDB is a versatile vector database designed for efficiently storing and retrieving embeddings. It integrates seamlessly with embedding models to enable high-performance similarity search and context-based retrieval.\n", + "\n", + "### Workflow Overview:\n", + "- **Step 1:** Generate embeddings using a pre-trained model (e.g., OpenAI's `text-embedding-3-large`).\n", + "- **Step 2:** Store the embeddings in ChromaDB for efficient retrieval and similarity calculations.\n", + "- **Step 3:** Use the stored embeddings to perform searches, matching, or context-based retrieval.\n", + "\n", + "### Key Features of ChromaDB:\n", + "- **Scalability:** Handles large-scale datasets with optimized indexing and search capabilities.\n", + "- **Speed:** Provides fast and accurate retrieval of embeddings for real-time applications.\n", + "- **Integration:** Supports integration with popular frameworks and libraries for embedding generation." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_community.vectorstores import Chroma\n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "id": "VkjHR-RkxYZ0", + "outputId": "bc11bda9-f283-457a-f584-5a06b95c4dd9" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "ChromaDB created with document embeddings.\n" + ] + } + ], + "source": [ + "db = Chroma.from_documents(chunks, embeddings, persist_directory=\"./chroma_db_LAB\")\n", + "print(\"ChromaDB created with document embeddings.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "27OdN1IVxYZ1" + }, + "source": [ + "
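Conceptually, `Chroma.from_documents` embeds every chunk and indexes the vectors, and `similarity_search` later embeds the query and returns the k nearest chunks. The miniature in-memory version below mirrors that flow; word-count vectors stand in for real embeddings, and `ToyVectorStore` is a hypothetical class, not part of ChromaDB:

```python
import math
from collections import Counter

class ToyVectorStore:
    """In-memory sketch of a vector store: embed on add, rank on search."""

    def __init__(self):
        self.docs = []  # list of (text, word-count vector)

    @staticmethod
    def _embed(text):
        # Stand-in embedding: bag-of-words counts.
        return Counter(text.lower().split())

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def add(self, texts):
        for t in texts:
            self.docs.append((t, self._embed(t)))

    def similarity_search(self, query, k=3):
        q = self._embed(query)
        ranked = sorted(self.docs, key=lambda d: self._cosine(q, d[1]), reverse=True)
        return [t for t, _ in ranked[:k]]

store = ToyVectorStore()
store.add([
    "Machine learning models learn patterns from data.",
    "The recipe calls for two cups of flour.",
    "Neural networks are a family of machine learning models.",
])
results = store.similarity_search("what is machine learning", k=2)
print(results)
```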

Retrieving Documents

\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercise 1: Write a user question that someone might ask about your book’s topic or content." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "id": "XiLv-TfrxYZ1" + }, + "outputs": [], + "source": [ + "user_question = \"\" # User question\n", + "retrieved_docs = db.similarity_search(user_question, k=10) # k is the number of documents to retrieve" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "id": "qgWsh50JxYZ1", + "outputId": "c8640c5d-5955-471f-fdd2-37096f5f68c7" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Document 1:\n", + "f Human Communication. Palo Alto, CA: \n", + "Science and Behavior Books.\n", + "Weizenbaum, J. 1976. Computer Power and Human Reason: From Judgment to \n", + "Calculation. San Francisco: W . H. Freeman.\n", + "Document 2:\n", + "– and when the difference between human \n", + "and machine is affirmed at the cost of their unity that is negated – done so by \n", + "disconnections. The way out is the establishment of a relation through affirm-\n", + "ing both the identity of, and the difference between, the two sides – as done by\n", + "Document 3:\n", + "ne, Not a Camera: How Financial Models Shape \n", + "Markets. (1st edn.). Cambridge, MA: The MIT Press.\n", + "Malik, M. M. 2020. A Hierarchy of Limitations in Machine Learning. \n", + "ArXiv:2002.05193 [Cs, Econ, Math, Stat] , February. http://arxiv.org \n", + "/abs/2002.05193.\n", + "Marcus, G. 2018. Deep Learning: A Critical Appraisal. ArXiv:1801.00631 [Cs, \n", + "Stat], January. http://arxiv.org/abs/1801.00631.\n", + "McQuillan, D. 2015. Algorithmic States of Exception. European Journal \n", + "of Cultural Studies 18 (4–5), 564–576. DOI: https://doi.org/10.1177 \n", + "/1367549415577389.\n", + "McQuillan, D. 2017. Data Science as Machinic Neoplatonism. Philosophy & \n", + "Technolog y, August, 1–20. 
DOI: https://doi.org/10.1007/s13347-017-0273-3.\n", + "McQuillan, D. 2018. People’s Councils for Ethical Machine Learning. Social \n", + "Media + Society 4 (2). DOI: https://doi.org/10.1177/2056305118768303.\n", + "Mitchell, A. 2015. Posthumanist Post-Colonialism? Worldly (blog). 26 Feb -\n" + ] + } + ], + "source": [ + "# Display top results\n", + "for i, doc in enumerate(retrieved_docs[:3]): # Display top 3 results\n", + "    print(f\"Document {i+1}:\\n{doc.page_content[:1000]}\") # Display the first 1,000 characters" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XuGK8gL6xYZ1" + }, + "source": [ + "

Preparing Content for GenAI

" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "id": "2iB3lZqHxYZ2" + }, + "outputs": [], + "source": [ + "def _get_document_prompt(docs):\n", + " prompt = \"\\n\"\n", + " for doc in docs:\n", + " prompt += \"\\nContent:\\n\"\n", + " prompt += doc.page_content + \"\\n\\n\"\n", + " return prompt" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "id": "2okzmuADxYZ2", + "outputId": "0aa6cdca-188d-40e0-f5b4-8888d3549ea4" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Context formatted for GPT model.\n" + ] + } + ], + "source": [ + "# Generate a formatted context from the retrieved documents\n", + "formatted_context = _get_document_prompt(retrieved_docs)\n", + "print(\"Context formatted for GPT model.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qzIczQNTxYZ2" + }, + "source": [ + "

ChatBot Architecture

" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercice2: Write a prompt that is relevant and tailored to the content and style of your book." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "prompt = f\"\"\"\n", + "You are an AI assistant designed to help users understand the content of this book.\n", + "\n", + "Use ONLY the information provided in the retrieved context to answer the question.\n", + "Do not use external knowledge or make assumptions.\n", + "\n", + "If the answer is not clearly stated in the context, say:\n", + "\"The information is not available in the provided document.\"\n", + "\n", + "Keep your answer:\n", + "- Clear\n", + "- Concise\n", + "- Technically accurate\n", + "- Easy to understand for students\n", + "\n", + "Context:\n", + "{{context}}\n", + "\n", + "Question:\n", + "{{question}}\n", + "\n", + "Answer:\n", + "\"\"\"\n" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "id": "0mjkQJ_ZxYZ3" + }, + "outputs": [], + "source": [ + "import openai" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercice3: Tune parameters like temperature, and penalties to control how creative, focused, or varied the model's responses are." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "# Set up GPT client and parameters\n", + "client = openai.OpenAI()\n", + "\n", + "model_params = {\n", + " \"model\": \"gpt-4o\",\n", + " \"temperature\": 0.2, # Low creativity → factual, consistent answers\n", + " \"max_tokens\": 800, # Enough for detailed but controlled responses\n", + " \"top_p\": 0.9, # Balanced nucleus sampling\n", + " \"frequency_penalty\": 0.1, # Slightly reduce repetition\n", + " \"presence_penalty\": 0.0 # Do NOT encourage new topics in RAG\n", + "}\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C8e942xDxYZ4" + }, + "source": [ + "

Response

\n" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "id": "4eXZO4pIxYZ4" + }, + "outputs": [], + "source": [ + "messages = [{'role': 'user', 'content': prompt}]\n", + "completion = client.chat.completions.create(messages=messages, **model_params, timeout=120)" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "id": "wLPAcchBxYZ5", + "outputId": "976c7800-16ed-41fe-c4cf-58f60d3230d2" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "I'm sorry, but I need the context to answer your question. Please provide the relevant text or information from the book so I can assist you effectively.\n" + ] + } + ], + "source": [ + "answer = completion.choices[0].message.content\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VXVNXPwLxYaT" + }, + "source": [ + "\"NLP" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ldybhlqKxYaT" + }, + "source": [ + "

Cosine Similarity

\n", + "\n", + "**Cosine similarity** is a metric used to measure the alignment or similarity between two vectors, calculated as the cosine of the angle between them. It is the **most common metric used in RAG pipelines** for vector retrieval.. It provides a scale from -1 to 1:\n", + "\n", + "- **-1**: Vectors are completely opposite.\n", + "- **0**: Vectors are orthogonal (uncorrelated or unrelated).\n", + "- **1**: Vectors are identical.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1c1I1TNhxYaT" + }, + "source": [ + "\"NLP" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EoEMdNgQxYaU" + }, + "source": [ + "

Keyword Highlighting

\n", + "\n", + "Highlighting important keywords helps users quickly understand the relevance of the retrieved text to their query." + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": { + "id": "nCXL9Cz1xYaV" + }, + "outputs": [], + "source": [ + "from termcolor import colored" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xwDyofY0xYaV" + }, + "source": [ + "The `highlight_keywords` function is designed to highlight specific keywords within a given text. It replaces each keyword in the text with a highlighted version using the `colored` function from the `termcolor` library.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "id": "9y3E0YWExYaV" + }, + "outputs": [], + "source": [ + "def highlight_keywords(text, keywords):\n", + " for keyword in keywords:\n", + " text = text.replace(keyword, colored(keyword, 'green', attrs=['bold']))\n", + " return text" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Exercice4: add your keywords" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "id": "i7SkWPpnxYaW", + "outputId": "28e82563-edba-4b41-acad-ec27e5ba134f" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Snippet 1:\n", + "Watzlawick, P . 1964. An Anthology of Human Communication. Palo Alto, CA: \n", + "Science and Behavior Books.\n", + "Weizenbaum, J. 1976. Computer Power and Human Reason: From Judgment to \n", + "Calculation. 
San Francisc\n", + "--------------------------------------------------------------------------------\n" + ] + } + ], + "source": [ + "query_keywords = [] # add your keywords\n", + "for i, doc in enumerate(retrieved_docs[:1]):\n", + " snippet = doc.page_content[:200]\n", + " highlighted = highlight_keywords(snippet, query_keywords)\n", + " print(f\"Snippet {i+1}:\\n{highlighted}\\n{'-'*80}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AhV_Jf_LxYaX" + }, + "source": [ + "1. `query_keywords` is a list of keywords to be highlighted.\n", + "2. The loop iterates over the first document in retrieved_docs.\n", + "3. For each document, a snippet of the first 200 characters is extracted.\n", + "4. The highlight_keywords function is called to highlight the keywords in the snippet.\n", + "5. The highlighted snippet is printed along with a separator line." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pBRKysAvxYaX" + }, + "source": [ + "

Bonus

" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Qj25lCybxYaX" + }, + "source": [ + "**Try loading one of your own PDF books and go through the steps again to explore how the pipeline works with your content**:\n" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.2" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}