Commit b5edd78

Outline: Shrink llms-txt output to <200_000 input tokens
1 parent 481111b commit b5edd78

5 files changed

Lines changed: 48 additions & 7 deletions

CHANGES.md

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
 # About CrateDB changelog
 
 ## Unreleased
+- Outline: Shrank llms-txt output to <200_000 input tokens
 
 ## v0.0.7 - 2025-07-22
 - Prompt: Added `instructions-general.md` file when generating bundle

src/cratedb_about/bundle/llmstxt.py

Lines changed: 5 additions & 2 deletions
@@ -44,8 +44,11 @@ def run(self):
         # listing all the pages in the documentation.
         # - The `llms-full.txt` contains the entire documentation, expanded from the `llms.txt`
         # file. Note this may exceed the context window of your LLM.
-        Path(self.outdir / "llms.txt").write_text(self.outline.to_markdown())
-        Path(self.outdir / "llms-full.txt").write_text(self.outline.to_llms_txt(optional=True))
+        llms_txt = Path(self.outdir / "llms.txt")
+        llms_txt_full = Path(self.outdir / "llms-full.txt")
+
+        llms_txt.write_text(self.outline.to_markdown())
+        llms_txt_full.write_text(self.outline.to_llms_txt(optional=False))
 
         return self
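
The code comment in this hunk warns that `llms-full.txt` may exceed an LLM's context window, and the commit aims for fewer than 200_000 input tokens. A rough way to check a generated file against such a budget could look like the sketch below. It assumes about four characters per token, which is only a coarse heuristic; real tokenizers differ per model. The names `estimate_tokens`, `fits_budget`, and `CHARS_PER_TOKEN` are hypothetical and not part of `cratedb-about`.

```python
from pathlib import Path

# Budget taken from the commit title.
TOKEN_BUDGET = 200_000

# Assumption: roughly four characters per token. This is a heuristic,
# not a tokenizer; use the model's own tokenizer for exact counts.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a coarse token estimate, rounding up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_budget(path: Path, budget: int = TOKEN_BUDGET) -> bool:
    """Check whether the file at `path` is estimated to stay within `budget`."""
    return estimate_tokens(path.read_text(encoding="utf-8")) <= budget
```

For example, `fits_budget(Path("llms-full.txt"))` would give a quick, approximate signal after generating the bundle.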

src/cratedb_about/outline/cratedb-outline.yaml

Lines changed: 34 additions & 0 deletions
@@ -293,6 +293,8 @@ data:
             They also influence the behaviour when the records are queried.
           parents: [ sql-syntax ]
           tags: [ sql ]
+          # FIXME: This needs about 40_000 input tokens. Maybe a stripped-down variant could help?
+          markdown_enabled: false
 
         # SQL: Functions
         - title: "CrateDB SQL reference: Scalar functions"
@@ -388,18 +390,21 @@ data:
             The `ctk cluster {start,info,stop}` subcommands provide higher level CLI
             entrypoints to start/deploy/resume a database cluster, inquire information
             about it, and stop/suspend it again.
+          markdown_enabled: false
         - title: "Cluster API: Python"
           link: https://cratedb-toolkit.readthedocs.io/_sources/cluster/python.md.txt
           description: |
             The `cratedb_toolkit.ManagedCluster` class provides the higher level API/SDK
             entrypoints to start/deploy/resume a database cluster, inquire information
             about it, and stop/suspend it again.
+          markdown_enabled: false
         - title: "Cluster API: Tutorial"
           link: https://cratedb-toolkit.readthedocs.io/_sources/cluster/tutorial.md.txt
           description: |
             This tutorial outlines end-to-end examples connecting to the CrateDB Cloud
             API and the CrateDB database cluster. It includes examples about both the
             CrateDB Cluster CLI and the CrateDB Cluster Python API.
+          markdown_enabled: false
 
         # Drivers and clients
         - title: "CrateDB drivers and clients"
@@ -410,6 +415,7 @@ data:
           source: docs
           type: index
           id: drivers
+          markdown_enabled: false
         - title: "CrateDB Python Client"
           link: https://cratedb.com/docs/python/en/latest/_sources/index.rst.txt
           description: |
@@ -419,6 +425,7 @@ data:
             connecting to CrateDB from the Python ecosystem. It is verified to work with CPython, but it has also
             been tested successfully with PyPy.
           tags: [ driver ]
+          markdown_enabled: false
         - title: "CrateDB SQLAlchemy dialect"
           link: https://cratedb.com/docs/sqlalchemy-cratedb/_sources/index.rst.txt
           description: |
@@ -429,12 +436,14 @@ data:
             CrateDB from the Python ecosystem. It is verified to work with CPython, but it has also been tested
             successfully with PyPy.
           tags: [ driver ]
+          markdown_enabled: false
         - title: "CrateDB Driver for MicroPython"
           link: https://raw.githubusercontent.com/crate/micropython-cratedb/refs/heads/main/README.md
           description: |
             micropython-cratedb is a CrateDB driver for the MicroPython language.
             It connects to CrateDB using the HTTP Endpoint.
           tags: [ driver ]
+          markdown_enabled: false
         - title: "Python psycopg3 driver"
           link: https://www.psycopg.org/psycopg3/docs/_sources/basic/usage.rst.txt
           description: |
@@ -447,6 +456,7 @@ data:
             The basic Psycopg usage is common to all the database adapters implementing the DB-API protocol.
             Other database adapters, such as the builtin sqlite3 or psycopg2, have roughly the same pattern of interaction.
           tags: [ driver ]
+          markdown_enabled: false
         - title: "node-postgres driver"
           link: https://raw.githubusercontent.com/brianc/node-postgres/refs/heads/master/docs/pages/index.mdx
           description: |
@@ -455,6 +465,7 @@ data:
             It has support for callbacks, promises, async/await, connection pooling, prepared statements,
             cursors, streaming results, C/C++ bindings, rich type parsing, and more.
           tags: [ driver ]
+          markdown_enabled: false
         - title: "PostgreSQL JDBC Driver"
           link: https://raw.githubusercontent.com/pgjdbc/pgjdbc/refs/heads/master/docs/content/documentation/_index.md
           description: |
@@ -463,6 +474,7 @@ data:
             Pure Java (Type 4), and communicates in the PostgreSQL native network protocol. Because of this,
             the driver is platform independent; once compiled, the driver can be used on any system.
           tags: [ driver ]
+          markdown_enabled: false
         - title: "PostgreSQL driver and toolkit for Go"
           link: https://raw.githubusercontent.com/jackc/pgx/refs/heads/master/README.md
           description: |
@@ -471,23 +483,27 @@ data:
             the wire protocol and type mapping between PostgreSQL and Go. These underlying packages can be used to
             implement alternative drivers, proxies, load balancers, logical replication clients, etc.
           tags: [ driver ]
+          markdown_enabled: false
         - title: "Npgsql - .NET Access to PostgreSQL"
           link: https://raw.githubusercontent.com/npgsql/doc/refs/heads/main/conceptual/Npgsql/index.md
           description: |
             Npgsql is an open source ADO.NET Data Provider for PostgreSQL, it allows programs written in C#,
             Visual Basic, F# to access the PostgreSQL database server. It is implemented in 100% C# code,
             is free and is open source.
           tags: [ driver ]
+          markdown_enabled: false
         - title: "psqlODBC - PostgreSQL ODBC driver"
           link: https://raw.githubusercontent.com/postgresql-interfaces/psqlodbc/refs/heads/main/docs/config.html
           description: A library to talk to the PostgreSQL DBMS using ODBC.
           tags: [ driver ]
+          markdown_enabled: false
         - title: "PHP PostgreSQL PDO Driver (PDO_PGSQL)"
           link: https://raw.githubusercontent.com/php/doc-en/refs/heads/master/reference/pdo_pgsql/reference.xml
           description: |
             PDO_PGSQL is a driver that implements the PHP Data Objects (PDO) interface
             to enable access from PHP to PostgreSQL databases.
           tags: [ driver ]
+          markdown_enabled: false
 
     - name: Examples
       items:
@@ -511,23 +527,29 @@ data:
         - title: "CrateDB GTFS / GTFS-RT Transit Data Demo"
           link: https://raw.githubusercontent.com/crate/devrel-gtfs-transit/refs/heads/main/README.md
           description: Capture GTFS and GTFS-RT data for storage and analysis with CrateDB.
+          markdown_enabled: false
         - title: "CrateDB Offshore Wind Farms Demo Application"
           link: https://raw.githubusercontent.com/crate/devrel-offshore-wind-farms-demo/refs/heads/main/README.md
           description: A CrateDB demo application using data from the UK's offshore wind farms.
+          markdown_enabled: false
         - title: "CrateDB RAG / Hybrid Search PDF Chatbot"
           link: https://raw.githubusercontent.com/crate/devrel-pdf-rag-chatbot/refs/heads/main/README.md
           description: A chatbot powered by CrateDB using RAG techniques and data from PDF files.
+          markdown_enabled: false
         - title: "CrateDB Geospatial Data Demo"
           link: https://raw.githubusercontent.com/crate/devrel-shipping-forecast-geo-demo/refs/heads/main/README.md
           description: Spatial data demo application using CrateDB and the Express framework.
+          markdown_enabled: false
         - title: "Plane Spotting with Software Defined Radio, CrateDB and Node.js"
           link: https://raw.githubusercontent.com/crate/devrel-plane-spotting-with-cratedb/refs/heads/main/README.md
           description: Code for the Plane Spotting with Software Defined Radio, CrateDB and Node.js talk.
+          markdown_enabled: false
         - title: "MongoDB/CrateDB/Grafana CDC Demonstration"
           link: https://raw.githubusercontent.com/crate/devrel-mongo-cdc-demo/refs/heads/main/README.md
           description: |
             A small Python project that demonstrates how a CrateDB database can be populated and kept
             in sync with a collection in MongoDB using Change Data Capture (CDC).
+          markdown_enabled: false
 
     - name: Optional
       items:
@@ -541,36 +563,44 @@ data:
           type: index
           id: cloud
           parents: [ cloud ]
+          markdown_enabled: false
         - title: "CrateDB Cloud: Services"
           link: https://cratedb.com/docs/cloud/en/latest/_sources/reference/services.md.txt
           description: Services specifications and variants of CrateDB Cloud.
           parents: [ cloud ]
+          markdown_enabled: false
         - title: "CrateDB Cloud: Billing"
           link: https://cratedb.com/docs/cloud/en/latest/_sources/organization/billing.md.txt
           description: How billing works in CrateDB Cloud.
           parents: [ cloud ]
+          markdown_enabled: false
         - title: "CrateDB Cloud: API"
           link: https://cratedb.com/docs/cloud/en/latest/_sources/organization/api.md.txt
           description: CrateDB Cloud provides an HTTP API for programmatic cluster and resource management.
           parents: [ cloud ]
+          markdown_enabled: false
         - title: "CrateDB Cloud: Import data"
           link: https://cratedb.com/docs/cloud/en/latest/_sources/cluster/import.md.txt
           description: How to conveniently import data into CrateDB Cloud.
           parents: [ cloud ]
+          markdown_enabled: false
         - title: "CrateDB Cloud: Export data"
           link: https://cratedb.com/docs/cloud/en/latest/_sources/cluster/export.md.txt
           description: How to conveniently export data from CrateDB Cloud.
           parents: [ cloud ]
+          markdown_enabled: false
         - title: "CrateDB Cloud: Automatic backups"
           link: https://cratedb.com/docs/cloud/en/latest/_sources/cluster/backups.md.txt
           description: How automatic backups work in CrateDB Cloud.
           parents: [ cloud ]
+          markdown_enabled: false
         - title: "CrateDB Cloud: MongoDB CDC integration"
           link: https://cratedb.com/docs/cloud/en/latest/_sources/cluster/integrations/mongo-cdc.md.txt
           description: |
             CrateDB Cloud enables continuous data ingestion from MongoDB using Change Data Capture (CDC),
             providing seamless, real-time synchronization of your data.
           parents: [ cloud ]
+          markdown_enabled: false
 
         # Features
         - title: "CrateDB features"
@@ -691,13 +721,15 @@ data:
             - How to provide content from Jupyter Notebooks?
             - What other content to feed about the timeseries topic?
           source: examples
+          markdown_enabled: false
         - title: "Timeseries QA Assistant with CrateDB, LLMs, and Machine Manuals"
           link: https://raw.githubusercontent.com/crate/cratedb-examples/refs/heads/main/topic/chatbot/table-augmented-generation/app/README.md
           description: |
             A full interactive pipeline for simulating telemetry data from industrial motors,
             storing that data in CrateDB, and enabling natural-language querying powered by
             OpenAI — including RAG-style guidance from machine manuals.
           source: examples
+          markdown_enabled: false
 
         # Generative AI
         - title: "LangChain and CrateDB"
@@ -710,7 +742,9 @@ data:
           link: https://raw.githubusercontent.com/crate/about/refs/heads/main/src/content/blog/shared-nothing-architecture-multi-model-databases-scalable-real-time-analytics.md
           description: Leveraging Shared Nothing Architecture and Multi-Model Databases for Scalable Real-Time Analytics on Large Data.
           source: blog
+          markdown_enabled: false
         - title: "Use case: Digital Twins"
           link: https://raw.githubusercontent.com/crate/about/refs/heads/main/src/content/blog/digital-twins.md
           description: Digital twins are virtual representations of physical objects, processes, or systems in the digital realm. The abundance of data to be processed in digital twin setups is no problem for CrateDB.
           source: blog
+          markdown_enabled: false

src/cratedb_about/outline/model.py

Lines changed: 3 additions & 0 deletions
@@ -25,6 +25,7 @@ class OutlineItem(DictTools):
     title: str
     link: str
     description: str
+    markdown_enabled: bool = True
 
     def __attrs_post_init__(self):
         # FIXME: Currently, `llms_txt` does not accept newlines in description fields.
@@ -76,6 +77,8 @@ def to_markdown(self) -> str:
         for section in self.data.sections:
             buffer.write(f"## {section.name}\n\n")
             for item in section.items:
+                if not item.markdown_enabled:
+                    continue
                 buffer.write(f"- [{item.title}]({item.link}): {item.description}\n")
             buffer.write("\n")
         return buffer.getvalue().strip()
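
The effect of the new `markdown_enabled` flag can be illustrated with a minimal, self-contained sketch. It uses stdlib `dataclasses` as a stand-in for the project's model classes; the `Item` and `Section` classes and the free function `to_markdown` are simplified assumptions for illustration, not the actual `cratedb_about` API.

```python
from dataclasses import dataclass, field
from io import StringIO
from typing import List


@dataclass
class Item:
    # Simplified stand-in for `OutlineItem`; only the fields used here.
    title: str
    link: str
    description: str
    markdown_enabled: bool = True  # Same default as in the hunk above.


@dataclass
class Section:
    name: str
    items: List[Item] = field(default_factory=list)


def to_markdown(sections: List[Section]) -> str:
    """Render sections as Markdown, skipping items with `markdown_enabled=False`."""
    buffer = StringIO()
    for section in sections:
        buffer.write(f"## {section.name}\n\n")
        for item in section.items:
            if not item.markdown_enabled:
                continue
            buffer.write(f"- [{item.title}]({item.link}): {item.description}\n")
        buffer.write("\n")
    return buffer.getvalue().strip()


sections = [
    Section(
        name="Docs",
        items=[
            Item("A", "https://example.org/a", "Kept."),
            Item("B", "https://example.org/b", "Skipped.", markdown_enabled=False),
        ],
    )
]
print(to_markdown(sections))
```

Because the section heading is written before the item loop, a section whose items are all disabled still emits its `##` heading in this sketch, which matches the control flow of the hunk above.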

tests/test_outline.py

Lines changed: 5 additions & 5 deletions
@@ -178,17 +178,17 @@ def test_outline_section_all_items(cratedb_outline_builtin):
 
 
 def test_outline_find_items_dict(cratedb_outline_builtin):
-    items = cratedb_outline_builtin.find_items(title="gtfs").to_dict()
-    assert "Capture GTFS and GTFS-RT data" in items[0]["description"]
+    items = cratedb_outline_builtin.find_items(title="toolkit").to_dict()
+    assert "load curated datasets" in items[0]["description"]
 
 
 def test_outline_find_items_objects(cratedb_outline_builtin):
-    items = cratedb_outline_builtin.find_items(title="gtfs")
-    assert "Capture GTFS and GTFS-RT data" in items[0].description
+    items = cratedb_outline_builtin.find_items(title="toolkit")
+    assert "load curated datasets" in items[0].description
 
 
 def test_outline_find_items_not_found_in_section(cratedb_outline_builtin):
-    items = cratedb_outline_builtin.find_items(title="gtfs", section_name="Docs")
+    items = cratedb_outline_builtin.find_items(title="toolkit", section_name="Docs")
     assert items == []
 
