Skip to content

gengzll/llm_prompt_cold_start

Repository files navigation

llm_prompt_cold_start

Turn a document corpus into a baseline system prompt for a document-grounded LLM / RAG application — even when you start cold, with no labeled Q&A and no hand-written prompt.

You give it documents (PDF / TXT / MD / DOCX). Optionally you add a few sample questions and/or some domain knowledge. It gives you back a ready-to-use system prompt containing the role, domain knowledge, answer/format constraints, and a per-query-type playbook — all grounded in what is actually in your documents, not in what a user guessed.

documents (+ optional questions, + optional domain knowledge)
        │
        ▼
   system prompt

Why

Most teams adopting RAG have lots of documents but no gold Q&A and no good starting prompt. They also often describe their own domain imprecisely. This tool extracts the domain signal directly from the corpus so the baseline prompt does not depend on a user's (possibly inaccurate) description.

The result is a baseline: not guaranteed optimal, but immediately usable and grounded.

How it works

A small, explicit pipeline. Each stage is plain and inspectable.

Stage Module What it does Libraries
1. Parse parsing.py Read pdf/txt/md/docx → text + detected section headings pymupdf (PDF only, lazy)
2. Analyze analysis.py Extract corpus evidence: keyphrases (TF/doc-freq), metric/unit patterns, doc-type, entity signals stdlib only
3. Synthesize synthesis.py Turn evidence (+ optional user input) into a Domain Pack LLM, or deterministic fallback
4. Query types synthesis.py Infer reusable query types from questions or from the corpus LLM, or deterministic fallback
5. Verify synthesis.py Count corpus support per concept; score confidence stdlib only
6. Build prompt_builder.py Assemble the final system prompt stdlib only

Key design choices:

  • Corpus-grounded, not user-dictated. User-provided domain knowledge is treated as an optional signal, layered on top of evidence mined from the documents — it does not drive the content on its own.
  • The LLM synthesizes, it does not invent. The synthesis prompt is constrained to the extracted evidence and forbidden from fabricating values, names, or claims. System and user roles are separated so the stable instruction/schema can be provider-cached.
  • Works offline with zero dependencies. Without an API key (or with --offline), a deterministic fallback maps the mined evidence straight into the prompt. Great for testing, air-gapped use, and reproducibility.

Install

No packaging/build step needed. Just clone and install the dependencies, then run the CLI directly from the folder:

git clone https://github.com/gengzll/llm_prompt_cold_start.git
cd llm_prompt_cold_start
pip install -r requirements.txt

The offline path needs none of these — pure standard library (PDF input still needs pymupdf).

Prefer an installed console command? pip install -e . is still supported and gives you the cold-start-prompt entry point — but it is optional.

Configure

cp .env.example .env
# set OPENAI_API_KEY, optionally OPENAI_BASE_URL (DeepSeek/Together/Ollama/...), COLD_START_MODEL

Use

CLI (run directly, no install)

From the project folder, use either the launcher script or the module form:

# 1) Simplest: just point at documents (offline, no API key)
python cold_start.py ./examples/sample_docs --offline -o prompt.md

# 2) Demo with sample questions + domain knowledge (runnable as-is, offline)
python cold_start.py ./examples/sample_docs \
    --questions examples/questions.txt \
    --domain-knowledge examples/domain_knowledge.txt \
    --offline -o prompt.md --json result.json

# 3) Online synthesis (uses your OPENAI_API_KEY): drop --offline
python cold_start.py ./docs \
    --questions examples/questions.txt \
    --domain-knowledge examples/domain_knowledge.txt \
    -o prompt.md --json result.json

Both --questions and --domain-knowledge take a plain text file with one item per line; both are optional. The equivalent module form is python -m llm_prompt_cold_start.cli ..., and after pip install -e . it is cold-start-prompt ....

Python

Run your script from the project folder (so llm_prompt_cold_start/ is importable), or add the project root to sys.path — see examples/run_example.py.

from llm_prompt_cold_start import generate_system_prompt

prompt = generate_system_prompt(
    ["./docs"],
    questions=["What is the 2030 target?"],     # optional
    domain_knowledge=["Answer only from the documents."],  # optional
)
print(prompt)

For full artifacts (domain pack, query types, corpus profile, confidence):

from llm_prompt_cold_start import ColdStartPipeline, Settings

settings = Settings.load()
settings.offline = True
result = ColdStartPipeline(settings).run(["./docs"])
print(result.confidence, [qt.name for qt in result.query_types])

Try it now

# Python demo (questions + domain knowledge passed inline), offline
python examples/run_example.py

# CLI demo (questions + domain knowledge read from files), offline
python cold_start.py ./examples/sample_docs \
    --questions examples/questions.txt \
    --domain-knowledge examples/domain_knowledge.txt \
    --offline -o prompt.md

Bundled example inputs: examples/questions.txt and examples/domain_knowledge.txt.

Output shape

The generated prompt has these sections:

# ROLE              role + domain specialization
# CONTEXT           what the corpus is / business context
# DOMAIN KNOWLEDGE  compact vocabulary (with aliases), quantities, where answers live
# ANSWER POLICY     grounding + citation + abstention + format constraints
# QUERY-TYPE PLAYBOOK   per query type: what to cover, how to answer, retrieval focus, risks
# TASK              {context} / {question} template to fill at runtime

Status & roadmap

This is a clean first version. Deliberately minimal. Natural next steps:

  • Better keyphrase extraction (yake / KeyBERT) and table-aware PDF parsing.
  • Cross-document entity normalization and a co-occurrence graph (GraphRAG-style).
  • Few-shot example selection from the corpus.
  • Optional self-consistency on the synthesis step and richer confidence calibration.

Tests

python -m pytest tests/ -q

About

The repo for llm rag system prompt cold start.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages