This file is the build contract for an AI agent working in this repo.
Goal:
- build a local-first Discord guild crawler
- mirror all guild data the configured bot can access
- store it in SQLite
- support fast text search, semantic search, and raw SQL
- support one-shot backfill and long-running live sync
This spec is intentionally detailed so an agent can keep shipping without re-asking foundational questions.
discrawl is a Go CLI that mirrors Discord guild data into local SQLite.
V1 scope:
- one guild at a time
- all accessible text channels
- all accessible announcement channels
- all accessible forum channels and their posts
- all accessible public threads
- all accessible private threads
- archived thread coverage
- full message history
- current member snapshot
- FTS5 search
- optional OpenAI embeddings with local vector search
- raw SQL access
Out of scope for V1:
- personal-account DMs
- reactions as primary indexed entities
- attachment blob downloads by default
- cross-guild unified sync UX
- write-back or moderation actions
These are settled unless the user explicitly changes them:
- config format: TOML
- config location: ~/.discrawl/config.toml
- DB location: ~/.discrawl/discrawl.db
- cache dir: ~/.discrawl/cache/
- log dir: ~/.discrawl/logs/
- token source: reuse Molty / existing OpenClaw Discord bot config
- guild model: one guild in CLI UX, multi-guild-ready schema
- search: hybrid, with FTS first and embeddings optional
- embedding provider: OpenAI
- API key source: OPENAI_API_KEY from shell env
- message retention: current canonical row + append-only event log
- member retention: current snapshot only
- files: metadata only in DB, fetch binaries later on demand
- reactions: not important for V1
- polls: flatten into text during normalization
An agent should assume:
- repo path: ~/Projects/discrawl
- shell: zsh
- Go is installed and modern
- user is Peter
- user keeps many secrets in ~/.profile
- an existing OpenClaw install may already contain usable Discord bot config
- ~/.discrawl/config.toml
- ~/.discrawl/discrawl.db
- ~/.profile
- ~/.openclaw/openclaw.json
- ~/.openclaw/openclaw.json.bak*
The current bot token source is expected in:
~/.openclaw/openclaw.json
Expected path inside JSON:
channels.discord.token
Expected guild selection path:
channels.discord.guilds
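The lookup described above can be sketched in Go as follows. This is a minimal sketch, not the settled implementation: the type name `openClawConfig` and the function `parseOpenClaw` are hypothetical, and because this spec does not define the shape of `channels.discord.guilds`, the sketch keeps that field as raw JSON rather than guessing a type.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// openClawConfig (hypothetical name) models only the two paths this spec
// names. All other fields in the file are ignored by encoding/json.
type openClawConfig struct {
	Channels struct {
		Discord struct {
			Token string `json:"token"`
			// Shape not defined in this spec, so it stays raw JSON.
			Guilds json.RawMessage `json:"guilds"`
		} `json:"discord"`
	} `json:"channels"`
}

// parseOpenClaw extracts channels.discord.token and the raw
// channels.discord.guilds value from the file contents.
func parseOpenClaw(data []byte) (string, json.RawMessage, error) {
	var cfg openClawConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return "", nil, fmt.Errorf("parse openclaw config: %w", err)
	}
	if cfg.Channels.Discord.Token == "" {
		return "", nil, errors.New("channels.discord.token is missing or empty")
	}
	return cfg.Channels.Discord.Token, cfg.Channels.Discord.Guilds, nil
}

func main() {
	// Inline sample standing in for ~/.openclaw/openclaw.json.
	sample := []byte(`{"channels":{"discord":{"token":"example-token","guilds":{}}}}`)
	_, guilds, err := parseOpenClaw(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Never print the token itself; report only that it was found.
	fmt.Printf("token found (redacted); guilds field is %d bytes of JSON\n", len(guilds))
}
```

In the real tool the bytes would come from reading the path passed to `--from-openclaw`; the sample above only avoids touching the filesystem.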
The current intended default mode is:
discrawl init --from-openclaw ~/.openclaw/openclaw.json
Do not store raw API keys in repo files.
Expected source:
- env var
OPENAI_API_KEY
Typical place to discover it locally:
~/.profile
The code should read the env var at runtime, not copy the value into config by default.
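A minimal sketch of the runtime lookup, assuming the variable name arrives from config (`api_key_env`). The helper names `apiKeyFromEnv` and `redact` are hypothetical and only illustrate the rule that the value is read at use time and never echoed in full.

```go
package main

import (
	"fmt"
	"os"
)

// apiKeyFromEnv reads the key from the named environment variable at call
// time. Nothing is written back to config or disk.
func apiKeyFromEnv(envName string) (string, error) {
	key, ok := os.LookupEnv(envName)
	if !ok || key == "" {
		return "", fmt.Errorf("environment variable %s is not set", envName)
	}
	return key, nil
}

// redact keeps at most the last four characters so logs and doctor output
// never contain the full secret.
func redact(secret string) string {
	if len(secret) <= 4 {
		return "****"
	}
	return "****" + secret[len(secret)-4:]
}

func main() {
	key, err := apiKeyFromEnv("OPENAI_API_KEY")
	if err != nil {
		fmt.Println("embeddings unavailable:", err)
		return
	}
	fmt.Println("OpenAI key loaded:", redact(key))
}
```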
Important Discord facts that drive the schema:
- channels and threads are closely related; threads should be stored as channels
- forum posts are threads under a forum parent
- message history is paginated and must be backfilled incrementally
- live updates come from Gateway events, not from polling alone
- archived public and private threads must be enumerated explicitly
- private archived thread access may require elevated bot perms like Manage Threads
- guild
- categories
- channels
- threads
- members
- messages
- message lifecycle events
- category
- text
- announcement
- forum
- thread public
- thread private
- thread announcement
Voice channels can be mirrored as metadata rows, but there is no need to crawl message history because there is none.
Use SQLite.
Requirements:
- WAL mode
- foreign keys on
- FTS5 enabled
- vector extension optional
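The first two requirements map to SQLite pragmas. The sketch below is one way to apply them and is deliberately driver-agnostic, since the SQLite driver is not yet chosen: `applyPragmas` (a hypothetical helper) takes any function that executes a statement. Note that `foreign_keys` is a per-connection setting, so with a connection pool it has to be applied to every new connection, whereas WAL mode persists in the database file.

```go
package main

import "fmt"

// connectionPragmas lists the statements to run when a connection opens.
var connectionPragmas = []string{
	"PRAGMA journal_mode = WAL;",
	"PRAGMA foreign_keys = ON;",
}

// applyPragmas runs each pragma through exec and stops at the first failure.
// exec is an assumption standing in for whatever the chosen driver exposes,
// for example a closure around (*sql.DB).Exec.
func applyPragmas(exec func(stmt string) error) error {
	for _, p := range connectionPragmas {
		if err := exec(p); err != nil {
			return fmt.Errorf("%s: %w", p, err)
		}
	}
	return nil
}

func main() {
	// Demo executor that only prints what would be run.
	err := applyPragmas(func(stmt string) error {
		fmt.Println("would run:", stmt)
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```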
At minimum:
- guilds
- channels
- members
- messages
- message_events
- sync_state
- embedding_jobs
- message_fts
Optional once vectors are wired:
- message_embeddings
Suggested shape:
create table guilds (
id text primary key,
name text not null,
icon text,
raw_json text not null,
updated_at text not null
);
Threads should live in the same table.
Suggested shape:
create table channels (
id text primary key,
guild_id text not null,
parent_id text,
kind text not null,
name text not null,
topic text,
position integer,
is_nsfw integer not null default 0,
is_archived integer not null default 0,
is_locked integer not null default 0,
is_private_thread integer not null default 0,
thread_parent_id text,
archive_timestamp text,
raw_json text not null,
updated_at text not null
);
Suggested shape:
create table members (
guild_id text not null,
user_id text not null,
username text not null,
global_name text,
display_name text,
nick text,
discriminator text,
avatar text,
bot integer not null default 0,
joined_at text,
role_ids_json text not null,
raw_json text not null,
updated_at text not null,
primary key (guild_id, user_id)
);
Suggested shape:
create table messages (
id text primary key,
guild_id text not null,
channel_id text not null,
author_id text,
message_type integer not null,
created_at text not null,
edited_at text,
deleted_at text,
content text not null,
normalized_content text not null,
reply_to_message_id text,
pinned integer not null default 0,
has_attachments integer not null default 0,
raw_json text not null,
updated_at text not null
);
Suggested shape:
create table message_events (
event_id integer primary key autoincrement,
guild_id text not null,
channel_id text not null,
message_id text not null,
event_type text not null,
event_at text not null,
payload_json text not null
);
Suggested shape:
create table sync_state (
scope text primary key,
cursor text,
updated_at text not null
);
Examples of scope:
- guild:<guild_id>:members
- channel:<channel_id>:messages
- tail:<guild_id>
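Building these scope keys in one place avoids typos between the writer and the reader of a checkpoint. A small sketch, with hypothetical helper names:

```go
package main

import "fmt"

// membersScope is the checkpoint key for a guild's member snapshot.
func membersScope(guildID string) string {
	return "guild:" + guildID + ":members"
}

// channelMessagesScope is the checkpoint key for one channel's (or thread's)
// message backfill.
func channelMessagesScope(channelID string) string {
	return "channel:" + channelID + ":messages"
}

// tailScope is the checkpoint key for live Gateway sync of a guild.
func tailScope(guildID string) string {
	return "tail:" + guildID
}

func main() {
	// The IDs below are placeholders, not real Discord IDs.
	fmt.Println(membersScope("123"))
	fmt.Println(channelMessagesScope("456"))
	fmt.Println(tailScope("123"))
}
```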
Suggested shape:
create table embedding_jobs (
message_id text primary key,
state text not null,
attempts integer not null default 0,
updated_at text not null
);
Recommended pattern:
- content table = messages
- FTS virtual table = message_fts
- keep it updated explicitly, not by fragile magic
Suggested columns:
- message_id
- guild_id
- channel_id
- author_id
- author_name
- channel_name
- content
Support three modes:
- fts
- semantic
- hybrid
Default:
- hybrid when embeddings are enabled
- fts otherwise
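The mode selection above can be sketched as a small resolver. The names `Mode` and `resolveMode` are hypothetical. The sketch only picks the default and rejects unknown values; how `semantic` or `hybrid` should behave when vectors are absent is left to the search package.

```go
package main

import "fmt"

// Mode is a search mode as accepted by --mode.
type Mode string

const (
	ModeFTS      Mode = "fts"
	ModeSemantic Mode = "semantic"
	ModeHybrid   Mode = "hybrid"
)

// resolveMode returns the requested mode, or the default when none was
// requested: hybrid if embeddings are enabled, fts otherwise.
func resolveMode(requested string, embeddingsEnabled bool) (Mode, error) {
	switch Mode(requested) {
	case "":
		if embeddingsEnabled {
			return ModeHybrid, nil
		}
		return ModeFTS, nil
	case ModeFTS, ModeSemantic, ModeHybrid:
		return Mode(requested), nil
	default:
		return "", fmt.Errorf("unknown search mode %q (want fts, semantic, or hybrid)", requested)
	}
}

func main() {
	for _, enabled := range []bool{true, false} {
		mode, err := resolveMode("", enabled)
		fmt.Println(enabled, mode, err)
	}
}
```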
FTS is mandatory.
It should be good enough that the tool is useful before embeddings exist.
Expected use cases:
- exact terms
- commands
- stack traces
- URLs
- model names
- channel names
- user names
Embeddings are optional but planned from day one.
Recommended provider:
- OpenAI text-embedding-3-small
Implementation guidance:
- batch embedding jobs
- keep embedding generation out of the hot sync path
- store vectors locally
- semantic search should degrade gracefully when vectors are absent
Prefer SQLite-local vector search so the whole product stays portable.
Recommended direction:
- sqlite-vec
This can be wired after the base crawler and FTS system work.
Design goals:
- simple for humans
- composable for scripts
- obvious nouns and verbs
- no secrets in flags
Usage:
discrawl [global flags] <command> [args]
- -h, --help
- --version
- --config <path>
- --json
- --plain
- -q, --quiet
- -v, --verbose
- --no-color
- init
- sync
- tail
- search
- sql
- members
- channels
- status
- doctor
Purpose:
- create ~/.discrawl/config.toml
- import defaults from OpenClaw
- persist guild id and DB path
Expected flags:
- --from-openclaw <path>
- --guild <id>
- --db <path>
- --with-embeddings
Purpose:
- one-shot crawl
Expected flags:
- --full
- --since <timestamp>
- --concurrency <n>
- --with-embeddings
Requirements:
- idempotent
- restart-safe
- shows progress on stderr
Purpose:
- live sync from Gateway
Expected flags:
- --repair-every <duration>
- --with-embeddings
Requirements:
- reconnect automatically
- write checkpoints
- periodic repair sync
Purpose:
- query mirrored messages
Expected flags:
- --mode fts|semantic|hybrid
- --channel <name-or-id>
- --author <name-or-id>
- --limit <n>
- --json
- --plain
Purpose:
- run read-only SQL
Requirements:
- support query arg or stdin
- block non-read-only statements by default
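One possible first line of defense for the read-only rule is a conservative statement check, sketched below. `isReadOnlySQL` is a hypothetical helper and a heuristic, not a SQL parser: it rejects anything with more than one statement, allows only statements that start with SELECT or EXPLAIN, and allows WITH only when no write keyword appears anywhere after it. It will reject some harmless queries (for example a query that starts with a comment, or a WITH query that calls the `replace()` function). A stronger guarantee would come from also opening the database in read-only mode, if the chosen driver supports that.

```go
package main

import (
	"fmt"
	"strings"
)

// writeKeywords are keywords that can follow a WITH clause and modify data.
var writeKeywords = map[string]bool{
	"insert": true, "update": true, "delete": true, "replace": true,
}

// isReadOnlySQL reports whether query looks like a single read-only statement.
// It errs on the side of rejecting.
func isReadOnlySQL(query string) bool {
	q := strings.TrimRight(strings.TrimSpace(query), "; \t\r\n")
	// Empty input, or a semicolon left in the middle (possibly a second
	// statement), is rejected.
	if q == "" || strings.Contains(q, ";") {
		return false
	}
	// Split into lowercase word tokens (letters, digits, underscore).
	words := strings.FieldsFunc(strings.ToLower(q), func(r rune) bool {
		isWordChar := r == '_' || (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9')
		return !isWordChar
	})
	if len(words) == 0 {
		return false
	}
	switch words[0] {
	case "select", "explain":
		return true
	case "with":
		// A CTE can prefix a write statement, so scan for write keywords.
		for _, w := range words[1:] {
			if writeKeywords[w] {
				return false
			}
		}
		return true
	default:
		return false
	}
}

func main() {
	for _, q := range []string{
		"SELECT count(*) FROM messages;",
		"DELETE FROM messages",
		"SELECT 1; DROP TABLE messages",
	} {
		fmt.Printf("%-40q read-only=%v\n", q, isReadOnlySQL(q))
	}
}
```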
Subcommands:
- list
- show <user-id>
- search <query>
Subcommands:
- list
- show <channel-id>
Must show:
- guild id
- guild name if known
- db path
- total channels
- total threads
- total messages
- total members
- last sync time
- last tail event time
- embedding backlog
Must check:
- config file readable
- OpenClaw token source readable
- Discord auth valid
- guild reachable
- DB openable
- FTS present
- vector extension present if configured
Format:
- TOML
Location:
~/.discrawl/config.toml
Suggested shape:
version = 1
guild_id = "1456350064065904867"
db_path = "~/.discrawl/discrawl.db"
cache_dir = "~/.discrawl/cache"
log_dir = "~/.discrawl/logs"
[discord]
token_source = "openclaw"
openclaw_config = "~/.openclaw/openclaw.json"
channel_account = "discord"
[sync]
concurrency = 4
repair_every = "6h"
full_history = true
[search]
default_mode = "hybrid"
[search.embeddings]
enabled = true
provider = "openai"
model = "text-embedding-3-small"
api_key_env = "OPENAI_API_KEY"
batch_size = 64
Config precedence:
- flags
- environment
- config file
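The precedence order can be expressed as "first non-empty value wins". A minimal sketch, assuming string-valued settings and hypothetical helper names; the demo resolves the config path from the `--config` flag, then `DISCRAWL_CONFIG`, then the default location.

```go
package main

import (
	"fmt"
	"os"
)

// firstNonEmpty returns the first non-empty value, or "" if all are empty.
func firstNonEmpty(values ...string) string {
	for _, v := range values {
		if v != "" {
			return v
		}
	}
	return ""
}

// resolveSetting applies the precedence order: flag, then environment, then
// config file (or built-in default).
func resolveSetting(flagValue, envValue, fileValue string) string {
	return firstNonEmpty(flagValue, envValue, fileValue)
}

func main() {
	flagValue := "" // would come from --config when the user passes it
	path := resolveSetting(flagValue, os.Getenv("DISCRAWL_CONFIG"), "~/.discrawl/config.toml")
	fmt.Println("config path:", path)
}
```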
Environment variables:
- DISCRAWL_CONFIG
- OPENAI_API_KEY
Do not:
- put bot tokens in git
- put API keys in git
- print secrets in normal logs
Do:
- load bot token from OpenClaw config path
- load OpenAI key from env
- redact secrets in debug and doctor output
- load config
- resolve token
- fetch bot identity
- fetch guild metadata
- fetch guild channels
- fetch active threads
- enumerate archived public threads per parent channel
- enumerate archived private threads per parent channel
- fetch member snapshot
- backfill messages for every crawlable channel and thread
- normalize message content
- upsert messages
- append message_events where relevant
- update FTS rows
- enqueue embedding jobs
- write checkpoints
Use REST pagination with before.
Rules:
- fetch newest page first for incremental runs
- fetch oldest via repeated before paging for full runs
- stop when no messages remain
- handle rate limits centrally
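The full-run paging loop can be sketched as below. Everything named here is hypothetical: `Message` is a stand-in type, `fetchPage` stands in for a wrapper around the REST message-history call (assumed to return messages newest first, older than the `before` ID), and `fakeHistory` exists only so the sketch runs without a network. Rate limiting is assumed to live inside the real fetch wrapper, in line with the rule about handling it centrally.

```go
package main

import "fmt"

// Message is a minimal stand-in for a Discord message.
type Message struct {
	ID      string
	Content string
}

// fetchPage returns up to limit messages older than before, newest first.
// An empty before means "start from the latest message".
type fetchPage func(before string, limit int) ([]Message, error)

// backfill pages backwards until no messages remain. It returns the ID of
// the oldest message handled so the caller can checkpoint it.
func backfill(fetch fetchPage, limit int, handle func([]Message) error) (string, error) {
	before := ""
	for {
		page, err := fetch(before, limit)
		if err != nil {
			return before, err
		}
		if len(page) == 0 {
			return before, nil // history exhausted
		}
		if err := handle(page); err != nil {
			return before, err
		}
		// The last element is the oldest message on a newest-first page.
		before = page[len(page)-1].ID
	}
}

// fakeHistory serves pages from an in-memory slice ordered newest first.
func fakeHistory(all []Message) fetchPage {
	return func(before string, limit int) ([]Message, error) {
		start := 0
		if before != "" {
			for i, m := range all {
				if m.ID == before {
					start = i + 1
				}
			}
		}
		end := start + limit
		if end > len(all) {
			end = len(all)
		}
		return all[start:end], nil
	}
}

func main() {
	history := []Message{{ID: "5"}, {ID: "4"}, {ID: "3"}, {ID: "2"}, {ID: "1"}}
	cursor, err := backfill(fakeHistory(history), 2, func(page []Message) error {
		fmt.Println("page of", len(page), "messages")
		return nil
	})
	fmt.Println("oldest seen:", cursor, "err:", err)
}
```

Returning the cursor on error as well lets a restarted run resume from the last completed page instead of starting over.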
Use Gateway events for:
- new messages
- edited messages
- deleted messages
- channel updates
- thread updates
- member updates
Tail should:
- upsert live state
- append lifecycle events
- keep retrying on disconnect
- periodically run repair sync
normalized_content should flatten Discord payloads into searchable text.
Include:
- message content
- embed titles and descriptions where helpful
- poll question and answers
- attachment filenames
- referenced message hints if available
Do not overcomplicate:
- reactions can be ignored
- attachment binary contents are not indexed in V1
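A sketch of the flattening step under these rules. The types `RawMessage`, `Embed`, and `Poll` are simplified, hypothetical stand-ins for the real Discord payload; joining the parts with newlines is one reasonable choice, not a requirement of this spec. Referenced message hints are omitted here for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// Embed and Poll carry only the fields normalization needs.
type Embed struct {
	Title       string
	Description string
}

type Poll struct {
	Question string
	Answers  []string
}

// RawMessage is a simplified view of a Discord message payload.
type RawMessage struct {
	Content         string
	Embeds          []Embed
	Poll            *Poll
	AttachmentNames []string
}

// normalize flattens the searchable parts of a message into one string,
// skipping empty pieces. Reactions and attachment contents are ignored.
func normalize(m RawMessage) string {
	var parts []string
	add := func(s string) {
		if s = strings.TrimSpace(s); s != "" {
			parts = append(parts, s)
		}
	}
	add(m.Content)
	for _, e := range m.Embeds {
		add(e.Title)
		add(e.Description)
	}
	if m.Poll != nil {
		add(m.Poll.Question)
		for _, a := range m.Poll.Answers {
			add(a)
		}
	}
	for _, name := range m.AttachmentNames {
		add(name)
	}
	return strings.Join(parts, "\n")
}

func main() {
	msg := RawMessage{
		Content:         "build is failing again",
		Poll:            &Poll{Question: "Roll back?", Answers: []string{"yes", "no"}},
		AttachmentNames: []string{"trace.log"},
	}
	fmt.Println(normalize(msg))
}
```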
Members matter for AI workflows.
Expected use cases:
- “who is this user”
- “find messages by this person”
- “find maintainers”
- “find everyone with a display name containing X”
At minimum, store:
- user id
- username
- display name
- nick
- roles
- bot flag
cmd/discrawl/
internal/cli/
internal/config/
internal/discord/
internal/store/
internal/search/
internal/syncer/
internal/embed/
Responsibilities:
- internal/cli: command wiring, output modes
- internal/config: parse and validate config
- internal/discord: REST + Gateway client wrappers
- internal/store: SQLite schema, migrations, queries
- internal/search: FTS and result ranking
- internal/syncer: full sync and repair orchestration
- internal/embed: embedding queue and provider integration
Reasonable picks:
- Discord client: github.com/bwmarrin/discordgo
- TOML parser: something small and maintained
- SQLite driver: pick one path and stay consistent
- vector search: sqlite-vec
Guidance:
- keep dependency count low
- prefer boring stable libraries
- avoid frameworks
- config loader
- init
- status
- DB open + migrations
- guild metadata sync
- channel sync
- member sync
- full message backfill
- incremental checkpoints
- FTS indexing
- search
- sql
- members
- channels
- tail
- reconnect logic
- repair loop
- embedding queue
- vector search
- hybrid ranking
For an AI agent to finish the product without external memory, this repo should contain:
- this spec
- README with user-facing overview
- schema and migration files
- config sample
- CLI contract
- implementation package layout
- token discovery rules
- API key discovery rules
- milestone order
This file is the authoritative engineering spec for now.