discrawl Spec

This file is the build contract for an AI agent working in this repo.

Goal:

build a local-first Discord guild crawler
mirror all guild data the configured bot can access
store it in SQLite
support fast text search, semantic search, and raw SQL
support one-shot backfill and long-running live sync

This spec is intentionally detailed so an agent can keep shipping without re-asking foundational questions.

Product Summary

discrawl is a Go CLI that mirrors Discord guild data into local SQLite.

V1 scope:

one guild at a time
all accessible text channels
all accessible announcement channels
all accessible forum channels and their posts
all accessible public threads
all accessible private threads
archived thread coverage
full message history
current member snapshot
FTS5 search
optional OpenAI embeddings with local vector search
raw SQL access

Out of scope for V1:

personal-account DMs
reactions as primary indexed entities
attachment blob downloads by default
cross-guild unified sync UX
write-back or moderation actions

Requirements Already Chosen

These are settled unless the user explicitly changes them:

config format: TOML
config location: ~/.discrawl/config.toml
DB location: ~/.discrawl/discrawl.db
cache dir: ~/.discrawl/cache/
log dir: ~/.discrawl/logs/
token source: reuse Molty / existing OpenClaw Discord bot config
guild model: one guild in CLI UX, multi-guild-ready schema
search: hybrid, with FTS first and embeddings optional
embedding provider: OpenAI
API key source: OPENAI_API_KEY from shell env
message retention: current canonical row + append-only event log
member retention: current snapshot only
files: metadata only in DB, fetch binaries later on demand
reactions: not important for V1
polls: flatten into text during normalization

Local Environment Contract

An agent should assume:

repo path: ~/Projects/discrawl
shell: zsh
Go is installed and modern
user is Peter
user keeps many secrets in ~/.profile
an existing OpenClaw install may already contain usable Discord bot config

Key file paths

~/.discrawl/config.toml
~/.discrawl/discrawl.db
~/.profile
~/.openclaw/openclaw.json
~/.openclaw/openclaw.json.bak*

Existing bot config

The current bot token source is expected in:

~/.openclaw/openclaw.json

Expected path inside JSON:

channels.discord.token

Expected guild selection path:

channels.discord.guilds

The intended default setup command is:

discrawl init --from-openclaw ~/.openclaw/openclaw.json
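
A minimal sketch of reading the two JSON paths named above. The struct and function names (`openClawConfig`, `parseOpenClaw`) are hypothetical, and the shape of `channels.discord.guilds` is not specified here, so it is kept as raw JSON rather than guessed at.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// openClawConfig models only the two paths this spec names.
// Everything else in the OpenClaw file is ignored.
type openClawConfig struct {
	Channels struct {
		Discord struct {
			Token string `json:"token"`
			// Shape is unspecified in this spec, so keep it undecoded.
			Guilds json.RawMessage `json:"guilds"`
		} `json:"discord"`
	} `json:"channels"`
}

// parseOpenClaw extracts channels.discord.token and the raw
// channels.discord.guilds value from the config bytes.
func parseOpenClaw(data []byte) (string, json.RawMessage, error) {
	var cfg openClawConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return "", nil, fmt.Errorf("parse openclaw config: %w", err)
	}
	if cfg.Channels.Discord.Token == "" {
		return "", nil, errors.New("channels.discord.token is missing or empty")
	}
	return cfg.Channels.Discord.Token, cfg.Channels.Discord.Guilds, nil
}

func main() {
	// Placeholder document; a real run would read ~/.openclaw/openclaw.json.
	sample := []byte(`{"channels":{"discord":{"token":"example-token","guilds":{}}}}`)
	token, guilds, err := parseOpenClaw(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Print only the token length, never the token itself.
	fmt.Printf("token length=%d guilds=%s\n", len(token), guilds)
}
```

The demo prints only the token length, in line with the token handling rules later in this spec.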

OpenAI embeddings key

Do not store raw API keys in repo files.

Expected source:

env var OPENAI_API_KEY

Typical place to discover it locally:

~/.profile

The code should read the env var at runtime, not copy the value into config by default.
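
As a rough illustration of the runtime lookup, assuming a hypothetical helper named `apiKeyFromEnv`: the variable name comes from config (`api_key_env`), and only the value is read from the process environment.

```go
package main

import (
	"fmt"
	"os"
)

// apiKeyFromEnv reads the named variable at runtime. The value is
// returned to the caller and never written to config or logs.
func apiKeyFromEnv(name string) (string, error) {
	v, ok := os.LookupEnv(name)
	if !ok || v == "" {
		return "", fmt.Errorf("environment variable %s is not set", name)
	}
	return v, nil
}

func main() {
	if _, err := apiKeyFromEnv("OPENAI_API_KEY"); err != nil {
		// Embeddings are optional, so a missing key is not fatal.
		fmt.Println("embeddings unavailable:", err)
		return
	}
	fmt.Println("OPENAI_API_KEY found")
}
```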

Discord Data Model Notes

Important Discord facts that drive the schema:

channels and threads are closely related; threads should be stored as channels
forum posts are threads under a forum parent
message history is paginated and must be backfilled incrementally
live updates come from Gateway events, not from polling alone
archived public and private threads must be enumerated explicitly
private archived thread access may require elevated bot permissions such as Manage Threads

Entities to mirror

guild
categories
channels
threads
members
messages
message lifecycle events

Channel kinds worth preserving

category
text
announcement
forum
thread public
thread private
thread announcement

Voice channels can be mirrored as metadata rows; V1 does not crawl message history for them.

Database Design

Use SQLite.

Requirements:

WAL mode
foreign keys on
FTS5 enabled
vector extension optional

Tables

At minimum:

guilds
channels
members
messages
message_events
sync_state
embedding_jobs
message_fts

Optional once vectors are wired:

message_embeddings

`guilds`

Suggested shape:

create table guilds (
  id text primary key,
  name text not null,
  icon text,
  raw_json text not null,
  updated_at text not null
);

`channels`

Threads should live in the same table.

Suggested shape:

create table channels (
  id text primary key,
  guild_id text not null,
  parent_id text,
  kind text not null,
  name text not null,
  topic text,
  position integer,
  is_nsfw integer not null default 0,
  is_archived integer not null default 0,
  is_locked integer not null default 0,
  is_private_thread integer not null default 0,
  thread_parent_id text,
  archive_timestamp text,
  raw_json text not null,
  updated_at text not null
);

`members`

Suggested shape:

create table members (
  guild_id text not null,
  user_id text not null,
  username text not null,
  global_name text,
  display_name text,
  nick text,
  discriminator text,
  avatar text,
  bot integer not null default 0,
  joined_at text,
  role_ids_json text not null,
  raw_json text not null,
  updated_at text not null,
  primary key (guild_id, user_id)
);

`messages`

Suggested shape:

create table messages (
  id text primary key,
  guild_id text not null,
  channel_id text not null,
  author_id text,
  message_type integer not null,
  created_at text not null,
  edited_at text,
  deleted_at text,
  content text not null,
  normalized_content text not null,
  reply_to_message_id text,
  pinned integer not null default 0,
  has_attachments integer not null default 0,
  raw_json text not null,
  updated_at text not null
);

`message_events`

Suggested shape:

create table message_events (
  event_id integer primary key autoincrement,
  guild_id text not null,
  channel_id text not null,
  message_id text not null,
  event_type text not null,
  event_at text not null,
  payload_json text not null
);

`sync_state`

Suggested shape:

create table sync_state (
  scope text primary key,
  cursor text,
  updated_at text not null
);

Examples of scope:

guild:<guild_id>:members
channel:<channel_id>:messages
tail:<guild_id>
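
The scope keys above could be built by small helpers so the string format lives in one place. The function names are illustrative and not part of the contract.

```go
package main

import "fmt"

// membersScope is the sync_state key for a guild's member snapshot.
func membersScope(guildID string) string { return "guild:" + guildID + ":members" }

// messagesScope is the sync_state key for one channel's (or thread's)
// message backfill cursor.
func messagesScope(channelID string) string { return "channel:" + channelID + ":messages" }

// tailScope is the sync_state key for the live tail checkpoint.
func tailScope(guildID string) string { return "tail:" + guildID }

func main() {
	fmt.Println(membersScope("123"))
	fmt.Println(messagesScope("456"))
	fmt.Println(tailScope("123"))
}
```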

`embedding_jobs`

Suggested shape:

create table embedding_jobs (
  message_id text primary key,
  state text not null,
  attempts integer not null default 0,
  updated_at text not null
);

FTS

Recommended pattern:

content table = messages
FTS virtual table = message_fts
keep it updated explicitly from application code, not through implicit mechanisms that are easy to break

Suggested columns:

message_id
guild_id
channel_id
author_id
author_name
channel_name
content

Search Design

Modes

Support three modes:

fts
semantic
hybrid

Default:

hybrid when embeddings are enabled
fts otherwise

FTS behavior

FTS is mandatory.

It should be good enough that the tool is useful before embeddings exist.

Expected use cases:

exact terms
commands
stack traces
URLs
model names
channel names
user names

Semantic behavior

Embeddings are optional but planned from day one.

Recommended provider:

OpenAI text-embedding-3-small

Implementation guidance:

batch embedding jobs
keep embedding generation out of the hot sync path
store vectors locally
semantic search should degrade gracefully when vectors are absent
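
One way to combine the default-mode rule with graceful degradation is sketched below. Falling back to `fts` when vectors are unusable is an assumption about what "degrade gracefully" should mean, and `effectiveMode` is a hypothetical name.

```go
package main

import (
	"fmt"
)

const (
	ModeFTS      = "fts"
	ModeSemantic = "semantic"
	ModeHybrid   = "hybrid"
)

// effectiveMode resolves the mode to run. An empty request means
// "use the default". Vector-backed modes fall back to FTS when
// embeddings are disabled or no vectors are stored yet (assumed policy).
func effectiveMode(requested string, embeddingsEnabled, vectorsPresent bool) (string, error) {
	usable := embeddingsEnabled && vectorsPresent
	switch requested {
	case "":
		if usable {
			return ModeHybrid, nil
		}
		return ModeFTS, nil
	case ModeFTS:
		return ModeFTS, nil
	case ModeSemantic, ModeHybrid:
		if !usable {
			return ModeFTS, nil
		}
		return requested, nil
	default:
		return "", fmt.Errorf("unknown search mode %q", requested)
	}
}

func main() {
	mode, err := effectiveMode("semantic", true, false)
	fmt.Println(mode, err)
}
```

A real CLI would probably also tell the user on stderr when a fallback happened, so that results are not mistaken for semantic matches.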

Vector store choice

Prefer SQLite-local vector search so the whole product stays portable.

Recommended direction:

sqlite-vec

This can be wired in after the base crawler and FTS are working.

CLI Spec

Design goals:

simple for humans
composable for scripts
obvious nouns and verbs
no secrets in flags

Usage:

discrawl [global flags] <command> [args]

Global flags

-h, --help
--version
--config <path>
--json
--plain
-q, --quiet
-v, --verbose
--no-color

Commands

init
sync
tail
search
sql
members
channels
status
doctor

`init`

Purpose:

create ~/.discrawl/config.toml
import defaults from OpenClaw
persist guild id and DB path

Expected flags:

--from-openclaw <path>
--guild <id>
--db <path>
--with-embeddings

`sync`

Purpose:

one-shot crawl

Expected flags:

--full
--since <timestamp>
--concurrency <n>
--with-embeddings

Requirements:

idempotent
restart-safe
shows progress on stderr

`tail`

Purpose:

live sync from Gateway

Expected flags:

--repair-every <duration>
--with-embeddings

Requirements:

reconnect automatically
write checkpoints
periodic repair sync

`search`

Purpose:

query mirrored messages

Expected flags:

--mode fts|semantic|hybrid
--channel <name-or-id>
--author <name-or-id>
--limit <n>
--json
--plain

`sql`

Purpose:

run read-only SQL

Requirements:

support query arg or stdin
block non-read-only statements by default
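
A first-pass guard could inspect the leading keyword, as in the sketch below (`looksReadOnly` is a hypothetical name). This is only a heuristic: in SQLite a `WITH` clause can also introduce a write statement, so the check should be paired with a connection that cannot write, for example SQLite's `PRAGMA query_only` or a read-only open mode, depending on the chosen driver.

```go
package main

import (
	"fmt"
	"strings"
)

// looksReadOnly reports whether the statement starts with a keyword
// that normally introduces a read. It is deliberately conservative:
// leading comments and unknown keywords are rejected.
func looksReadOnly(query string) bool {
	q := strings.TrimSpace(query)
	// Take the leading run of ASCII letters as the keyword.
	i := 0
	for i < len(q) {
		c := q[i]
		if (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') {
			i++
			continue
		}
		break
	}
	switch strings.ToLower(q[:i]) {
	case "select", "with", "explain":
		return true
	}
	return false
}

func main() {
	for _, q := range []string{
		"SELECT count(*) FROM messages",
		"DELETE FROM messages",
	} {
		fmt.Printf("%-35s readOnly=%v\n", q, looksReadOnly(q))
	}
}
```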

`members`

Subcommands:

list
show <user-id>
search <query>

`channels`

Subcommands:

list
show <channel-id>

`status`

Must show:

guild id
guild name if known
db path
total channels
total threads
total messages
total members
last sync time
last tail event time
embedding backlog

`doctor`

Must check:

config file readable
OpenClaw token source readable
Discord auth valid
guild reachable
DB openable
FTS present
vector extension present if configured

Config Spec

Format:

TOML

Location:

~/.discrawl/config.toml

Suggested shape:

version = 1
guild_id = "1456350064065904867"
db_path = "~/.discrawl/discrawl.db"
cache_dir = "~/.discrawl/cache"
log_dir = "~/.discrawl/logs"

[discord]
token_source = "openclaw"
openclaw_config = "~/.openclaw/openclaw.json"
channel_account = "discord"

[sync]
concurrency = 4
repair_every = "6h"
full_history = true

[search]
default_mode = "hybrid"

[search.embeddings]
enabled = true
provider = "openai"
model = "text-embedding-3-small"
api_key_env = "OPENAI_API_KEY"
batch_size = 64

Config precedence, highest first:

flags
environment
config file

Environment variables:

DISCRAWL_CONFIG
OPENAI_API_KEY
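
Applied to the config path itself, the precedence could look like the sketch below: the `--config` flag wins, then `DISCRAWL_CONFIG`, then the default location. Treating the default path as the last layer is an interpretation of the list above, and `resolveConfigPath` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// resolveConfigPath picks the config file location.
// Order: explicit flag, then environment, then the default under home.
func resolveConfigPath(flagValue, envValue, homeDir string) string {
	if flagValue != "" {
		return flagValue
	}
	if envValue != "" {
		return envValue
	}
	return filepath.Join(homeDir, ".discrawl", "config.toml")
}

func main() {
	home, err := os.UserHomeDir()
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot resolve home directory:", err)
		os.Exit(1)
	}
	// No flag given in this demo, so env or the default applies.
	fmt.Println(resolveConfigPath("", os.Getenv("DISCRAWL_CONFIG"), home))
}
```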

Token Handling Rules

Do not:

put bot tokens in git
put API keys in git
print secrets in normal logs

Do:

load bot token from OpenClaw config path
load OpenAI key from env
redact secrets in debug and doctor output
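
One possible way to make redaction the default is a small wrapper type whose formatting methods never expose the value. The `Secret` type below is an illustrative sketch, not an existing package.

```go
package main

import "fmt"

// Secret wraps a sensitive string so that accidental formatting
// with fmt prints a placeholder instead of the value.
type Secret string

// String is used by %v and %s.
func (s Secret) String() string {
	if s == "" {
		return "(unset)"
	}
	return "[redacted]"
}

// GoString is used by %#v.
func (s Secret) GoString() string { return s.String() }

// Reveal returns the raw value for the one place that needs it,
// such as building an Authorization header.
func (s Secret) Reveal() string { return string(s) }

func main() {
	token := Secret("example-bot-token")
	fmt.Printf("token=%v\n", token)
	fmt.Printf("token=%#v\n", token)
	fmt.Println("raw length:", len(token.Reveal()))
}
```

Note that `encoding/json` would still serialize the underlying string, so values of this type should stay out of structs that get marshalled into logs or doctor output.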

Discord Sync Algorithm

Initial full sync

load config
resolve token
fetch bot identity
fetch guild metadata
fetch guild channels
fetch active threads
enumerate archived public threads per parent channel
enumerate archived private threads per parent channel
fetch member snapshot
backfill messages for every crawlable channel and thread
normalize message content
upsert messages
append message_events where relevant
update FTS rows
enqueue embedding jobs
write checkpoints

Message crawl strategy

Use REST pagination with the before cursor (a message ID).

Rules:

fetch newest page first for incremental runs
page back to the oldest message via repeated before requests for full runs
stop when no messages remain
handle rate limits centrally
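
The full-run rules above can be sketched as a loop over an abstract page fetcher. `PageFetcher`, `Message`, and `backfill` are hypothetical names; the sketch assumes each page arrives newest first, so the last element of a page is the oldest and becomes the next `before` cursor. The in-memory `fakeFetcher` exists only to make the example runnable.

```go
package main

import "fmt"

// Message is a stand-in for a Discord message; only the ID matters here.
type Message struct{ ID string }

// PageFetcher returns up to limit messages older than before
// (or the newest page when before is empty), newest first.
type PageFetcher func(channelID, before string, limit int) ([]Message, error)

// backfill walks a channel from newest to oldest and hands each
// page to handle, which is where upserts and checkpoints would go.
func backfill(fetch PageFetcher, channelID string, limit int, handle func([]Message) error) error {
	before := ""
	for {
		page, err := fetch(channelID, before, limit)
		if err != nil {
			return err
		}
		if len(page) == 0 {
			return nil // no messages remain
		}
		if err := handle(page); err != nil {
			return err
		}
		before = page[len(page)-1].ID // oldest message on this page
	}
}

// fakeFetcher serves pages from an in-memory list ordered newest first.
func fakeFetcher(ids []string) PageFetcher {
	return func(_ string, before string, limit int) ([]Message, error) {
		start := 0
		if before != "" {
			start = len(ids)
			for i, id := range ids {
				if id == before {
					start = i + 1
					break
				}
			}
		}
		end := start + limit
		if end > len(ids) {
			end = len(ids)
		}
		var page []Message
		for _, id := range ids[start:end] {
			page = append(page, Message{ID: id})
		}
		return page, nil
	}
}

func main() {
	fetch := fakeFetcher([]string{"5", "4", "3", "2", "1"})
	err := backfill(fetch, "channel-1", 2, func(page []Message) error {
		fmt.Println("page:", page)
		return nil
	})
	fmt.Println("done, err =", err)
}
```

Rate limiting is left out on purpose: per the rules above it belongs in one central place, which in this shape would be inside the real `PageFetcher`.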

Live tail strategy

Use Gateway events for:

new messages
edited messages
deleted messages
channel updates
thread updates
member updates

Tail should:

upsert live state
append lifecycle events
keep retrying on disconnect
periodically run repair sync
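
Retrying on disconnect usually wants a delay between attempts. The sketch below uses capped exponential backoff, which is a choice made for illustration and not something this spec mandates. `nextBackoff` and `runWithRetry` are hypothetical names, and the attempt limit exists only so the demo terminates; a real tail would loop until cancelled.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// nextBackoff doubles the delay and caps it at max.
// Non-positive results (including overflow) also return max.
func nextBackoff(cur, max time.Duration) time.Duration {
	next := cur * 2
	if next <= 0 || next > max {
		return max
	}
	return next
}

// runWithRetry calls connect until it succeeds or attempts run out,
// sleeping with growing delays in between. sleep is injected so the
// demo does not actually wait.
func runWithRetry(connect func() error, attempts int, initial, max time.Duration, sleep func(time.Duration)) error {
	delay := initial
	var err error
	for i := 0; i < attempts; i++ {
		if err = connect(); err == nil {
			return nil
		}
		sleep(delay)
		delay = nextBackoff(delay, max)
	}
	return err
}

func main() {
	calls := 0
	connect := func() error {
		calls++
		if calls < 3 {
			return errors.New("gateway disconnected")
		}
		return nil
	}
	sleep := func(d time.Duration) { fmt.Println("retrying in", d) }
	err := runWithRetry(connect, 5, time.Second, 30*time.Second, sleep)
	fmt.Println("connected:", err == nil, "after", calls, "attempts")
}
```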

Message Normalization

normalized_content should flatten Discord payloads into searchable text.

Include:

message content
embed titles and descriptions where helpful
poll question and answers
attachment filenames
referenced message hints if available

Do not overcomplicate:

reactions can be ignored
attachment binary contents are not indexed in V1
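
A minimal sketch of the flattening step, using simplified hypothetical types (`RawMessage`, `Embed`, `Poll`) in place of the real Discord payload structs. Referenced message hints are omitted to keep it short.

```go
package main

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for the Discord payload; field names are assumptions.
type Embed struct{ Title, Description string }

type Poll struct {
	Question string
	Answers  []string
}

type RawMessage struct {
	Content         string
	Embeds          []Embed
	Poll            *Poll
	AttachmentNames []string
}

// normalize flattens a message into newline-separated searchable text:
// content, embed titles and descriptions, poll text, attachment filenames.
func normalize(m RawMessage) string {
	var parts []string
	add := func(s string) {
		if s = strings.TrimSpace(s); s != "" {
			parts = append(parts, s)
		}
	}
	add(m.Content)
	for _, e := range m.Embeds {
		add(e.Title)
		add(e.Description)
	}
	if m.Poll != nil {
		add(m.Poll.Question)
		for _, a := range m.Poll.Answers {
			add(a)
		}
	}
	for _, name := range m.AttachmentNames {
		add(name)
	}
	return strings.Join(parts, "\n")
}

func main() {
	msg := RawMessage{
		Content:         "build failed again",
		Poll:            &Poll{Question: "Revert?", Answers: []string{"yes", "no"}},
		AttachmentNames: []string{"trace.log"},
	}
	fmt.Println(normalize(msg))
}
```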

Member Query Design

Members matter for AI workflows.

Expected use cases:

“who is this user”
“find messages by this person”
“find maintainers”
“find everyone with a display name containing X”

At minimum, store:

user id
username
display name
nick
roles
bot flag

Recommended Go Package Layout

cmd/discrawl/
internal/cli/
internal/config/
internal/discord/
internal/store/
internal/search/
internal/syncer/
internal/embed/

Responsibilities:

internal/cli: command wiring, output modes
internal/config: parse and validate config
internal/discord: REST + Gateway client wrappers
internal/store: SQLite schema, migrations, queries
internal/search: FTS and result ranking
internal/syncer: full sync and repair orchestration
internal/embed: embedding queue and provider integration

Recommended Dependencies

Reasonable picks:

Discord client: github.com/bwmarrin/discordgo
TOML parser: something small and maintained
SQLite driver: pick one path and stay consistent
vector search: sqlite-vec

Guidance:

keep dependency count low
prefer boring stable libraries
avoid frameworks

Milestones

Milestone 1

config loader
init
status
DB open + migrations

Milestone 2

guild metadata sync
channel sync
member sync

Milestone 3

full message backfill
incremental checkpoints
FTS indexing

Milestone 4

search
sql
members
channels

Milestone 5

tail
reconnect logic
repair loop

Milestone 6

embedding queue
vector search
hybrid ranking

What The Repo Must Eventually Contain

For an AI agent to finish the product without external memory, this repo should contain:

this spec
README with user-facing overview
schema and migration files
config sample
CLI contract
implementation package layout
token discovery rules
API key discovery rules
milestone order

This file is the authoritative engineering spec for now.
