A web application that provides an AI-powered chatbot interface for dataset discovery, using the Google Gemini API on the backend and a React-based frontend.
- Prerequisites
- Setup
- Database Setup
- Running the Application
- Data Processing Pipeline
- Deployment
- API Documentation
- Environment Configuration
- Python: 3.11 or higher
- Node.js: 18.x or higher (for frontend development)
- Google API Key for Gemini
- Google Cloud Platform Account (for BigQuery and Vertex AI)
- UV package manager (for backend environment & dependencies)
- Docker & Docker Compose (optional, for containerized deployment)
```bash
git clone https://github.com/INCF/knowledge-space-agent.git
cd knowledge-space-agent
```
- Windows:
```bash
pip install uv
```
- macOS/Linux: Follow the official guide: https://docs.astral.sh/uv/getting-started/installation/
Create a file named `.env` in the project root based on `.env.template`. You can choose between two authentication modes (a minimal example is shown after the note below):

Option 1: Google API Key (Recommended for development)
- Set `GOOGLE_API_KEY` in your `.env` file

Option 2: Vertex AI (Recommended for production)
- Configure Google Cloud credentials and Vertex AI settings as shown in `.env.template`

Note: Do not commit `.env` to version control.
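A minimal `.env` for Option 1 might look like the sketch below. Only `GOOGLE_API_KEY` is named in this guide; for Option 2 (Vertex AI), copy the exact variable names from `.env.template` rather than inventing them.

```bash
# .env — minimal sketch for Option 1 (Google API Key); do not commit this file
GOOGLE_API_KEY=your-gemini-api-key-here
# For Option 2 (Vertex AI), use the variable names listed in .env.template
```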
```bash
# Create a virtual environment using UV
uv venv

# Activate it:
# On Windows (cmd):
.venv\Scripts\activate
# On macOS/Linux:
source .venv/bin/activate
```
With the virtual environment activated:
```bash
uv sync
```
Then install the frontend dependencies:
```bash
cd frontend
npm install
```
- Install Google Cloud CLI and Authenticate:
```bash
# Install Google Cloud CLI
curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz
tar -xf google-cloud-cli-linux-x86_64.tar.gz
./google-cloud-sdk/install.sh

# Initialize and authenticate
gcloud init
gcloud auth application-default login
```
Configuration details for BigQuery and Vertex AI services are provided in the .env.template file.
In one terminal, from the project root with the virtual environment active:
```bash
uv run main.py
```
By default, this starts the backend server on port 8000. Adjust the configuration if you need a different port.

In another terminal:
```bash
cd frontend
npm start
```
This starts the React development server, typically on http://localhost:5000.

Open your browser to:
http://localhost:5000

The frontend will communicate with the backend at port 8000.
- Docker and Docker Compose installed
- `.env` file configured with required environment variables
To build and start both the backend and frontend in containers:
```bash
docker-compose up --build
```
- Frontend → http://localhost:3000
- Backend health → http://localhost:8000/api/health

Backend only:
```bash
docker build -t knowledge-space-backend ./backend
docker run -p 8000:8000 --env-file .env knowledge-space-backend
```
Frontend only:
```bash
docker build -t knowledge-space-frontend ./frontend
docker run -p 3000:3000 knowledge-space-frontend
```

This repository provides a set of Python scripts and modules to ingest, clean, and enrich neuroscience metadata from Google Cloud Storage, as well as scrape identifiers and references from linked resources.
- Elasticsearch Scraping: The `ksdata_scraping.py` script harvests raw dataset records directly from our Elasticsearch cluster and writes them to GCS. It uses a Point-In-Time (PIT) scroll to page through each index safely, authenticating via credentials stored in your environment.
- GCS I/O: Download raw JSON lists from `gs://ks_datasets/raw_dataset/...` and upload preprocessed outputs to `gs://ks_datasets/preprocessed_data/...`.
- HTML Cleaning: Strip or convert embedded HTML (e.g. `<a>` tags) into plain text or Markdown.
- URL Extraction: Find and dedupe all links in descriptions and metadata for later retrieval.
- Chunk Construction: Build semantic "chunks" by concatenating fields (title, description, context labels, etc.) for downstream vectorization (see the sketch after this list).
- Metadata Filters: Assemble structured metadata dictionaries (`species`, `region`, `keywords`, `identifier1…n`, etc.) for each record.
- Per-Datasource Preprocessing: Each data source has its own preprocessing script (e.g. `scr_017041_dandi.py`, `scr_006274_neuroelectro_ephys.py`), with outputs saved in `gs://ks_datasets/preprocessed_data/`.
- Extensible Configs: Easily add new datasources by updating GCS paths and field mappings.
To update the vector store with new datasets from Knowledge Space, run:
```bash
python data_processing/full_pipeline.py
```
The script performs a complete data processing workflow:
- Scrapes all data - Runs preprocessing scripts to collect data from Knowledge Space datasources
- Generates hashes - Creates unique hash-based datapoint IDs for all chunks
- Matches BigQuery datapoint IDs - Queries existing data to find what's already processed
- Selects new/unique data - Identifies only new chunks that need processing
- Creates embeddings - Generates vector embeddings for new chunks only
- Upserts to vector store - Uploads new embeddings to Vertex AI Matching Engine
- Inserts to BigQuery - Stores new chunk metadata and content
This completes the update process with only new data, avoiding reprocessing existing content; the sketch below illustrates the hash-based selection of new chunks.
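A minimal sketch of the "generate hashes" and "select new/unique data" steps, assuming the chunk text alone determines the datapoint ID; the actual ID scheme and BigQuery lookup used by `full_pipeline.py` may differ, and the function names here are illustrative.

```python
# Illustrative sketch of hash-based datapoint IDs and selection of new chunks.
import hashlib


def datapoint_id(chunk_text: str) -> str:
    # Deterministic ID: the same chunk text always maps to the same ID,
    # so re-running the pipeline never re-embeds unchanged content.
    return hashlib.sha256(chunk_text.encode("utf-8")).hexdigest()


def select_new_chunks(chunks: list[str], existing_ids: set[str]) -> dict[str, str]:
    # Keep only chunks whose IDs are not already present (e.g. in BigQuery).
    candidates = {datapoint_id(text): text for text in chunks}
    return {dp_id: text for dp_id, text in candidates.items() if dp_id not in existing_ids}


# Example: only the second chunk would be embedded and upserted.
existing = {datapoint_id("already processed chunk")}
new = select_new_chunks(["already processed chunk", "brand new chunk"], existing)
```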
- VM: Debian/Ubuntu server with Docker & Docker Compose installed
- Firewall: Open ports 80 and 443 (http-server, https-server tags on GCP)
- DNS: Domain pointing to your server's external IP
- SSL: Caddy will auto-provision Let's Encrypt certificates
- Clean Previous Deployments:
```bash
cd ~/knowledge-space-agent || true

# Stop current stack
sudo docker compose down || true

# Clean Docker cache and old images
sudo docker system prune -af
sudo docker builder prune -af

# Optional: Clear HF model cache (will re-download on first use)
sudo docker volume rm knowledge-space-agent_hf_cache 2>/dev/null || true

# Stop host nginx if installed
sudo systemctl stop nginx || true
sudo systemctl disable nginx || true
```
- Create Required Configuration Files:

Environment file: Create `.env` based on `.env.template` with your specific values.

Caddy configuration (`Caddyfile`):
```
your-domain.com, www.your-domain.com {
    reverse_proxy frontend:80
    encode gzip
    header {
        Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"
    }
}
```

Frontend Nginx: The nginx configuration is already provided in `frontend/nginx.conf`.
- Deploy Stack:
```bash
cd ~/knowledge-space-agent
sudo docker compose up -d --build
sudo docker compose ps
```
- Verify Deployment:
```bash
# Check services are running
sudo docker compose ps

# Test local endpoints
curl -I http://127.0.0.1/
curl -sS http://127.0.0.1/api/health

# Test public HTTPS
curl -I https://your-domain.com/
curl -sS https://your-domain.com/api/health
```
View logs:
```bash
sudo docker compose logs -f backend
sudo docker compose logs -f frontend
sudo docker compose logs -f caddy
```
Update and redeploy:
```bash
git pull
sudo docker compose up -d --build
```
Status check:
```bash
sudo docker compose ps
```
Backend unhealthy:
```bash
sudo docker inspect -f '{{json .State.Health}}' knowledge-space-agent-backend-1
```
502/504 errors:
```bash
sudo docker exec -it knowledge-space-agent-frontend-1 sh -c 'wget -S -O- http://backend:8000/health'
```
DNS issues:
```bash
dig +short your-domain.com
curl -s -H "Metadata-Flavor: Google" http://metadata/computeMetadata/v1/instance/network-interfaces/0/access-configs/0/external-ip
```
- Development: http://localhost:8000
- Production: https://your-domain.com
GET /
- Description: Root endpoint, returns service status
- Response:
```json
{
  "message": "KnowledgeSpace AI Backend is running",
  "version": "2.0.0"
}
```
GET /health
- Description: Basic health check for Docker/load balancers
- Response:
```json
{
  "status": "healthy",
  "timestamp": "2024-01-01T12:00:00.000Z",
  "service": "knowledge-space-agent-backend",
  "version": "2.0.0"
}
```
GET /api/health
- Description: Detailed health check with component status
- Response:
```json
{
  "status": "healthy",
  "version": "2.0.0",
  "components": {
    "vector_search": "enabled|disabled",
    "llm": "enabled|disabled",
    "keyword_search": "enabled"
  },
  "timestamp": "2024-01-01T12:00:00.000Z"
}
```
POST /api/chat
- Description: Send a query to the neuroscience assistant
- Request Body:
```json
{
  "query": "Find datasets about motor cortex recordings",
  "session_id": "optional-session-id",
  "reset": false
}
```
- Response:
```json
{
  "response": "I found several datasets related to motor cortex recordings...",
  "metadata": {
    "process_time": 2.5,
    "session_id": "default",
    "timestamp": "2024-01-01T12:00:00.000Z",
    "reset": false
  }
}
```
POST /api/session/reset
- Description: Clear conversation history for a session
- Request Body:
```json
{
  "session_id": "session-to-reset"
}
```
- Response:
```json
{
  "status": "ok",
  "session_id": "session-to-reset",
  "message": "Session cleared"
}
```
504 Gateway Timeout
```json
{
  "detail": "Request timed out. Please try with a simpler query."
}
```
500 Internal Server Error
```json
{
  "response": "Error: [error description]",
  "metadata": {
    "error": true,
    "session_id": "session-id"
  }
}
```
For required environment variables, see .env.template in the project root.
- Environment: Make sure `.env` is present before starting the backend.
- Ports: If ports 5000 or 8000 are in use, adjust scripts/configuration accordingly.
- UV Commands:
  - `uv venv` creates the virtual environment.
  - `uv sync` installs dependencies as defined in your project's config.
- Troubleshooting:
  - Verify Python version (`python --version`) and that dependencies installed correctly.
  - Ensure the `.env` file syntax is correct (no extra quotes).
  - For frontend issues, check Node.js version (`node --version`) and logs in the terminal.