
Conversation

@tsbhangu
Contributor

Summary

  • Add database schema and migration for websites table
  • Add API endpoints for website indexing
  • Integrate with IndexSourceDb for source tracking
  • Add Turbopuffer sync for vector search

Details

This PR builds on the website crawler infrastructure (#4656) to add the API and database layer:

Database:

  • New websites table migration
  • WebsiteDb model with full metadata support
  • Integration with IndexSourceDb for job tracking

API Endpoints:

  • POST /sources/website/{domain}/index - Start website crawling
  • GET /sources/website/{domain}/status - Check crawl job status
  • GET /sources/website/{domain}/{website_id} - Get specific page
  • GET /sources/website/{domain} - List all indexed pages
  • POST /sources/website/{domain}/reindex - Re-crawl website
  • DELETE /sources/website/{domain}/delete - Delete specific website
  • DELETE /sources/website/{domain}/delete-all - Delete all websites
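
The route surface above can be sketched as a small dispatch table. This is an illustrative stdlib sketch, not the PR's actual FastAPI wiring; the handler names (`start_crawl`, `crawl_status`, etc.) are hypothetical:

```python
import re

# Route templates from the PR description; handler names are illustrative.
# Order matters: literal segments like /status must precede {website_id}.
ROUTES = [
    ("POST",   "/sources/website/{domain}/index",        "start_crawl"),
    ("GET",    "/sources/website/{domain}/status",       "crawl_status"),
    ("GET",    "/sources/website/{domain}/{website_id}", "get_page"),
    ("GET",    "/sources/website/{domain}",              "list_pages"),
    ("POST",   "/sources/website/{domain}/reindex",      "reindex"),
    ("DELETE", "/sources/website/{domain}/delete",       "delete_one"),
    ("DELETE", "/sources/website/{domain}/delete-all",   "delete_all"),
]

def resolve(method: str, path: str):
    """Return (handler, path_params) for the first matching template."""
    for m, template, handler in ROUTES:
        if m != method:
            continue
        # Turn {name} placeholders into named regex groups that stop at "/".
        pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template) + "$"
        match = re.match(pattern, path)
        if match:
            return handler, match.groupdict()
    return None, {}
```

Listing `/status` before `/{website_id}` is what keeps the literal routes from being swallowed by the parameterized one.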

Features:

  • Background job processing for crawling
  • Real-time status tracking
  • Automatic Turbopuffer sync for search
  • Proper error handling and rollback

Dependencies

Test plan

  • Database migrations verified
  • Background jobs tested with real crawling

@tsbhangu tsbhangu requested a review from eyw520 as a code owner October 31, 2025 23:16
@vercel
Contributor

vercel bot commented Oct 31, 2025

The latest updates on your projects.

Project                   Status    Updated (UTC)
dev.ferndocs.com          Error     Nov 5, 2025 4:47am
fern-dashboard            Ready     Nov 5, 2025 4:47am
fern-dashboard-dev        Ready     Nov 5, 2025 4:47am
ferndocs.com              Ready     Nov 5, 2025 4:47am
preview.ferndocs.com      Error     Nov 5, 2025 4:47am
prod-assets.ferndocs.com  Error     Nov 5, 2025 4:47am
prod.ferndocs.com         Error     Nov 5, 2025 4:47am

1 Skipped Deployment

Project                   Status    Updated (UTC)
fern-platform             Ignored   Nov 5, 2025 4:47am

@tsbhangu tsbhangu force-pushed the tanvir/website-database-api-routes branch from 01a17ba to 9925581 Compare October 31, 2025 23:20
from fai.utils.website.crawler import DocumentationCrawler
from fai.utils.website.extractor import ContentExtractor
from fai.utils.website.models import DocumentChunk
from fai.utils.website.jobs import crawl_website_job
Member

We use direct imports throughout the codebase; let's remove the `__init__.py` file.

Contributor Author

Got it!

    port=8080,
    server_header=False,
-   reload=VARIABLES.IS_LOCAL,
+   reload=False,
Member

I'd rather keep the local reload behavior - thoughts?

Contributor Author

Yup, sorry! That had sneaked in from testing.



class ReindexWebsiteRequest(BaseModel):
base_url: str = Field(description="The base URL to re-crawl (will delete old pages and re-index)")
Member

Noticing this reuses the previous config. What if we wanted to update settings like the chunk overlap?

Contributor Author

That's fair. We can add them all as optional arguments and override the initial config with the values someone specifies.

tsbhangu and others added 9 commits November 4, 2025 21:38
- Add database migration for websites table
- Add WebsiteDb model and API types
- Add API endpoints for website indexing (/sources/website/{domain}/...)
- Add integration with IndexSourceDb for job tracking
- Add Turbopuffer sync functionality for vector search
- Update OpenAPI spec with new endpoints

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Remove obvious comments that don't add value
- Keep comments that explain non-obvious logic (e.g., status determination, fresh DB sessions)
- Extract crawl_website_job to utils/website/jobs.py for better separation
- Add WebsiteCrawlConfig domain model with default values
- Implement selective sync functions for websites (sync_websites_to_tpuf, sync_websites_to_query_index)
- Track website IDs during crawl for incremental syncing
- Update delete operations to use selective deletion
- Add comprehensive test suite (12 route tests + 10 sync tests)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…ation

- Fixed af951c45da91 to reference 1a06a4d351f9 instead of missing 2d743e49aaa1
- Created merge migration to combine two branches from initial schema
- Regenerated websites table migration with proper revision chain
- Migration chain: 1a06a4d351f9 -> [af951c45da91, 62afaf912daa] -> 7440621afbb0 -> 8e63cf285ea3

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
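
The revision chain from that commit message can be sanity-checked mechanically. This sketch mirrors alembic's `down_revision` convention (a merge revision has a tuple of parents); the revision IDs come from the commit message above, and the helper functions are illustrative:

```python
# Each revision maps to its down_revision(s); None marks the base.
CHAIN = {
    "1a06a4d351f9": None,                              # initial schema
    "af951c45da91": "1a06a4d351f9",                    # re-pointed branch
    "62afaf912daa": "1a06a4d351f9",                    # sibling branch
    "7440621afbb0": ("af951c45da91", "62afaf912daa"),  # merge migration
    "8e63cf285ea3": "7440621afbb0",                    # websites table
}

def parents(rev: str) -> set[str]:
    down = CHAIN[rev]
    if down is None:
        return set()
    return set(down) if isinstance(down, tuple) else {down}

def reachable_base(rev: str) -> bool:
    """True if every path from rev terminates at the base revision."""
    ps = parents(rev)
    if not ps:
        return CHAIN[rev] is None
    return all(reachable_base(p) for p in ps)
```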