Tanvir/website database api routes #4657
base: app
Conversation
Force-pushed from 01a17ba to 9925581
from fai.utils.website.crawler import DocumentationCrawler
from fai.utils.website.extractor import ContentExtractor
from fai.utils.website.models import DocumentChunk
from fai.utils.website.jobs import crawl_website_job
We use direct imports throughout the codebase; let's remove the __init__.py file.
Got it!
    port=8080,
    server_header=False,
-   reload=VARIABLES.IS_LOCAL,
+   reload=False,
I'd rather keep the local reload behavior - thoughts?
Yup, sorry, that sneaked in from testing.
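For reference, a minimal sketch of the entrypoint with the local reload behavior kept. Only the port, server_header, and reload settings come from the diff above; the "fai.main:app" module path and the VARIABLES import location are assumptions.

```python
import uvicorn

from fai.settings import VARIABLES  # assumed import path for the settings object

if __name__ == "__main__":
    uvicorn.run(
        "fai.main:app",              # placeholder app reference
        port=8080,
        server_header=False,
        reload=VARIABLES.IS_LOCAL,   # hot-reload only when running locally
    )
```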
class ReindexWebsiteRequest(BaseModel):
    base_url: str = Field(description="The base URL to re-crawl (will delete old pages and re-index)")
Noticing this reuses the previous config. What if we wanted to update settings like the chunk overlap?
That's fair. We can add them all as optional arguments and override the initial config with the values someone specifies.
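A rough sketch of that suggestion, assuming a Pydantic v2 request model. The override field names (max_pages, chunk_size, chunk_overlap) are illustrative, not the actual config fields.

```python
from pydantic import BaseModel, Field

class ReindexWebsiteRequest(BaseModel):
    base_url: str = Field(description="The base URL to re-crawl (will delete old pages and re-index)")
    # Optional overrides; anything left unset falls back to the stored config.
    max_pages: int | None = None      # illustrative field name
    chunk_size: int | None = None     # illustrative field name
    chunk_overlap: int | None = None  # illustrative field name

def merged_config(initial: dict, request: ReindexWebsiteRequest) -> dict:
    """Overlay only the fields the caller explicitly set onto the initial config."""
    overrides = request.model_dump(exclude={"base_url"}, exclude_none=True)
    return {**initial, **overrides}
```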
- Add database migration for websites table
- Add WebsiteDb model and API types
- Add API endpoints for website indexing (/sources/website/{domain}/...)
- Add integration with IndexSourceDb for job tracking
- Add Turbopuffer sync functionality for vector search
- Update OpenAPI spec with new endpoints
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
- Remove obvious comments that don't add value
- Keep comments that explain non-obvious logic (e.g., status determination, fresh DB sessions)
- Extract crawl_website_job to utils/website/jobs.py for better separation
- Add WebsiteCrawlConfig domain model with default values
- Implement selective sync functions for websites (sync_websites_to_tpuf, sync_websites_to_query_index)
- Track website IDs during crawl for incremental syncing
- Update delete operations to use selective deletion
- Add comprehensive test suite (12 route tests + 10 sync tests)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
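For illustration, a hedged sketch of what a WebsiteCrawlConfig domain model with defaults could look like; the field names and default values here are assumptions, not the PR's actual model.

```python
from dataclasses import dataclass

@dataclass
class WebsiteCrawlConfig:
    """Crawl settings persisted with the indexed website (illustrative fields)."""
    base_url: str
    max_pages: int = 500        # assumed default
    chunk_size: int = 1000      # assumed default
    chunk_overlap: int = 100    # assumed default
```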
…ation
- Fixed af951c45da91 to reference 1a06a4d351f9 instead of missing 2d743e49aaa1
- Created merge migration to combine two branches from initial schema
- Regenerated websites table migration with proper revision chain
- Migration chain: 1a06a4d351f9 -> [af951c45da91, 62afaf912daa] -> 7440621afbb0 -> 8e63cf285ea3
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
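The merge migration described here follows the standard Alembic pattern of a revision with two parents. A sketch, with the revision identifiers taken from the commit message and everything else illustrative:

```python
"""Merge the two initial-schema branches (sketch; revision ids from the commit message)."""

# Alembic revision identifiers; a tuple down_revision marks this as a merge point.
revision = "7440621afbb0"
down_revision = ("af951c45da91", "62afaf912daa")
branch_labels = None
depends_on = None


def upgrade() -> None:
    # Merge-only revision: no schema changes.
    pass


def downgrade() -> None:
    pass
```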
Force-pushed from 0879e22 to 46c7b9f
Summary
This PR builds on the website crawler infrastructure (#4656) to add the API and database layer:
Database:
- websites table migration
API Endpoints:
- POST /sources/website/{domain}/index - Start website crawling
- GET /sources/website/{domain}/status - Check crawl job status
- GET /sources/website/{domain}/{website_id} - Get specific page
- GET /sources/website/{domain} - List all indexed pages
- POST /sources/website/{domain}/reindex - Re-crawl website
- DELETE /sources/website/{domain}/delete - Delete specific website
- DELETE /sources/website/{domain}/delete-all - Delete all websites
Features:
Dependencies
Test plan
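A possible manual test pass against the endpoints listed above, assuming the server is running locally on port 8080; the docs.example.com domain and the request payload shape are placeholders.

```python
import httpx

BASE_URL = "http://localhost:8080"
DOMAIN = "docs.example.com"  # placeholder domain

with httpx.Client(base_url=BASE_URL) as client:
    # Start a crawl, then check the job status.
    client.post(f"/sources/website/{DOMAIN}/index", json={"base_url": f"https://{DOMAIN}"})
    print(client.get(f"/sources/website/{DOMAIN}/status").json())

    # List indexed pages, then re-crawl with the stored config.
    print(client.get(f"/sources/website/{DOMAIN}").json())
    client.post(f"/sources/website/{DOMAIN}/reindex", json={"base_url": f"https://{DOMAIN}"})

    # Clean up everything indexed for this domain.
    client.delete(f"/sources/website/{DOMAIN}/delete-all")
```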