# 🐛 Bug Fix: Sitemap Crawling Not Working for TypeScript and Other Major Docs Sites #885
Nachx639 started this conversation in Ideas / Feature Requests
Problem
I discovered that Archon's sitemap crawler has two critical bugs that prevent indexing of major documentation sites like TypeScript, Next.js, and others that use multi-level sitemap structures:
Example
When crawling https://www.typescriptlang.org/docs/:
Solution
I've identified and fixed both issues. Here are the code changes:
Fix 1: Enable Sitemap URL Crawling When Discovered
File: python/src/server/services/crawling/crawling_service.py
Change: Comment out the "single-file mode" logic that was preventing sitemap URL extraction.
Why this fixes it: Previously, when discovery found a sitemap, the code returned early with just the sitemap XML. Now it continues to parse_sitemap() to extract the actual page URLs.
Fix 2: Add Recursive Sitemap Index Parsing
File: python/src/server/services/crawling/strategies/sitemap.py
Change: Replace the entire parse_sitemap method with recursive parsing logic.
Key changes:
.xml or contains 'sitemap')
Testing
Tested successfully with:
https://www.typescriptlang.org/docs/ (sitemap-index → sitemap-0.xml → ~300 pages)
How to verify the fix:
Impact
This fix enables Archon to properly index any documentation site using multi-level sitemap structures, including:
Before this fix, these sites would fail to index properly, storing only the sitemap XML instead of actual documentation.
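
For readers who want a concrete picture of the recursive parsing described in Fix 2, here is a minimal sketch. It is not the actual Archon code: the `fetch` callable, the `max_depth` limit, the `looks_like_sitemap` and `_local` helpers, and the example.com URLs are all assumptions made for illustration. It treats any `<loc>` inside a `<sitemapindex>` as a nested sitemap and also applies the URL heuristic mentioned above (ends with .xml or contains 'sitemap').

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set


def _local(tag: str) -> str:
    """Strip an XML namespace prefix: '{ns}loc' -> 'loc'."""
    return tag.rsplit("}", 1)[-1]


def looks_like_sitemap(url: str) -> bool:
    """Heuristic from the post: URL ends with .xml or contains 'sitemap'."""
    lowered = url.lower()
    return lowered.endswith(".xml") or "sitemap" in lowered


def parse_sitemap(
    url: str,
    fetch: Callable[[str], str],  # hypothetical: returns the XML body for a URL
    max_depth: int = 3,           # hypothetical guard against very deep nesting
    _seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return page URLs, following nested sitemaps recursively."""
    seen = set() if _seen is None else _seen
    if url in seen or max_depth < 0:
        return []  # already visited (cycle) or nested too deeply
    seen.add(url)

    try:
        root = ET.fromstring(fetch(url))
    except ET.ParseError:
        return []  # not valid XML, so nothing to extract

    is_index = _local(root.tag) == "sitemapindex"
    pages: List[str] = []
    for entry in root:  # each <sitemap> or <url> element
        loc_el = next((c for c in entry if _local(c.tag) == "loc"), None)
        if loc_el is None or not loc_el.text:
            continue
        loc = loc_el.text.strip()
        if is_index or looks_like_sitemap(loc):
            pages.extend(parse_sitemap(loc, fetch, max_depth - 1, seen))
        else:
            pages.append(loc)
    return pages


if __name__ == "__main__":
    ns = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
    # In-memory stand-in for HTTP responses (illustrative URLs only).
    docs = {
        "https://example.com/sitemap-index.xml":
            f"<sitemapindex {ns}><sitemap>"
            "<loc>https://example.com/sitemap-0.xml</loc>"
            "</sitemap></sitemapindex>",
        "https://example.com/sitemap-0.xml":
            f"<urlset {ns}>"
            "<url><loc>https://example.com/docs/a</loc></url>"
            "<url><loc>https://example.com/docs/b</loc></url>"
            "</urlset>",
    }
    print(parse_sitemap("https://example.com/sitemap-index.xml", docs.__getitem__))
```

Passing `fetch` in as a parameter keeps the sketch testable without network access. One trade-off of the URL heuristic is that an ordinary page whose URL happens to contain 'sitemap' would be fetched as if it were a sitemap; checking the root element (`sitemapindex` vs `urlset`) avoids that for index files.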
Questions
.xml and 'sitemap')?
Let me know if you'd like me to create a PR with these changes!