# 🐛 Bug Fix: Sitemap Crawling Not Working for TypeScript and Other Major Docs Sites #885

Nachx639 · 2025-11-26T12:58:36Z

Problem

I discovered that Archon's sitemap crawler has two critical bugs that prevent indexing of major documentation sites like TypeScript, Next.js, and others that use multi-level sitemap structures:

Discovered sitemaps are stored as single files instead of being crawled
Sitemap indexes (sitemaps that reference other sitemaps) are not parsed recursively

Example

When crawling https://www.typescriptlang.org/docs/:

❌ Before: Only stored the sitemap-index.xml file (1 chunk)
✅ After: Successfully crawled all documentation pages (300+ chunks)
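
To make the second bug concrete, here is a minimal sketch using made-up XML. The example.com URLs and the extract_locs helper are hypothetical and not part of Archon; the XML follows the sitemaps.org layout. It applies the same `.//{*}loc` lookup that parse_sitemap uses and shows that, for a sitemap index, the extracted entries are child sitemaps rather than documentation pages:

```python
from xml.etree import ElementTree

# Hypothetical sitemap index: its <loc> entries point at other sitemaps.
SITEMAP_INDEX = b"""<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-0.xml</loc></sitemap>
</sitemapindex>"""

# Hypothetical child sitemap: its <loc> entries point at real pages.
CHILD_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/handbook</loc></url>
</urlset>"""


def extract_locs(xml_bytes: bytes) -> list:
    """Return the text of every <loc> element, ignoring the XML namespace."""
    tree = ElementTree.fromstring(xml_bytes)
    return [loc.text for loc in tree.findall(".//{*}loc") if loc.text]


print(extract_locs(SITEMAP_INDEX))
# ['https://example.com/sitemap-0.xml']
print(extract_locs(CHILD_SITEMAP))
# ['https://example.com/docs/intro', 'https://example.com/docs/handbook']
```

Without a second pass over the child sitemap, a crawler that stops at the index never sees the page URLs.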

Solution

I've identified and fixed both issues. Here are the code changes:

Fix 1: Enable Sitemap URL Crawling When Discovered

File: python/src/server/services/crawling/crawling_service.py

Change: Comment out the "single-file mode" logic that was preventing sitemap URL extraction.

# Around line 1022-1033

        elif self.url_handler.is_sitemap(url):
            # Handle sitemaps
            crawl_type = "sitemap"
            await update_crawl_progress(
                50,  # 50% of crawling stage
                "Detected sitemap, parsing URLs...",
                crawl_type=crawl_type
            )

            # DISABLED: If this sitemap was selected by discovery, just return the sitemap itself (single-file mode)
            # This was preventing actual crawling of sitemap URLs for sites like TypeScript
            # if request.get("is_discovery_target"):
            #     logger.info(f"Discovery single-file mode: returning sitemap itself without crawling URLs from {url}")
            #     crawl_type = "discovery_sitemap"
            #     # Return the sitemap file as the result
            #     crawl_results = [{
            #         'url': url,
            #         'markdown': f"# Sitemap: {url}\n\nThis is a sitemap file discovered and returned in single-file mode.",
            #         'title': f"Sitemap - {self.url_handler.extract_display_name(url)}",
            #         'crawl_type': crawl_type
            #     }]
            #     return crawl_results, crawl_type

            sitemap_urls = self.parse_sitemap(url)

Why this fixes it: Previously, when discovery selected a sitemap, the code returned early with a single placeholder record for the sitemap itself, so none of the URLs it lists were crawled. Now execution continues to parse_sitemap(), which extracts the actual page URLs.

Fix 2: Add Recursive Sitemap Index Parsing

File: python/src/server/services/crawling/strategies/sitemap.py

Change: Replace the entire parse_sitemap method with recursive parsing logic.

def parse_sitemap(self, sitemap_url: str, cancellation_check: Callable[[], None] | None = None) -> list[str]:
    """
    Parse a sitemap and extract URLs with comprehensive error handling.
    Supports recursive parsing of sitemap indexes.
    
    Args:
        sitemap_url: URL of the sitemap to parse
        cancellation_check: Optional function to check for cancellation
        
    Returns:
        List of URLs extracted from the sitemap (page URLs, not sitemap URLs)
    """
    urls = []

    try:
        # Check for cancellation before making the request
        if cancellation_check:
            try:
                cancellation_check()
            except asyncio.CancelledError:
                logger.info("Sitemap parsing cancelled by user")
                raise  # Re-raise to let the caller handle progress reporting

        logger.info(f"Parsing sitemap: {sitemap_url}")
        resp = requests.get(sitemap_url, timeout=30)

        if resp.status_code != 200:
            logger.error(f"Failed to fetch sitemap: HTTP {resp.status_code}")
            return urls

        try:
            tree = ElementTree.fromstring(resp.content)
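            # Strip whitespace: <loc> values may be padded with newlines or spaces
            extracted_urls = [loc.text.strip() for loc in tree.findall('.//{*}loc') if loc.text and loc.text.strip()]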
            
            if not extracted_urls:
                logger.warning(f"No URLs found in sitemap: {sitemap_url}")
                return urls
            
            # Detect if this is a sitemap index (contains references to other sitemaps)
            # Heuristic: If URLs end with .xml or contain 'sitemap', they're likely child sitemaps
            sitemap_urls = []
            page_urls = []
            
            for url in extracted_urls:
                # Check if this looks like a sitemap URL
                if url.endswith('.xml') or 'sitemap' in url.lower():
                    sitemap_urls.append(url)
                else:
                    page_urls.append(url)
            
            # If we found sitemap URLs, this is a sitemap index - parse them recursively
            if sitemap_urls:
                logger.info(f"Detected sitemap index with {len(sitemap_urls)} child sitemaps. Parsing recursively...")
                
                for child_sitemap_url in sitemap_urls:
                    if cancellation_check:
                        try:
                            cancellation_check()
                        except asyncio.CancelledError:
                            logger.info("Sitemap parsing cancelled during recursive parsing")
                            raise
                    
                    # Recursively parse child sitemap
                    logger.info(f"Parsing child sitemap: {child_sitemap_url}")
                    child_urls = self.parse_sitemap(child_sitemap_url, cancellation_check)
                    page_urls.extend(child_urls)
                
                logger.info(f"Sitemap index parsing completed: {len(page_urls)} total URLs from {len(sitemap_urls)} child sitemaps")
            else:
                logger.info(f"Successfully extracted {len(page_urls)} page URLs from sitemap")
            
            urls = page_urls

        except ElementTree.ParseError:
            logger.exception(f"Error parsing sitemap XML from {sitemap_url}")
        except Exception:
            logger.exception(f"Unexpected error parsing sitemap from {sitemap_url}")

    except requests.exceptions.RequestException:
        logger.exception(f"Network error fetching sitemap from {sitemap_url}")
    except Exception:
        logger.exception(f"Unexpected error in sitemap parsing for {sitemap_url}")

    return urls

Key changes:

Sitemap detection: Treat an extracted URL as a child sitemap if it ends with .xml or contains 'sitemap'
Recursive parsing: When child sitemaps are detected, parse each one recursively
Return only pages: Filter out sitemap URLs and return only the actual documentation page URLs
Enhanced logging: Log a clear message whenever a sitemap index is detected
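
For reference, here is how the URL test behaves on a few sample URLs. This is a small sketch: looks_like_sitemap is a hypothetical helper that copies the condition from parse_sitemap, and the example.com URLs are invented. It also shows the false positives that motivate my second question below:

```python
def looks_like_sitemap(url: str) -> bool:
    """Same condition as in parse_sitemap: '.xml' suffix or 'sitemap' anywhere in the URL."""
    return url.endswith(".xml") or "sitemap" in url.lower()


samples = [
    "https://example.com/sitemap-0.xml",       # child sitemap
    "https://example.com/docs/handbook",       # ordinary page
    "https://example.com/docs/sitemap-guide",  # page whose path mentions "sitemap"
    "https://example.com/api/schema.xml",      # XML file that is not a sitemap
]

for url in samples:
    print(looks_like_sitemap(url), url)
# True https://example.com/sitemap-0.xml
# False https://example.com/docs/handbook
# True https://example.com/docs/sitemap-guide
# True https://example.com/api/schema.xml
```

With the heuristic as written, the last two URLs would be fetched as child sitemaps and dropped from the page list.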

Testing

Tested successfully with:

✅ TypeScript: https://www.typescriptlang.org/docs/ (sitemap-index → sitemap-0.xml → ~300 pages)
✅ Single-level sitemaps: Still work correctly (backwards compatible)

How to verify the fix:

# Check logs during crawl
docker compose logs archon-server -f | grep sitemap

# You should see:
# "Detected sitemap index with 1 child sitemaps. Parsing recursively..."
# "Parsing child sitemap: https://www.typescriptlang.org/sitemap-0.xml"
# "Successfully extracted XXX page URLs from sitemap"

Impact

This fix should let Archon properly index documentation sites that use multi-level sitemap structures. I have only verified TypeScript so far, but I expect the same for:

TypeScript (sitemap-index.xml)
Next.js
React
Many other major framework docs

Before this fix, such sites were not indexed properly: only a single entry for the sitemap itself was stored instead of the actual documentation.

Questions

Should we add a max recursion depth limit to prevent infinite loops on malformed sitemaps?
Should sitemap URLs be filtered more strictly (any other patterns besides .xml and 'sitemap')?
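
One possible direction for both questions, as a rough sketch that has not been run against Archon. The names collect_page_urls, fetch, and MAX_DEPTH are hypothetical, and the depth value is an arbitrary assumption. It recognizes an index by its root element name (sitemapindex in the sitemaps.org protocol) instead of by URL pattern, and guards against loops with a depth limit and a visited set:

```python
from typing import Callable, Optional
from xml.etree import ElementTree

MAX_DEPTH = 3  # assumed limit for illustration; not a value taken from Archon


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def collect_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], Optional[bytes]],
    max_depth: int = MAX_DEPTH,
    _depth: int = 0,
    _seen: Optional[set] = None,
) -> list:
    """Return page URLs, following sitemap indexes up to max_depth levels."""
    seen = _seen if _seen is not None else set()
    # Stop on excessive nesting or on a sitemap we have already visited.
    if _depth > max_depth or sitemap_url in seen:
        return []
    seen.add(sitemap_url)

    xml_bytes = fetch(sitemap_url)
    if xml_bytes is None:
        return []
    try:
        root = ElementTree.fromstring(xml_bytes)
    except ElementTree.ParseError:
        return []

    # An index is identified by its root element, not by how its URLs look.
    if local_name(root.tag) == "sitemapindex":
        pages = []
        for loc in root.findall("{*}sitemap/{*}loc"):
            child = (loc.text or "").strip()
            if child:
                pages.extend(
                    collect_page_urls(child, fetch, max_depth, _depth + 1, seen)
                )
        return pages

    # Otherwise treat the document as a <urlset> of pages.
    return [
        loc.text.strip()
        for loc in root.findall("{*}url/{*}loc")
        if loc.text and loc.text.strip()
    ]


# In-memory stand-in for HTTP; the index deliberately references itself.
NS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
FAKE_SITE = {
    "https://example.com/sitemap-index.xml": (
        f"<sitemapindex {NS}>"
        "<sitemap><loc>https://example.com/sitemap-0.xml</loc></sitemap>"
        "<sitemap><loc>https://example.com/sitemap-index.xml</loc></sitemap>"
        "</sitemapindex>"
    ).encode(),
    "https://example.com/sitemap-0.xml": (
        f"<urlset {NS}>"
        "<url><loc>https://example.com/docs/intro</loc></url>"
        "<url><loc>https://example.com/docs/handbook</loc></url>"
        "</urlset>"
    ).encode(),
}

print(collect_page_urls("https://example.com/sitemap-index.xml", FAKE_SITE.get))
# ['https://example.com/docs/intro', 'https://example.com/docs/handbook']
```

Passing the fetcher in as a parameter keeps the sketch testable without network access; in the real method it would wrap the existing requests.get call and the cancellation check.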

Let me know if you'd like me to create a PR with these changes!
