
fix: ignore whitespace-only Disallow paths in extractUrlsFromRobotsTxt#1973

Open
juliosuas wants to merge 1 commit into smicallef:master from juliosuas:fix/robots-txt-whitespace-disallow

Conversation

@juliosuas

Summary

Fixes #701. Resolves the TODO in the docstring.

Problem

The regex r'disallow:\s*(.[^ #]*)' used . as the first character of the capture group, and . matches any character, including a space. As a result, a line like 'Disallow: ' (a whitespace-only path) was returned as ' ', adding an invalid disallowed path to the crawl-exclusion list.

# Before
extractUrlsFromRobotsTxt('Disallow: ')   # returns [' '] ← wrong
extractUrlsFromRobotsTxt('Disallow:  ')  # returns [' '] ← wrong
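The faulty capture can be reproduced directly with Python's re module, without the surrounding function (a standalone sketch; the pattern is the one quoted above):

```python
import re

# The original pattern: the leading '.' lets the capture group start with a space.
buggy = re.compile(r'disallow:\s*(.[^ #]*)', re.IGNORECASE)

m = buggy.match('Disallow: ')
print(repr(m.group(1)))   # prints ' ' -- a bare space captured as a "path"
```

Because \s* backtracks to leave one space for the mandatory ., the group always ends up holding that space when no real path follows.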

Per the robots.txt spec, Disallow: with no path (or only whitespace) means allow all and should produce no exclusion entries.

Fix

Replace the leading . with \S so only paths that begin with a non-whitespace character are captured:

# After
m = re.match(r'disallow:\s*(\S[^ #]*)', line, re.IGNORECASE)
extractUrlsFromRobotsTxt('Disallow: ')          # returns []  ✓
extractUrlsFromRobotsTxt('Disallow: /admin')    # returns ['/admin']  ✓
extractUrlsFromRobotsTxt('Disallow: /p#comment') # returns ['/p']  ✓
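For context, here is a self-contained sketch of a parser built around the patched pattern. The helper name and structure are illustrative only; the real extractUrlsFromRobotsTxt in sflib.py differs in its details:

```python
import re

def extract_disallow_paths(robots_txt):
    # Illustrative re-implementation of the patched behaviour: collect
    # Disallow paths, skipping whitespace-only ones via the leading \S.
    paths = []
    for line in robots_txt.splitlines():
        m = re.match(r'disallow:\s*(\S[^ #]*)', line.strip(), re.IGNORECASE)
        if m:
            paths.append(m.group(1))
    return paths

print(extract_disallow_paths('Disallow: '))            # []
print(extract_disallow_paths('Disallow: /admin'))      # ['/admin']
print(extract_disallow_paths('Disallow: /p#comment'))  # ['/p']
```

Note that the trailing [^ #]* still stops the capture at a '#', which is what drops the inline comment in the last example.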

The regex r'disallow:\s*(.[^ #]*)' used '.' as the first character of the
capture group, which matches any character including a space.  This caused
'Disallow: ' (a whitespace-only path) to be returned as ' ', adding an
invalid disallowed path to the list.

Per the robots.txt specification, 'Disallow: ' with no non-whitespace
content means 'allow all' and should be treated as an empty/no-op rule.

Fix: replace the leading '.' with '\S' so only paths that start with a
non-whitespace character are captured.  This resolves the TODO comment
that had been in the docstring since the original implementation.

Fixes smicallef#701
@juliosuas
Author

A bit more detail: per the robots.txt spec, Disallow: with no path (or only whitespace) means "allow all" — it should not add any entry to the disallowed list. The old regex r'disallow:\s*(.[^ #]*)' used . which matches any character including space, so Disallow: (space-only) was returned as ' ' — an invalid path. The \S fix ensures only paths starting with a real non-whitespace character are captured. Also removed the TODO comment from the docstring since it's now addressed.



Development

Successfully merging this pull request may close these issues.

TODO: sflib.py: fix whitespace parsing; ie, " " is not a valid disallowed path
