Non-space whitespace characters are removed from anchor URL #266

@ranvis

Leading and trailing whitespace characters that are not HTML5 space characters are removed from the link value along with the space characters, so extracting or following the link fails.

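use feature 'say';
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
# U+000B (line tabulation) is not an HTML5 space character
$mech->update_html(qq'<a href="\x0b">link</a>');
say length $mech->links->[0]->URI->as_string; # 0 (link value was stripped to empty)
# U+3000 (ideographic space) is not an HTML5 space character either
$mech->update_html(qq'<a href="\x{3000}">link</a>');
say length $mech->links->[0]->URI->as_string; # 0 (link value was stripped to empty)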

According to the HTML5 spec, the space characters are /[\x09\x0a\x0c\x0d\x20]/:

https://www.w3.org/TR/html52/infrastructure.html#infrastructure-urls
A string is a valid URL potentially surrounded by spaces if, after stripping leading and trailing white space from it, it is a valid URL.
A string is a valid non-empty URL potentially surrounded by spaces if, after stripping leading and trailing white space from it, it is a valid non-empty URL.

Re: stripping leading and trailing white space
https://www.w3.org/TR/html52/infrastructure.html#strip-leading-and-trailing-white-space
When a user agent is to strip leading and trailing white space from a string, the user agent must remove all space characters that are at the start or end of the string.

Re: space characters
https://www.w3.org/TR/html52/infrastructure.html#space-characters
The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).

URI->new() is causing this. As its documentation says, it removes whitespace characters (\s), and the set matched by \s depends on the version of the Unicode spec that each version of Perl conforms to.
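
For illustration, here is a minimal sketch of stripping only the five space characters listed in the spec text quoted above. The helper name strip_html5_space is hypothetical; it is not part of WWW::Mechanize or URI, and this is not a proposed patch.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Mechanize or URI): strip only the
# five HTML5 "space characters" from both ends of an attribute value.
sub strip_html5_space {
    my ($value) = @_;
    $value =~ s/\A[\x09\x0a\x0c\x0d\x20]+//;
    $value =~ s/[\x09\x0a\x0c\x0d\x20]+\z//;
    return $value;
}

# U+0020, U+0009 and U+000A are HTML5 space characters, so they are stripped.
my $plain = strip_html5_space("\x20\x09/path\x0a");

# U+000B and U+3000 are not, so they survive as the link value.
my $vt   = strip_html5_space("\x0b");
my $wide = strip_html5_space("\x{3000}");

printf "%d %d %d\n", length $plain, length $vt, length $wide;   # prints "5 1 1"
```

Passing the result to URI->new() would still apply the \s stripping described above, so this only shows which characters the spec allows to be removed.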
