Description
Leading and trailing whitespace characters that are not HTML5 space characters (for example U+000B or U+3000) are also removed from the link value when space characters are stripped, so extracting or following the link fails.
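use v5.10;    # enables say
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
# U+000B (vertical tab) is not an HTML5 space character, yet it is stripped
$mech->update_html(qq'<a href="\x0b">link</a>');
say length $mech->links->[0]->URI->as_string; # 0
# U+3000 (ideographic space) is stripped as well
$mech->update_html(qq'<a href="\x{3000}">link</a>');
say length $mech->links->[0]->URI->as_string; # 0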
According to the HTML5 spec, space characters are /[\x09\x0a\x0c\x0d\x20]/:
https://www.w3.org/TR/html52/infrastructure.html#infrastructure-urls
A string is a valid URL potentially surrounded by spaces if, after stripping leading and trailing white space from it, it is a valid URL.
A string is a valid non-empty URL potentially surrounded by spaces if, after stripping leading and trailing white space from it, it is a valid non-empty URL.
Re: stripping leading and trailing white space
https://www.w3.org/TR/html52/infrastructure.html#strip-leading-and-trailing-white-space
When a user agent is to strip leading and trailing white space from a string, the user agent must remove all space characters that are at the start or end of the string.
Re: space characters
https://www.w3.org/TR/html52/infrastructure.html#space-characters
The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).
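The stripping rule quoted above can be sketched in Perl as follows. This is only an illustration: `strip_html_space` is a hypothetical helper name, not something provided by WWW::Mechanize or URI.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WWW::Mechanize or URI): strips only the
# five HTML5 space characters from both ends of a string.
sub strip_html_space {
    my ($str) = @_;
    $str =~ s/\A[\x09\x0a\x0c\x0d\x20]+//;    # leading HTML5 space characters
    $str =~ s/[\x09\x0a\x0c\x0d\x20]+\z//;    # trailing HTML5 space characters
    return $str;
}

# Surrounding HTML5 space characters are removed: prints 5 (for "/path")
printf "%d\n", length(strip_html_space(" \t/path\r\n"));
# U+000B and U+3000 are not HTML5 space characters, so both survive: prints 1
printf "%d\n", length(strip_html_space("\x0b"));
printf "%d\n", length(strip_html_space("\x{3000}"));
```

With a rule like this, the two hrefs from the report would keep a length of 1 instead of collapsing to an empty string.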
URI->new() is causing this. As its documentation says, it removes leading and trailing white space characters (\s), and what \s matches depends on the version of the Unicode spec that each version of Perl conforms to.
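To make the mismatch concrete, here is a small sketch that compares Perl's `\s` with the HTML5 space character class for the characters from the report. It uses only core regexes and does not call URI itself. `classify` is a hypothetical helper, and the results for U+000B assume Perl 5.18 or later, where `\s` includes the vertical tab.

```perl
use strict;
use warnings;

# HTML5 space characters, as listed in the spec.
my $html_space = qr/[\x09\x0a\x0c\x0d\x20]/;

# Hypothetical helper: reports which class a one-character string falls into.
sub classify {
    my ($char) = @_;
    return {
        perl_s     => ($char =~ /\A\s\z/          ? 1 : 0),
        html_space => ($char =~ /\A$html_space\z/ ? 1 : 0),
    };
}

for my $char ("\x20", "\x0b", "\x{3000}") {
    my $r = classify($char);
    printf "U+%04X  \\s=%d  html_space=%d\n",
        ord($char), $r->{perl_s}, $r->{html_space};
}
```

Under those assumptions, U+0020 falls into both classes, while U+000B and U+3000 match `\s` only. That is why a `\s`-based strip empties those href values.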