-
Notifications
You must be signed in to change notification settings - Fork 259
Proper normalize attribute value normalization #379
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
e45064f
to
401bb77
Compare
Suggestions :
TODO:
|
401bb77
to
9307786
Compare
538e5cd
to
ac7b67b
Compare
Codecov Report
@@ Coverage Diff @@
## master #379 +/- ##
==========================================
+ Coverage 61.37% 61.48% +0.10%
==========================================
Files 20 20
Lines 10157 10229 +72
==========================================
+ Hits 6234 6289 +55
- Misses 3923 3940 +17
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
I plan to continue working on this over the next week or two |
f206a71
to
1a138d6
Compare
@Mingun Questions: Currently we have the functions
|
|
I'm basically going off of the lack of any kind of discussion of attribute value normalization in the HTML living spec, and this discussion on stackoverflow https://html.spec.whatwg.org/multipage/dom.html#attributes
I agree
Can they? It looks like all these fields are private. But, I feel like this is still the best option. There is already The unfortunate thing will be, that the code between the two is almost the same, just enough so that it will be really annoying to duplicate. |
08c0eea
to
00a37a0
Compare
src/events/attributes.rs
Outdated
let codepoint = escapei::parse_number(entity, idx..end)?; | ||
escapei::push_utf8(&mut normalized, codepoint); | ||
} else if let Some(value) = custom_entities.and_then(|hm| hm.get(entity)) { | ||
// TODO: recursively apply entity substitution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Mingun Does the normal unescape()
function need to do this as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so
src/events/attributes.rs
Outdated
/// This will allocate unless the raw attribute value does not require normalization. | ||
/// | ||
/// See also [`normalized_value_with_custom_entities()`](#method.normalized_value_with_custom_entities) | ||
pub fn normalized_value(&'a self) -> Result<Cow<'a, [u8]>, EscapeError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, it will be valuable to add examples here. Actually, probably you can convert your new tests to doc examples and that will be enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we should add a decoder parameter to that (and similar) functions and decode first before normalization.
Maybe also try another approach: introduce a new type of always UTF-8 encoded attributes and add a Attribute::decode(&self) -> Result<Utf8Attribute>
function.
Instead of completely new type we could try to use const bool
generic parameter (stable since 1.51, current MSRV is 1.41.1, from memchr
)
src/events/attributes.rs
Outdated
let codepoint = escapei::parse_number(entity, idx..end)?; | ||
escapei::push_utf8(&mut normalized, codepoint); | ||
} else if let Some(value) = custom_entities.and_then(|hm| hm.get(entity)) { | ||
// TODO: recursively apply entity substitution |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can they? It looks like all these fields are private.
No, they can't. I confused with the ability to disable with_checks
in the middle of an iteration.
But, I feel like this is still the best option. There is already
attributes()
andhtml_attributes()
, it's only a matter of changing the types.
Ok, then let's do that.
The unfortunate thing will be, that the code between the two is almost the same, just enough so that it will be really annoying to duplicate.
If you talk about make_normalized_value
, you can convert it to a free function and use it in both API methods.
I did a cursory review of the changes and it looks fine. Would you prefer the commits squashed or kept separate? I'll address everything else tomorrow. |
I prefer to kept separate. It is somehow psychologically uncomfortable for review when one commit changes more than ~200 lines in each (or just several) files, even if half of them -- new tests. ¯_(ツ)_/¯ I think, that at least separating normalization method to it's own commit would be a good idea. This is pretty isolated thing which, however, is big enough. That commit needed to be updated:
|
7f55cd8
to
add31b6
Compare
add31b6
to
a72a441
Compare
f317d76
to
69a1934
Compare
69a1934
to
ff42db2
Compare
ff42db2
to
deed851
Compare
deed851
to
def940d
Compare
def940d
to
5817baf
Compare
How should one configure the limit? At some point it becomes unwieldy to keep all of this state external and provide it in each method call (we also have the XML / HTML divergence in attribute handling to consider). Should we consider keeping the state in
Also I've noticed that some implementations detect an entity loop immediately instead of processing until the recursion limit is reached. Should that be two separate errors in your opinion, or one error? |
@Mingun ^ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How should one configure the limit? At some point it becomes unwieldy to keep all of this state external and provide it in each method call (we also have the XML / HTML divergence in attribute handling to consider).
Most naturally would be have a new option in reader::Config
. That mean, we need somehow propagate it to the actual method. Some methods already takes Reader
, so it will simple for them. Maybe we just need to start from only those methods and add shortcuts only when them explicitly will requested.
Actually, I already though about storing Decoder
in the attribute itself (but because currently Attribute
is a struct with public fields it will be a breaking change and I not very like the idea of making that new decoder
field public, because it is implementation detail. Maybe in the end we will store already decoded data)
Also I've noticed that some implementations detect an entity loop immediately instead of processing until the recursion limit is reached.
Yes, of course we should return error as soon as we found loop or if recursion limit was exceeded.
Should that be two separate errors in your opinion, or one error?
Two different. libxml2 also have two different errors, as you could notice from your link: one is "Detected an entity reference loop", other is "Maximum entity nesting depth exceeded"
Obviously I haven't gotten around to finishing this If you feel inclined to do so, I wouldn't mind if you picked it up. If not, I'll get around to it eventually, I just haven't been doing a lot of coding in my free time and what I have done, was on a different project. |
c2d2fbd
to
5c0d5d6
Compare
cb2ef33
to
bb16e17
Compare
…ne Handling" section of XML 1.1 spec https://www.w3.org/TR/xml11/#sec-line-ends
…::decode` and `BytesRef::decode` methods
…ttribute-Value Normalization" section of XML 1.1. spec https://www.w3.org/TR/xml11/#AVNormalize
bb16e17
to
5bbaa90
Compare
/// Decodes using UTF-8 then unescapes the value. | ||
/// Returns the attribute value normalized as per [the XML specification]. | ||
/// | ||
/// Do not use this method with HTML attributes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dralley, you originally wrote this line of documentation. It seems to me, that we don't need two methods: one that performs normalization and one that doesn't. Why normalization methods should not be used for HTML attributes? Could you give a link for proof?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind. You already answered this question at the beginning of issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think, this warning is useless and we can just change behavior of existing methods instead of adding new or more precisely deprecate them in favor of new names.
Actually, I doubt that we need HTML style of attributes, because we anyway cannot correcly parse HTML. We could work with XHTML because this is XML with semantics. But event in XHTML HTML-style attributes are not allowed. So in reality we hardly ever get an Attribute
from HTML source and therefore will not need to get the value without normalization.
I think, in the future we should also remove html
feature. All that it does just includes bunch of HTML entities which increases size of binaries and compile time. I do not want to support actuality of this list, and those consumers who need HTML-entities can use the corresponding entity_resolver
.
Maybe we can keep ability to parse HTML-like attributes for XML-like-but-not-really-XML formats as long as it does not significantly affect parsing speed.
closes #371