Proper normalize attribute value normalization #379

dralley · 2022-04-03T01:16:16Z

closes #371

dralley · 2022-04-03T15:52:47Z

Suggestions:

- Move this functionality directly to Attribute
- Provide a fast path to get the raw value

TODO:

- Character reference & entity reference substitution with associated error handling
- Figure out what the API needs to look like

codecov-commenter · 2022-06-20T19:03:15Z

Codecov Report

Merging #379 (538e5cd) into master (e701c4d) will increase coverage by 0.10%.
The diff coverage is 86.95%.

❗ Current head 538e5cd differs from pull request most recent head ac7b67b. Consider uploading reports for the commit ac7b67b to get more accurate results

@@            Coverage Diff             @@
##           master     #379      +/-   ##
==========================================
+ Coverage   61.37%   61.48%   +0.10%     
==========================================
  Files          20       20              
  Lines       10157    10229      +72     
==========================================
+ Hits         6234     6289      +55     
- Misses       3923     3940      +17

Flag	Coverage Δ
unittests	`61.48% <86.95%> (+0.10%)`	⬆️

Impacted Files	Coverage Δ
src/errors.rs	`9.52% <ø> (-2.85%)`	⬇️
src/escapei.rs	`13.90% <0.00%> (ø)`
src/reader.rs	`88.36% <ø> (+0.94%)`	⬆️
src/events/attributes.rs	`94.12% <88.88%> (+3.45%)`	⬆️
src/lib.rs	`21.09% <0.00%> (-4.92%)`	⬇️
src/de/escape.rs	`65.15% <0.00%> (-1.28%)`	⬇️
src/de/seq.rs	`91.83% <0.00%> (-0.76%)`	⬇️
src/se/mod.rs	`93.81% <0.00%> (-0.01%)`	⬇️
src/writer.rs	`90.36% <0.00%> (+0.02%)`	⬆️
... and 4 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

dralley · 2022-06-22T17:53:39Z

I plan to continue working on this over the next week or two

dralley · 2022-06-23T02:16:42Z

@Mingun Questions:

Currently we have the functions unescaped_value, unescaped_value_with_custom_entities and their decode equivalents, which handle the unescaping part but don't implement the rest of the XML attribute-value normalization spec. I don't see any reason for those to continue to exist as far as XML is concerned, but for HTML they make some sense, since HTML only seems to do unescaping, without any other normalization of the value.

1. Does that sound accurate / align with your knowledge?
2. Do you think it would make more sense to stick with functional names, like normalized_value / unescaped_value, and rely on the documentation to tell users what they ought to be using, or switch to more descriptive names such as html_value / xml_value?
3. The behavior of Attributes depends on whether or not .html is set. Should we look at doing something similar here, or would that not be worth the additional complication?
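To make the difference between plain unescaping and attribute-value normalization concrete, here is a minimal sketch of the procedure from section 3.3.3 of the XML spec. The names `normalize_attr_value` and `resolve_reference` are hypothetical and are not quick-xml's API; the sketch handles only character references and the five predefined entities, and leaves out custom entities and the extra trimming applied to non-CDATA attribute types.

```rust
/// Hypothetical helper (not quick-xml's API): normalizes a raw attribute value.
/// Literal tab, LF and CR become a space; references are replaced by the
/// character they denote.
fn normalize_attr_value(raw: &str) -> Result<String, String> {
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // A literal CR LF pair is one line break, so it yields one space.
            '\r' => {
                if chars.peek() == Some(&'\n') {
                    chars.next();
                }
                out.push(' ');
            }
            '\t' | '\n' => out.push(' '),
            '&' => {
                // Collect everything up to the terminating ';'.
                let mut name = String::new();
                loop {
                    match chars.next() {
                        Some(';') => break,
                        Some(ch) => name.push(ch),
                        None => return Err(format!("unterminated reference: &{}", name)),
                    }
                }
                // The referenced character is appended as is: `&#xD;` stays a
                // carriage return and is NOT turned into a space.
                out.push(resolve_reference(&name)?);
            }
            other => out.push(other),
        }
    }
    Ok(out)
}

/// Hypothetical helper: resolves `#x41`, `#65` or one of the predefined entities.
fn resolve_reference(name: &str) -> Result<char, String> {
    if let Some(hex) = name.strip_prefix("#x") {
        let code = u32::from_str_radix(hex, 16).map_err(|e| e.to_string())?;
        return char::from_u32(code).ok_or_else(|| format!("invalid code point: {}", code));
    }
    if let Some(dec) = name.strip_prefix('#') {
        let code = dec.parse::<u32>().map_err(|e| e.to_string())?;
        return char::from_u32(code).ok_or_else(|| format!("invalid code point: {}", code));
    }
    match name {
        "lt" => Ok('<'),
        "gt" => Ok('>'),
        "amp" => Ok('&'),
        "apos" => Ok('\''),
        "quot" => Ok('"'),
        _ => Err(format!("unknown entity: {}", name)),
    }
}
```

An unescape-only function would return a literal tab or line break unchanged, which is the behavior the HTML case seems to want; the whitespace arms above are what XML adds on top.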

Mingun · 2022-06-23T18:28:36Z

1. I haven't studied how HTML attributes behave, so I'm relying on your understanding of the situation. If you include references to relevant resources in the documentation, I'll be able to learn something about it.
2. I think that functional names are better.
3. Probably we should invert things and introduce different types for XML / HTML attributes, then implement only the relevant methods on each. This will also solve some unpleasant things in the current API: we can change the htmlity of the attributes in the middle of iteration, which could probably lead to subtle bugs. I would like to avoid such dangerous usage.

dralley · 2022-06-25T03:10:01Z

I'm basically going off of the lack of any kind of discussion of attribute value normalization in the HTML living spec, and this discussion on Stack Overflow:

https://html.spec.whatwg.org/multipage/dom.html#attributes
https://html.spec.whatwg.org/multipage/syntax.html#attributes-2
https://stackoverflow.com/questions/63906320/html5-attribute-value-normalization

> I think that functional names are better

I agree.

> Probably we should invert things and introduce different types for XML / HTML attributes, then implement only the relevant methods on each. This will also solve some unpleasant things in the current API: we can change the htmlity of the attributes in the middle of iteration, which could probably lead to subtle bugs. I would like to avoid such dangerous usage.

Can they? It looks like all these fields are private. But I feel like this is still the best option. There is already attributes() and html_attributes(), so it's only a matter of changing the types.

The unfortunate thing is that the code for the two is almost the same, similar enough that it will be really annoying to duplicate.

dralley · 2022-07-04T17:20:01Z

src/events/attributes.rs

+                        let codepoint = escapei::parse_number(entity, idx..end)?;
+                        escapei::push_utf8(&mut normalized, codepoint);
+                    } else if let Some(value) = custom_entities.and_then(|hm| hm.get(entity)) {
+                        // TODO: recursively apply entity substitution


@Mingun Does the normal unescape() function need to do this as well?

Mingun · 2022-07-08T06:02:34Z

src/events/attributes.rs

+    /// This will allocate unless the raw attribute value does not require normalization.
+    ///
+    /// See also [`normalized_value_with_custom_entities()`](#method.normalized_value_with_custom_entities)
+    pub fn normalized_value(&'a self) -> Result<Cow<'a, [u8]>, EscapeError> {


I think it would be valuable to add examples here. You could probably convert your new tests to doc examples, and that would be enough.

Actually, we should add a decoder parameter to this (and similar) functions and decode first, before normalization.

Maybe also try another approach: introduce a new type for attributes that are always UTF-8 encoded and add an Attribute::decode(&self) -> Result<Utf8Attribute> function.

Instead of a completely new type, we could try to use a const bool generic parameter (stable since 1.51; the current MSRV is 1.41.1, from memchr).
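The const-generic idea can be sketched roughly as follows. Everything here is hypothetical (the `Attr` type, its `DECODED` parameter and all method names are made up for illustration), and UTF-8 validation stands in for decoding with the document's real encoding.

```rust
use std::borrow::Cow;

/// Hypothetical attribute type. `DECODED` records at the type level whether
/// `value` has already been checked to be UTF-8.
pub struct Attr<'a, const DECODED: bool> {
    key: &'a [u8],
    value: Cow<'a, [u8]>,
}

impl<'a, const DECODED: bool> Attr<'a, DECODED> {
    /// Available in both states.
    pub fn key(&self) -> &'a [u8] {
        self.key
    }
}

impl<'a> Attr<'a, false> {
    /// Raw attributes are the only ones that can be created directly.
    pub fn new(key: &'a [u8], value: Cow<'a, [u8]>) -> Self {
        Attr { key, value }
    }

    /// Stand-in for real decoding: only validates that the bytes are UTF-8,
    /// then moves them into the "decoded" state without copying.
    pub fn decode(self) -> Result<Attr<'a, true>, std::str::Utf8Error> {
        std::str::from_utf8(&self.value)?;
        Ok(Attr { key: self.key, value: self.value })
    }
}

impl<'a> Attr<'a, true> {
    /// Only exists on decoded attributes, so string-based methods
    /// (such as normalization) cannot be called on raw bytes by accident.
    pub fn value_str(&self) -> &str {
        // `decode()` is the only way to reach this state and it validated the bytes.
        std::str::from_utf8(&self.value).expect("validated in decode()")
    }
}
```

The appeal over two separate structs is that shared methods such as `key()` are written once, while state-specific methods live in the `false` or `true` impl block.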

Mingun · 2022-07-08T06:09:35Z

src/events/attributes.rs

+                        let codepoint = escapei::parse_number(entity, idx..end)?;
+                        escapei::push_utf8(&mut normalized, codepoint);
+                    } else if let Some(value) = custom_entities.and_then(|hm| hm.get(entity)) {
+                        // TODO: recursively apply entity substitution


Yes, I think so

Mingun

> Can they? It looks like all these fields are private.

No, they can't. I confused it with the ability to disable with_checks in the middle of an iteration.

> But I feel like this is still the best option. There is already attributes() and html_attributes(), so it's only a matter of changing the types.

OK, then let's do that.

> The unfortunate thing is that the code for the two is almost the same, similar enough that it will be really annoying to duplicate.

If you are talking about make_normalized_value, you can convert it to a free function and use it in both API methods.
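A rough sketch of that arrangement is shown below. The two attribute types, the `xml_whitespace` flag and the signature of `make_normalized_value` are assumptions made for illustration; the real function in the PR differs. Reference expansion is left out, and the input is assumed to have had its line breaks normalized already, so CR LF pairs are not treated specially.

```rust
use std::borrow::Cow;

/// Shared free function (hypothetical signature). With `xml_whitespace` set,
/// literal tab, LF and CR are replaced by spaces; otherwise the value is
/// returned untouched. Borrows when nothing has to change.
fn make_normalized_value(raw: &str, xml_whitespace: bool) -> Cow<'_, str> {
    let is_ws = |c: char| matches!(c, '\t' | '\n' | '\r');
    if !xml_whitespace || !raw.contains(is_ws) {
        // Fast path: no allocation.
        return Cow::Borrowed(raw);
    }
    Cow::Owned(raw.chars().map(|c| if is_ws(c) { ' ' } else { c }).collect())
}

/// Hypothetical XML attribute: exposes only the normalizing accessor.
pub struct XmlAttribute<'a> {
    raw: &'a str,
}

/// Hypothetical HTML attribute: exposes only the non-normalizing accessor.
pub struct HtmlAttribute<'a> {
    raw: &'a str,
}

impl<'a> XmlAttribute<'a> {
    pub fn normalized_value(&self) -> Cow<'a, str> {
        make_normalized_value(self.raw, true)
    }
}

impl<'a> HtmlAttribute<'a> {
    pub fn value(&self) -> Cow<'a, str> {
        make_normalized_value(self.raw, false)
    }
}
```

Each type stays a thin wrapper, so the duplication is limited to one-line method bodies.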

dralley · 2023-01-30T06:43:57Z

I did a cursory review of the changes and it looks fine. Would you prefer the commits squashed or kept separate?

I'll address everything else tomorrow.

Mingun · 2023-01-30T14:34:15Z

I prefer to keep them separate. It is somewhat psychologically uncomfortable to review when one commit changes more than ~200 lines in each of (or even just several) files, even if half of them are new tests. ¯_(ツ)_/¯

I think that at least separating the normalization method into its own commit would be a good idea. It is a pretty isolated change that is nevertheless big enough. That commit needs the following updates:

- Add tests for non-ASCII input.
- Introduce a new error kind and return it when depth becomes 0, which means that we have reached the recursion limit.
- (could be postponed) I would like the limit to be configurable.
- (could be postponed) Add a way to explicitly detect recursion (i.e. track the resolved entities and report which entity was defined recursively).
- (could be postponed) Add the ability to pass metainfo about entities around. That way we can report in the error where the erroneous entity is defined, if the resolver function provides that information.

dralley · 2023-11-12T16:21:22Z

> Introduce a new error kind and return it when depth becomes 0, which means that we have reached the recursion limit.

> (could be postponed) I would like the limit to be configurable.

How should one configure the limit? At some point it becomes unwieldy to keep all of this state external and provide it in each method call (we also have the XML / HTML divergence in attribute handling to consider).

Should we consider keeping the state in Reader and doing something along the lines of

- reader.normalize_attribute_value(attr), or
- attr.normalize_value_with(resolve_entity, reader) or attr.normalize_value_with(reader) (moving the resolve_entity into Reader entirely)?

Also, I've noticed that some implementations detect an entity loop immediately instead of processing until the recursion limit is reached. Should that be two separate errors in your opinion, or one error?

dralley · 2023-11-15T05:40:33Z

@Mingun ^

Mingun

> How should one configure the limit? At some point it becomes unwieldy to keep all of this state external and provide it in each method call (we also have the XML / HTML divergence in attribute handling to consider).

The most natural way would be to have a new option in reader::Config. That means we need to propagate it to the actual method somehow. Some methods already take a Reader, so it will be simple for them. Maybe we should start with only those methods and add shortcuts only when they are explicitly requested.

Actually, I have already thought about storing the Decoder in the attribute itself. But because Attribute is currently a struct with public fields, that would be a breaking change, and I don't much like the idea of making the new decoder field public, because it is an implementation detail. Maybe in the end we will store already decoded data.

> Also, I've noticed that some implementations detect an entity loop immediately instead of processing until the recursion limit is reached.

Yes, of course we should return an error as soon as we find a loop or the recursion limit is exceeded.

> Should that be two separate errors in your opinion, or one error?

Two different ones. libxml2 also has two different errors, as you could see from your link: one is "Detected an entity reference loop", the other is "Maximum entity nesting depth exceeded".
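A minimal sketch of recursive entity expansion that reports the two conditions separately is given below. The `EntityError` enum, the `expand` function and the `max_depth` parameter are hypothetical names chosen for illustration, and a `HashMap` stands in for whatever resolver the reader would provide. A loop is detected by keeping a stack of the entities currently being expanded.

```rust
use std::collections::HashMap;

/// Hypothetical error type with separate variants for loops and for depth.
#[derive(Debug, PartialEq)]
pub enum EntityError {
    /// An entity refers to itself, directly or through other entities.
    Loop(String),
    /// Nesting went deeper than the configured limit.
    DepthExceeded,
    Unknown(String),
    Unterminated,
}

/// Expands `&name;` references in `text`, allowing at most `max_depth`
/// levels of nested entities.
pub fn expand(
    text: &str,
    entities: &HashMap<&str, &str>,
    max_depth: usize,
) -> Result<String, EntityError> {
    let mut out = String::new();
    let mut stack = Vec::new();
    expand_into(text, entities, max_depth, &mut stack, &mut out)?;
    Ok(out)
}

fn expand_into<'e>(
    text: &str,
    entities: &HashMap<&'e str, &'e str>,
    depth_left: usize,
    stack: &mut Vec<&'e str>, // entities currently being expanded
    out: &mut String,
) -> Result<(), EntityError> {
    let mut rest = text;
    while let Some(amp) = rest.find('&') {
        out.push_str(&rest[..amp]);
        let after = &rest[amp + 1..];
        let semi = after.find(';').ok_or(EntityError::Unterminated)?;
        let name = &after[..semi];
        let (&key, &value) = entities
            .get_key_value(name)
            .ok_or_else(|| EntityError::Unknown(name.to_string()))?;
        // Loop check first: report the loop as soon as it closes.
        if stack.contains(&key) {
            return Err(EntityError::Loop(name.to_string()));
        }
        if depth_left == 0 {
            return Err(EntityError::DepthExceeded);
        }
        stack.push(key);
        expand_into(value, entities, depth_left - 1, stack, out)?;
        stack.pop();
        rest = &after[semi + 1..];
    }
    out.push_str(rest);
    Ok(())
}
```

Because an entity is popped from the stack once its expansion finishes, using the same entity twice side by side (`&b;&b;`) is fine; only a reference to an entity that is still being expanded counts as a loop.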

dralley · 2024-06-14T18:32:01Z

Obviously I haven't gotten around to finishing this.

If you feel inclined to do so, I wouldn't mind if you picked it up. If not, I'll get around to it eventually; I just haven't been doing a lot of coding in my free time, and what I have done was on a different project.

Mingun · 2025-07-16T15:12:16Z

src/events/attributes.rs

-    /// Decodes using UTF-8 then unescapes the value.
+    /// Returns the attribute value normalized as per [the XML specification].
+    ///
+    /// Do not use this method with HTML attributes.


@dralley, you originally wrote this line of documentation. It seems to me that we don't need two methods, one that performs normalization and one that doesn't. Why should the normalization methods not be used for HTML attributes? Could you give a link as proof?

Never mind. You already answered this question at the beginning of this thread.

I think this warning is useless, and we can just change the behavior of the existing methods instead of adding new ones or, more precisely, deprecate them in favor of new names.

Actually, I doubt that we need HTML-style attributes, because we cannot correctly parse HTML anyway. We could work with XHTML, because it is XML with semantics, but even in XHTML, HTML-style attributes are not allowed. So in reality we will hardly ever get an Attribute from an HTML source, and therefore will not need to get the value without normalization.

I think that in the future we should also remove the html feature. All it does is include a bunch of HTML entities, which increases binary size and compile time. I do not want to keep this list up to date, and those consumers who need HTML entities can use a corresponding entity_resolver.

Maybe we can keep the ability to parse HTML-like attributes for XML-like-but-not-really-XML formats, as long as it does not significantly affect parsing speed.

dralley force-pushed the attr-val-normalization branch 6 times, most recently from e45064f to 401bb77 Compare April 3, 2022 15:30

dralley force-pushed the attr-val-normalization branch from 401bb77 to 9307786 Compare April 3, 2022 18:01

dralley force-pushed the attr-val-normalization branch 5 times, most recently from 538e5cd to ac7b67b Compare June 20, 2022 18:55

dralley force-pushed the attr-val-normalization branch 2 times, most recently from f206a71 to 1a138d6 Compare June 23, 2022 01:52

dralley force-pushed the attr-val-normalization branch 4 times, most recently from 08c0eea to 00a37a0 Compare July 4, 2022 17:13

dralley commented Jul 4, 2022

Mingun reviewed Jul 8, 2022

Mingun mentioned this pull request Jul 9, 2022

Make attribute creation more uniform #413

Closed

dralley mentioned this pull request Jul 10, 2022

Closure-based unescaping with custom entities #415

Merged

Mingun mentioned this pull request Jan 30, 2023

Release 0.28.0 #549

Closed

13 tasks

dralley force-pushed the attr-val-normalization branch 2 times, most recently from 7f55cd8 to add31b6 Compare January 31, 2023 04:46

dralley force-pushed the attr-val-normalization branch from add31b6 to a72a441 Compare March 13, 2023 00:57

dralley force-pushed the attr-val-normalization branch 2 times, most recently from f317d76 to 69a1934 Compare June 19, 2023 22:48

dralley force-pushed the attr-val-normalization branch from 69a1934 to ff42db2 Compare July 10, 2023 19:40

dralley force-pushed the attr-val-normalization branch from ff42db2 to deed851 Compare August 11, 2023 01:23

dralley mentioned this pull request Oct 7, 2023

are there some stuff here that need some help? dralley/rpmrepo_metadata#2

Open

francisdb mentioned this pull request Oct 20, 2023

xml serde roundtrip loses CR/LF encoding #670

Open

dralley force-pushed the attr-val-normalization branch from deed851 to def940d Compare October 23, 2023 03:21

dralley commented Oct 23, 2023

dralley force-pushed the attr-val-normalization branch from def940d to 5817baf Compare November 12, 2023 05:52

Mingun reviewed Nov 15, 2023

dralley force-pushed the attr-val-normalization branch 3 times, most recently from c2d2fbd to 5c0d5d6 Compare June 14, 2024 18:44

Mingun force-pushed the attr-val-normalization branch 2 times, most recently from cb2ef33 to bb16e17 Compare July 5, 2025 10:15

Mingun and others added 4 commits July 13, 2025 00:38

Implement EOL normalization procedure as described in "2.11 End-of-Line Handling" section of XML 1.1 spec https://www.w3.org/TR/xml11/#sec-line-ends (72647ef)

Properly normalize EOL characters in `BytesText::decode`, `BytesCData::decode` and `BytesRef::decode` methods (8cfcbb5)

Implement an attribute normalization routine as described in "3.3.3 Attribute-Value Normalization" section of XML 1.1. spec https://www.w3.org/TR/xml11/#AVNormalize (df7df69)

More correctly normalize attribute values (5bbaa90)
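For readers unfamiliar with the end-of-line rules these commits refer to, the following is a small sketch of the translation described in section 2.11 of the XML 1.1 spec. The function name `normalize_eol` is hypothetical and this is not the code from the commits; it works on already decoded text and always allocates, which a real implementation would probably avoid.

```rust
/// Sketch of XML 1.1 end-of-line handling (section 2.11); not quick-xml's code.
/// CR LF, CR NEL, a lone CR, NEL (U+0085) and LS (U+2028) all become a single LF.
fn normalize_eol(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut chars = input.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '\r' => {
                // Swallow the second character of a CR LF or CR NEL pair.
                if matches!(chars.peek(), Some(&'\n') | Some(&'\u{85}')) {
                    chars.next();
                }
                out.push('\n');
            }
            '\u{85}' | '\u{2028}' => out.push('\n'),
            other => out.push(other),
        }
    }
    out
}
```

This step runs before attribute-value normalization, which is why a literal CR LF pair in an attribute value ends up as one space rather than two.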

Mingun force-pushed the attr-val-normalization branch from bb16e17 to 5bbaa90 Compare July 15, 2025 17:45

Mingun reviewed Jul 16, 2025
