-
-
Notifications
You must be signed in to change notification settings - Fork 7
Support for prolog of XML documents #132
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
eisenwave
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot!
The few comments I've left should be applied throughout the PR consistently.
Please add comments with links to the XML standard for the relevant constructs that are being matched, so it's easy to figure out what the code applies to.
I also wonder if it's really necessary to match these things in such detail. Remember the ulight is more about speed, simplicity, and robustness, rather than about matching the grammar and all its features with perfect accuracy.
src/main/cpp/lang/xml.cpp
Outdated
|
|
||
| bool valid = true; | ||
| for (std::size_t i = 0; i < str.size(); i++) { | ||
| auto [code_point, length] = utf8::decode_and_length_or_replacement(str.substr(i)); | ||
| valid = valid & is_xml_name(code_point); | ||
| } | ||
|
|
||
| return valid; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| bool valid = true; | |
| for (std::size_t i = 0; i < str.size(); i++) { | |
| auto [code_point, length] = utf8::decode_and_length_or_replacement(str.substr(i)); | |
| valid = valid & is_xml_name(code_point); | |
| } | |
| return valid; | |
| return utf8::all_of(str, [](char32_t c) { return is_xml_name(c); }); |
|
|
||
| [[nodiscard]] | ||
| std::size_t match_entity_reference(std::u8string_view str) | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| { | |
| { | |
| // https://www.w3.org/TR/xml/#sec-references |
src/main/cpp/lang/xml.cpp
Outdated
| static constexpr std::array<char8_t, 8> non_name_chars | ||
| = { u8'(', u8')', u8'|', u8'*', u8'+', u8'?', u8'>', ',' }; | ||
|
|
||
| constexpr auto is_after_name = [](std::u8string_view str) { | ||
| return std::ranges::find(non_name_chars, str.front()) != std::end(non_name_chars) | ||
| || match_whitespace(str); | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like something that should be handled in part in xml_chars.hpp.
| } | ||
|
|
||
| bool expect_default_att_decl() | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| { | |
| { | |
| // https://www.w3.org/TR/xml/#sec-attr-defaults |
src/main/cpp/lang/xml.cpp
Outdated
| advance(match_whitespace(remainder)); | ||
| } | ||
|
|
||
| return expect_attribute_value(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is problematic because it means you could already emit the tokens for #FIXED but then return false, indicating that nothing was matched.
expect_* functions should be "all or nothing", i.e. only return false if nothing was emitted.
A possible solution is to just say that once you've matched #FIXED, the result is true no matter whether you actually find an attribute value or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I think I just missed adding the return statement.
src/main/cpp/lang/xml.cpp
Outdated
| } | ||
| }; | ||
|
|
||
| auto highlight_string = [&]() { expect_attribute_value(); }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a misuse of the name highlight. It should just be called expect_string.
The prefix highlight_ is used for functions that don't do any string matching, but take already matched data as input and then just emit the corresponding highlights. For example, highlight_number.
src/main/cpp/lang/xml.cpp
Outdated
| return false; | ||
| } | ||
| emit_and_advance(attlist_decl_string.length(), Highlight_Type::name_macro); | ||
| advance(match_whitespace(remainder)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| advance(match_whitespace(remainder)); | |
| skip_whitespace(); |
It would be better to make some utility function for this thing that appears a in a lot of places.
| TEST(XML, match_name_permissive) | ||
| TEST(XML, match_entity_reference) | ||
| { | ||
|
|
||
| Name_Match_Result result; | ||
| result = match_name_permissive(u8"simple_name", [](std::u8string_view) { return false; }); | ||
| EXPECT_EQ(result.length, 11); | ||
| EXPECT_EQ(result.error_indicies.size(), 0); | ||
|
|
||
| result = match_name_permissive(u8"n&a&m&e", [](std::u8string_view) { return false; }); | ||
| EXPECT_EQ(result.length, 7); | ||
| EXPECT_TRUE(result.error_indicies.contains(1)); | ||
| EXPECT_TRUE(result.error_indicies.contains(3)); | ||
| EXPECT_TRUE(result.error_indicies.contains(5)); | ||
| EXPECT_EQ(match_entity_reference(u8"this is not a reference"), 0); | ||
| EXPECT_EQ(match_entity_reference(u8"%reference;"), 11); | ||
| EXPECT_EQ(match_entity_reference(u8"%reference; trailing"), 11); | ||
| EXPECT_EQ(match_entity_reference(u8"%re-f; illegal char"), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not also keep the old test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not find match_name_permissive in xml or xml.hpp. grep also did not find any match in include/ To my knowledge this function does not exist anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently, test_xml.cpp was never added to CMakeLists.txt, so the tests in there never ran.
src/main/cpp/lang/xml.cpp
Outdated
|
|
||
| bool expect_entity_decl() | ||
| { | ||
| advance(match_whitespace(remainder)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a general principle, leading whitespace should always be skipped by the surrounding code, not by the matched construct, unless whitespace is grammatically part of that construct.
src/main/cpp/lang/xml.cpp
Outdated
| } | ||
| emit_and_advance(1, Highlight_Type::symbol_punc); | ||
|
|
||
| while (expect_markup_decl()) { }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| while (expect_markup_decl()) { }; | |
| while (expect_markup_decl()) { } |
|
Thanks for the review. We could make it less detailed by only checking the general structure as |
|
I would be more comfortable with the simpler approach, at least at first. One of the goals is to not have "highlight jumps" as you type, which would happen if e.g. you type |
|
How much of a superset would be fine for you ? I think for https://www.w3.org/TR/xml/#NT-AttlistDecl it would be advantageous if we just match e.g Edit: I meant superset not subset. |
|
I implemented the highlighting slightly different. If you approve of the highlighting behavior I would clean up the code and provide further documentation. |
|
Thanks a lot. I'm having some difficulty finding time for this PR during Christmas; I'll probably deal with it this weekend |
eisenwave
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new behavior looks good. Thanks a lot. Can you clean up the remaining minor suggestions in the previous review?
I could probably also take care of these if you cannot find the time.
|
I think I have enough time to do it myself, maybe not today. I probably can finish it by the end of the week. |
* fix tests
|
I think I got everything from the first review, let me know if something is missing. |
|
Thanks, I just need to find the time to review and merge. I was really busy with another part of the code base up until last weekend, but I'm done now. |
This PR adds support for the prolog of XML documents.
Previously highlighting for
was not supported.
Note that in some places the highlighting is more permissive than the XML Standard (e.g in
PublicIDis matched withexpect_attribute_value).This PR is not complete yet since I have not tested everything, I will add tests soon.