@fkbnz fkbnz commented Nov 17, 2025

This PR adds support for the prolog of XML documents.
Previously highlighting for

<?xml version=12312.00 encoding=UTF-8>
<!DOCTYPE xml-doc [
    <!ENTITY doc "doc">
   <!NOTATION my-notation PUBLIC "notation">
]>

was not supported.
Note that in some places the highlighting is more permissive than the XML standard (e.g. PublicID is matched with expect_attribute_value).
This PR is not complete yet since I have not tested everything; I will add tests soon.

@eisenwave eisenwave linked an issue Nov 17, 2025 that may be closed by this pull request
@fkbnz fkbnz marked this pull request as ready for review November 22, 2025 08:43
Owner

@eisenwave eisenwave left a comment

Thanks a lot!

The few comments I've left should be applied throughout the PR consistently.

Please add comments with links to the XML standard for the relevant constructs that are being matched, so it's easy to figure out what the code applies to.

I also wonder if it's really necessary to match these things in such detail. Remember that ulight is more about speed, simplicity, and robustness than about matching the grammar and all its features with perfect accuracy.

Comment on lines 47 to 54

bool valid = true;
for (std::size_t i = 0; i < str.size(); i++) {
    auto [code_point, length] = utf8::decode_and_length_or_replacement(str.substr(i));
    valid = valid & is_xml_name(code_point);
}

return valid;
Owner

Suggested change
- bool valid = true;
- for (std::size_t i = 0; i < str.size(); i++) {
-     auto [code_point, length] = utf8::decode_and_length_or_replacement(str.substr(i));
-     valid = valid & is_xml_name(code_point);
- }
- return valid;
+ return utf8::all_of(str, [](char32_t c) { return is_xml_name(c); });
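
For illustration, here is a minimal, self-contained sketch of what a code-point-wise "all of" over UTF-8 text might look like. Everything in it is a hypothetical stand-in: `decode_one`, `all_of_code_points`, and the deliberately coarse `is_name_char_simplified` are not ulight's `utf8::all_of` or `is_xml_name`, and the sketch uses `std::string_view` instead of `std::u8string_view` only to stay portable across language standards. The point it shows is that the loop advances by the decoded length, whereas the loop in the original snippet steps one code unit at a time and never uses `length`.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical result type: one decoded code point plus the number of code units it used.
struct Decoded {
    char32_t code_point;
    std::size_t length;
};

// Simplified UTF-8 decoder (assumption: no checks for overlong forms or surrogates).
// Malformed or truncated input yields U+FFFD with a length of 1.
inline Decoded decode_one(std::string_view str)
{
    const Decoded replacement { char32_t(0xFFFD), 1 };
    if (str.empty()) {
        return replacement;
    }
    const auto b0 = static_cast<unsigned char>(str[0]);
    if (b0 < 0x80) {
        return { char32_t(b0), 1 };
    }
    std::size_t length = 0;
    char32_t code_point = 0;
    if ((b0 & 0xE0) == 0xC0) {
        length = 2;
        code_point = char32_t(b0 & 0x1F);
    } else if ((b0 & 0xF0) == 0xE0) {
        length = 3;
        code_point = char32_t(b0 & 0x0F);
    } else if ((b0 & 0xF8) == 0xF0) {
        length = 4;
        code_point = char32_t(b0 & 0x07);
    } else {
        return replacement;
    }
    if (str.size() < length) {
        return replacement;
    }
    for (std::size_t i = 1; i < length; ++i) {
        const auto b = static_cast<unsigned char>(str[i]);
        if ((b & 0xC0) != 0x80) {
            return replacement;
        }
        code_point = (code_point << 6) | char32_t(b & 0x3F);
    }
    return { code_point, length };
}

// Returns true if the predicate holds for every code point; stops at the first failure.
template <typename Predicate>
bool all_of_code_points(std::string_view str, Predicate predicate)
{
    while (!str.empty()) {
        const Decoded decoded = decode_one(str);
        if (!predicate(decoded.code_point)) {
            return false;
        }
        // Advance by the whole code point, not by a single code unit.
        str.remove_prefix(decoded.length);
    }
    return true;
}

// Coarse stand-in for a name-character test (assumption: every non-ASCII code point passes).
inline bool is_name_char_simplified(char32_t c)
{
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z') || (c >= U'0' && c <= U'9')
        || c == U'_' || c == U':' || c == U'-' || c == U'.' || c >= 0x80;
}
```

Besides being shorter at the call site, a helper of this shape also returns early on the first non-matching code point instead of scanning the rest of the string.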


[[nodiscard]]
std::size_t match_entity_reference(std::u8string_view str)
{
Owner

Suggested change
- {
+ {
+     // https://www.w3.org/TR/xml/#sec-references

Comment on lines 163 to 169
static constexpr std::array<char8_t, 8> non_name_chars
    = { u8'(', u8')', u8'|', u8'*', u8'+', u8'?', u8'>', ',' };

constexpr auto is_after_name = [](std::u8string_view str) {
    return std::ranges::find(non_name_chars, str.front()) != std::end(non_name_chars)
        || match_whitespace(str);
};
Owner

This feels like something that should be handled in part in xml_chars.hpp.

}

bool expect_default_att_decl()
{
Owner

Suggested change
- {
+ {
+     // https://www.w3.org/TR/xml/#sec-attr-defaults

advance(match_whitespace(remainder));
}

return expect_attribute_value();
Owner

This is problematic because it means you could already emit the tokens for #FIXED but then return false, indicating that nothing was matched.

expect_* functions should be "all or nothing", i.e. only return false if nothing was emitted.

A possible solution is to just say that once you've matched #FIXED, the result is true no matter whether you actually find an attribute value or not.
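
A minimal sketch of that "all or nothing" contract, under stated assumptions: `Sketch_Highlighter`, `Token`, `expect_quoted_string`, and `expect_fixed_default` are hypothetical names for illustration and do not mirror the PR's actual code. Once `#FIXED` has been emitted, the function reports success whether or not a value follows.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

enum class Token_Kind { keyword, string };

struct Token {
    std::size_t begin;
    std::size_t length;
    Token_Kind kind;
};

// Hypothetical, heavily reduced highlighter state.
struct Sketch_Highlighter {
    std::string_view remainder;
    std::size_t index = 0;
    std::vector<Token> out;

    explicit Sketch_Highlighter(std::string_view source)
        : remainder(source)
    {
    }

    void advance(std::size_t n)
    {
        remainder.remove_prefix(n);
        index += n;
    }

    void emit_and_advance(std::size_t n, Token_Kind kind)
    {
        out.push_back({ index, n, kind });
        advance(n);
    }

    void skip_whitespace()
    {
        while (!remainder.empty()
               && (remainder.front() == ' ' || remainder.front() == '\t'
                   || remainder.front() == '\r' || remainder.front() == '\n')) {
            advance(1);
        }
    }

    // All or nothing: returns false only if no token was emitted.
    bool expect_quoted_string()
    {
        if (remainder.empty()) {
            return false;
        }
        const char quote = remainder.front();
        if (quote != '"' && quote != '\'') {
            return false;
        }
        const std::size_t end = remainder.find(quote, 1);
        // An unterminated string is permissively highlighted to the end of input.
        const std::size_t length = end == std::string_view::npos ? remainder.size() : end + 1;
        emit_and_advance(length, Token_Kind::string);
        return true;
    }

    bool expect_fixed_default()
    {
        constexpr std::string_view fixed = "#FIXED";
        if (remainder.substr(0, fixed.size()) != fixed) {
            return false; // nothing emitted yet, so false is safe
        }
        emit_and_advance(fixed.size(), Token_Kind::keyword);
        skip_whitespace();
        // The result is ignored on purpose: "#FIXED" was already emitted,
        // so this function must report success either way.
        (void)expect_quoted_string();
        return true;
    }
};
```

With this shape, a caller can rely on `false` meaning "the cursor has not moved and nothing was emitted", which keeps fallback logic simple.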

Author

I agree, I think I just missed adding the return statement.

}
};

auto highlight_string = [&]() { expect_attribute_value(); };
Owner

This is a misuse of the name highlight. It should just be called expect_string.

The prefix highlight_ is used for functions that don't do any string matching, but take already matched data as input and then just emit the corresponding highlights. For example, highlight_number.
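
To make the naming split concrete, here is a small sketch under assumptions: `match_digits`, `Naming_Sketch`, and the `Span` type are hypothetical, and this `highlight_number` is only an illustration of the convention, not ulight's function of that name. A `match_*` function only inspects text, a `highlight_*` function only emits for already matched input, and an `expect_*` function combines the two.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

struct Span {
    std::size_t begin;
    std::size_t length;
};

// match_*: pure matching with no side effects; returns the matched length or 0.
inline std::size_t match_digits(std::string_view str)
{
    std::size_t length = 0;
    while (length < str.size() && str[length] >= '0' && str[length] <= '9') {
        ++length;
    }
    return length;
}

// Hypothetical, heavily reduced highlighter state.
struct Naming_Sketch {
    std::string_view remainder;
    std::size_t index = 0;
    std::vector<Span> numbers;

    explicit Naming_Sketch(std::string_view source)
        : remainder(source)
    {
    }

    // highlight_*: performs no matching; the caller has already matched `length` code units.
    void highlight_number(std::size_t length)
    {
        numbers.push_back({ index, length });
        remainder.remove_prefix(length);
        index += length;
    }

    // expect_*: matches and, on success, emits; returns false only if nothing was emitted.
    bool expect_number()
    {
        const std::size_t length = match_digits(remainder);
        if (length == 0) {
            return false;
        }
        highlight_number(length);
        return true;
    }
};
```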

return false;
}
emit_and_advance(attlist_decl_string.length(), Highlight_Type::name_macro);
advance(match_whitespace(remainder));
Owner

Suggested change
- advance(match_whitespace(remainder));
+ skip_whitespace();

It would be better to make a utility function for this, since it appears in a lot of places.
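
One possible shape for such a utility, as a sketch only: the `Cursor` type is hypothetical, and this `match_whitespace` is a simplified stand-in that accepts the four XML whitespace characters (space, tab, carriage return, line feed).

```cpp
#include <cstddef>
#include <string_view>

// XML whitespace: space, tab, carriage return, line feed.
constexpr bool is_xml_whitespace(char c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

// Returns the length of the leading whitespace run (possibly 0).
constexpr std::size_t match_whitespace(std::string_view str)
{
    std::size_t length = 0;
    while (length < str.size() && is_xml_whitespace(str[length])) {
        ++length;
    }
    return length;
}

// Hypothetical cursor over the source text.
struct Cursor {
    std::string_view remainder;
    std::size_t index = 0;

    explicit Cursor(std::string_view source)
        : remainder(source)
    {
    }

    void advance(std::size_t n)
    {
        remainder.remove_prefix(n);
        index += n;
    }

    // Single place for the repeated advance(match_whitespace(remainder)) pattern.
    void skip_whitespace()
    {
        advance(match_whitespace(remainder));
    }
};
```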

Comment on lines -20 to +25
- TEST(XML, match_name_permissive)
+ TEST(XML, match_entity_reference)
  {

-     Name_Match_Result result;
-     result = match_name_permissive(u8"simple_name", [](std::u8string_view) { return false; });
-     EXPECT_EQ(result.length, 11);
-     EXPECT_EQ(result.error_indicies.size(), 0);
-
-     result = match_name_permissive(u8"n&a&m&e", [](std::u8string_view) { return false; });
-     EXPECT_EQ(result.length, 7);
-     EXPECT_TRUE(result.error_indicies.contains(1));
-     EXPECT_TRUE(result.error_indicies.contains(3));
-     EXPECT_TRUE(result.error_indicies.contains(5));
+     EXPECT_EQ(match_entity_reference(u8"this is not a reference"), 0);
+     EXPECT_EQ(match_entity_reference(u8"%reference;"), 11);
+     EXPECT_EQ(match_entity_reference(u8"%reference; trailing"), 11);
+     EXPECT_EQ(match_entity_reference(u8"%re-f; illegal char"), 0);
Owner

Why not also keep the old test?

Author

I could not find match_name_permissive in xml or xml.hpp, and grep did not find any match in include/ either. To my knowledge, this function does not exist anymore.

Owner

Apparently, test_xml.cpp was never added to CMakeLists.txt, so the tests in there never ran.


bool expect_entity_decl()
{
advance(match_whitespace(remainder));
Owner

As a general principle, leading whitespace should always be skipped by the surrounding code, not by the matched construct, unless whitespace is grammatically part of that construct.
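
A small sketch of that division of responsibility, with hypothetical names throughout (`Decl_Sketch`, `expect_entity_keyword`, `consume_all`): the matcher only succeeds when its construct starts at the current position, and the surrounding loop is the one that skips whitespace between constructs.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical, heavily reduced matcher state.
struct Decl_Sketch {
    std::string_view remainder;
    std::size_t matched = 0;

    explicit Decl_Sketch(std::string_view source)
        : remainder(source)
    {
    }

    void skip_whitespace()
    {
        while (!remainder.empty()
               && (remainder.front() == ' ' || remainder.front() == '\t'
                   || remainder.front() == '\r' || remainder.front() == '\n')) {
            remainder.remove_prefix(1);
        }
    }

    // Matches "<!ENTITY" only at the very start of the remainder.
    // It does not skip leading whitespace; that is the caller's job.
    bool expect_entity_keyword()
    {
        constexpr std::string_view keyword = "<!ENTITY";
        if (remainder.substr(0, keyword.size()) != keyword) {
            return false;
        }
        remainder.remove_prefix(keyword.size());
        ++matched;
        return true;
    }

    // Surrounding code: skips whitespace before each attempt.
    void consume_all()
    {
        while (true) {
            skip_whitespace();
            if (!expect_entity_keyword()) {
                break;
            }
        }
    }
};
```

Keeping the skipping in one place means each `expect_*` function can be reasoned about and tested on input that begins exactly at its construct.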

}
emit_and_advance(1, Highlight_Type::symbol_punc);

while (expect_markup_decl()) { };
Owner

Suggested change
- while (expect_markup_decl()) { };
+ while (expect_markup_decl()) { }

@fkbnz
Author

fkbnz commented Nov 22, 2025

Thanks for the review.

We could make it less detailed by only checking the general structure, since entity_decl, attlist_decl, etc. have a similar structure (i.e. they start with <! followed by some keyword followed by a name). That would also highlight XML that is not valid, e.g. <!abcdefg would be highlighted just as <!ENTITY.

@eisenwave
Owner

I would be more comfortable with the simpler approach, at least at first.

One of the goals is to not have "highlight jumps" as you type, which would happen if e.g. you type <!ENTIT and then it goes from some bad highlight to the right one when you add the Y. Therefore, it's really not that bad to match a bit of a superset instead of being really precise.
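
As a sketch of what matching such a superset could look like (an assumption for illustration, not the code that ended up in the PR): the hypothetical `match_declaration_start_permissive` accepts `<!` followed by any run of ASCII letters, so a partially typed keyword is treated the same as a complete one.

```cpp
#include <cstddef>
#include <string_view>

constexpr bool is_ascii_alpha(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

// Returns the length of "<!" plus the following run of ASCII letters,
// or 0 if the input does not start with "<!" followed by at least one letter.
// No check is made that the letters spell a keyword defined by XML.
constexpr std::size_t match_declaration_start_permissive(std::string_view str)
{
    if (str.size() < 3 || str[0] != '<' || str[1] != '!') {
        return 0;
    }
    std::size_t length = 2;
    while (length < str.size() && is_ascii_alpha(str[length])) {
        ++length;
    }
    return length > 2 ? length : 0;
}
```

Because `<!ENTIT` and `<!ENTITY` both match, the highlight does not change category while the final letter is being typed; the cost is that something like `<!abcdefg` is highlighted as well.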

@fkbnz
Author

fkbnz commented Dec 18, 2025

How much of a superset would be fine for you? I think for https://www.w3.org/TR/xml/#NT-AttlistDecl it would be advantageous if we just match e.g. IDREF or #REQUIRED wherever we can, even though it might not be in the correct place.
This would also make the highlighting more aggressive.

Edit: I meant superset not subset.

@fkbnz fkbnz marked this pull request as draft December 20, 2025 17:36
@fkbnz
Author

fkbnz commented Dec 20, 2025

I implemented the highlighting slightly differently. If you approve of the highlighting behavior, I will clean up the code and provide further documentation.

@fkbnz fkbnz requested a review from eisenwave December 24, 2025 16:09
@eisenwave
Owner

Thanks a lot. I'm having some difficulty finding time for this PR during Christmas; I'll probably deal with it this weekend.

Owner

@eisenwave eisenwave left a comment

The new behavior looks good. Thanks a lot. Can you clean up the remaining minor suggestions in the previous review?

I could probably also take care of these if you cannot find the time.

@fkbnz
Author

fkbnz commented Dec 31, 2025

I think I have enough time to do it myself, though maybe not today. I can probably finish it by the end of the week.

@fkbnz fkbnz marked this pull request as ready for review January 7, 2026 16:11
@fkbnz fkbnz requested a review from eisenwave January 10, 2026 11:00
@eisenwave eisenwave added the inaccurate-highlight For syntax highlighting that is not wrong, but could be more granular, categorize identifiers, etc. label Jan 10, 2026
@fkbnz
Author

fkbnz commented Jan 12, 2026

I think I got everything from the first review; let me know if something is missing.

@eisenwave
Owner

Thanks, I just need to find the time to review and merge. I was really busy with another part of the code base up until last weekend, but I'm done now.

Development

Successfully merging this pull request may close these issues.

[XML] Add Highlighting for prolog

2 participants