Support for prolog of XML documents #132

fkbnz · 2025-11-17T17:48:47Z

This PR adds support for the prolog of XML documents.
Previously highlighting for

<?xml version=12312.00 encoding=UTF-8>
<!DOCTYPE xml-doc [
    <!ENTITY doc "doc">
   <!NOTATION my-notation PUBLIC "notation">
]>

was not supported.
Note that in some places the highlighting is more permissive than the XML standard (e.g. PublicID is matched with expect_attribute_value).
This PR is not complete yet since I have not tested everything; I will add tests soon.

eisenwave

Thanks a lot!

The few comments I've left should be applied throughout the PR consistently.

Please add comments with links to the XML standard for the relevant constructs that are being matched, so it's easy to figure out what the code applies to.

I also wonder if it's really necessary to match these things in such detail. Remember that ulight is more about speed, simplicity, and robustness than about matching the grammar and all its features with perfect accuracy.

eisenwave · 2025-11-22T11:04:00Z

src/main/cpp/lang/xml.cpp

+
+    bool valid = true;
+    for (std::size_t i = 0; i < str.size(); i++) {
+        auto [code_point, length] = utf8::decode_and_length_or_replacement(str.substr(i));
+        valid = valid & is_xml_name(code_point);
+    }
+
+    return valid;


Suggested change

-    bool valid = true;
-    for (std::size_t i = 0; i < str.size(); i++) {
-        auto [code_point, length] = utf8::decode_and_length_or_replacement(str.substr(i));
-        valid = valid & is_xml_name(code_point);
-    }
-
-    return valid;
+    return utf8::all_of(str, [](char32_t c) { return is_xml_name(c); });

eisenwave · 2025-11-22T11:11:14Z

src/main/cpp/lang/xml.cpp


+[[nodiscard]]
+std::size_t match_entity_reference(std::u8string_view str)
+{


Suggested change

-{
+{
+    // https://www.w3.org/TR/xml/#sec-references

eisenwave · 2025-11-22T11:16:13Z

src/main/cpp/lang/xml.cpp

+        static constexpr std::array<char8_t, 8> non_name_chars
+            = { u8'(', u8')', u8'|', u8'*', u8'+', u8'?', u8'>', ',' };
+
+        constexpr auto is_after_name = [](std::u8string_view str) {
+            return std::ranges::find(non_name_chars, str.front()) != std::end(non_name_chars)
+                || match_whitespace(str);
+        };


This feels like something that should be handled in part in xml_chars.hpp.

eisenwave · 2025-11-22T11:18:14Z

src/main/cpp/lang/xml.cpp

+    }
+
+    bool expect_default_att_decl()
+    {


Suggested change

-    {
+    {
+        // https://www.w3.org/TR/xml/#sec-attr-defaults

eisenwave · 2025-11-22T11:22:02Z

src/main/cpp/lang/xml.cpp

+            advance(match_whitespace(remainder));
+        }
+
+        return expect_attribute_value();


This is problematic because it means you could already emit the tokens for #FIXED but then return false, indicating that nothing was matched.

expect_* functions should be "all or nothing", i.e. only return false if nothing was emitted.

A possible solution is to just say that once you've matched #FIXED, the result is true no matter whether you actually find an attribute value or not.

I agree, I think I just missed adding the return statement.

eisenwave · 2025-11-22T11:34:04Z

src/main/cpp/lang/xml.cpp

+            }
+        };
+
+        auto highlight_string = [&]() { expect_attribute_value(); };


This is a misuse of the name highlight. It should just be called expect_string.

The prefix highlight_ is used for functions that don't do any string matching, but take already matched data as input and then just emit the corresponding highlights. For example, highlight_number.

eisenwave · 2025-11-22T11:35:51Z

src/main/cpp/lang/xml.cpp

+            return false;
+        }
+        emit_and_advance(attlist_decl_string.length(), Highlight_Type::name_macro);
+        advance(match_whitespace(remainder));


Suggested change

-        advance(match_whitespace(remainder));
+        skip_whitespace();

It would be better to make a utility function for this, since it appears in a lot of places.

eisenwave · 2025-11-22T11:37:09Z

src/test/cpp/test_xml.cpp

-TEST(XML, match_name_permissive)
+TEST(XML, match_entity_reference)
 {
-
-    Name_Match_Result result;
-    result = match_name_permissive(u8"simple_name", [](std::u8string_view) { return false; });
-    EXPECT_EQ(result.length, 11);
-    EXPECT_EQ(result.error_indicies.size(), 0);
-
-    result = match_name_permissive(u8"n&a&m&e", [](std::u8string_view) { return false; });
-    EXPECT_EQ(result.length, 7);
-    EXPECT_TRUE(result.error_indicies.contains(1));
-    EXPECT_TRUE(result.error_indicies.contains(3));
-    EXPECT_TRUE(result.error_indicies.contains(5));
+    EXPECT_EQ(match_entity_reference(u8"this is not a reference"), 0);
+    EXPECT_EQ(match_entity_reference(u8"%reference;"), 11);
+    EXPECT_EQ(match_entity_reference(u8"%reference; trailing"), 11);
+    EXPECT_EQ(match_entity_reference(u8"%re-f; illegal char"), 0);


Why not also keep the old test?

I could not find match_name_permissive in xml or xml.hpp. grep also did not find any match in include/. To my knowledge, this function does not exist anymore.

Apparently, test_xml.cpp was never added to CMakeLists.txt, so the tests in there never ran.

eisenwave · 2025-11-22T11:45:30Z

src/main/cpp/lang/xml.cpp

+
+    bool expect_entity_decl()
+    {
+        advance(match_whitespace(remainder));


As a general principle, leading whitespace should always be skipped by the surrounding code, not by the matched construct, unless whitespace is grammatically part of that construct.

eisenwave · 2025-11-22T11:47:23Z

src/main/cpp/lang/xml.cpp

+        }
+        emit_and_advance(1, Highlight_Type::symbol_punc);
+
+        while (expect_markup_decl()) { };


Suggested change

-        while (expect_markup_decl()) { };
+        while (expect_markup_decl()) { }

fkbnz · 2025-11-22T15:49:19Z

Thanks for the review.

We could make it less detailed by only checking the general structure, since entity_decl, attlist_decl, etc. have a similar structure (i.e. they start with <! followed by some keyword followed by a name). That would also highlight XML that is not valid, e.g. <!abcdefg would be highlighted just like <!ENTITY.

eisenwave · 2025-11-22T16:01:59Z

I would be more comfortable with the simpler approach, at least at first.

One of the goals is to not have "highlight jumps" as you type, which would happen if e.g. you type <!ENTIT and then it goes from some bad highlight to the right one when you add the Y. Therefore, it's really not that bad to match a bit of a superset instead of being really precise.

fkbnz · 2025-12-18T21:00:46Z

How much of a superset would be fine for you? I think for https://www.w3.org/TR/xml/#NT-AttlistDecl it would be advantageous if we just match e.g. IDREF or #REQUIRED wherever we can, even if it might not be in the correct place.
This would also make the highlighting more aggressive.

Edit: I meant superset not subset.

fkbnz · 2025-12-20T17:38:06Z

I implemented the highlighting slightly differently. If you approve of the highlighting behavior, I will clean up the code and provide further documentation.

eisenwave · 2025-12-24T16:54:50Z

Thanks a lot. I'm having some difficulty finding time for this PR during Christmas; I'll probably deal with it this weekend.

eisenwave

The new behavior looks good. Thanks a lot. Can you clean up the remaining minor suggestions in the previous review?

I could probably also take care of these if you cannot find the time.

fkbnz · 2025-12-31T13:34:56Z

I think I have enough time to do it myself, maybe not today. I probably can finish it by the end of the week.

fkbnz · 2026-01-12T18:42:47Z

I think I got everything from the first review, let me know if something is missing.

eisenwave · 2026-01-13T08:21:36Z

Thanks, I just need to find the time to review and merge. I was really busy with another part of the code base up until last weekend, but I'm done now.

support for prolog of xml documents

16042ec

eisenwave linked an issue Nov 17, 2025 that may be closed by this pull request

[XML] Add Highlighting for prolog #80

Open

fkbnz added 3 commits November 20, 2025 22:23

xml highlighting tests, highlighting related fixes

e52581f

fix tests, format code

09c3463

fix trailing whitespace

a9ae13e

fkbnz marked this pull request as ready for review November 22, 2025 08:43

eisenwave requested changes Nov 22, 2025

implement requested changes

face51d

fkbnz marked this pull request as draft December 20, 2025 17:36

make highlighting more robust

f6499aa

fkbnz requested a review from eisenwave December 24, 2025 16:09

cleanup, fix tests, improve highlighting

9b63fa3

eisenwave reviewed Dec 31, 2025

implement remaining changes

c20b31f

* fix tests

fkbnz marked this pull request as ready for review January 7, 2026 16:11

fkbnz requested a review from eisenwave January 10, 2026 11:00

eisenwave added the inaccurate-highlight label ("For syntax highlighting that is not wrong, but could be more granular, categorize identifiers, etc.") Jan 10, 2026
