Adding support for "role" attributes for the DocBook reader #10665

yanntrividic · 2025-03-04T22:30:53Z

As the title says, this PR adds support for parsing role attributes in the DocBook reader. This enhancement was outlined in #9089. The PR implements a solution that was discussed in that issue, by @jgm and @lifeunleaded in particular.

I haven't written any unit tests for this, as I couldn't find a unit test file for the DocBook reader and wasn't ready to start one... I hope that won't block this PR from being merged!

This change didn't feel like it required a modification to the MANUAL.txt file.

Don't hesitate to tell me if I should adapt something. From my understanding, it is ready to be integrated into the main branch :)

Started working on the support for the "role" attributes, as mentioned in jgm#9089. It is still missing wrapping Para element by Div element with attributes when necessary. It also needs testing.

Added a wrapper function parseMixedWithRole around the parseMixed function, that checks if a role attribute is on the parsed element. If so, the element is wrapped in a Div with this attribute, and if not, only parseMixed is applied.

jgm · 2025-03-05T17:31:39Z

Thanks for the PR! I wonder if we couldn't simplify things considerably by using the function
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Shared.hs#L293-L302
Using this, you could just addAttributes [("role", therole)] on any pandoc Inlines or Blocks.
This might allow you to just have one place where the role is checked, instead of separate places for each type of element.

lifeunleaded · 2025-03-06T05:49:41Z

Excellent addition, thank you! 61ff730 should point you to the DocBook reader tests, but I could take a stab at adding some if you don’t feel comfortable doing it.

yanntrividic · 2025-03-06T15:07:59Z

Hello, thank you both for your help!

@jgm, thank you for pointing me in this direction. Indeed it seems to be exactly what we need. I updated the code to work in this direction. Is it what you expected?

@lifeunleaded, I don't know how I missed those files... I'm not sure how we should tackle this though. Should we add in each "zone" of the code a test with a new node that has a role attribute? Tbh, if you feel like you are fine taking care of it, I wouldn't mind at all.

lifeunleaded · 2025-03-06T15:46:14Z

@yanntrividic I can take a look within a few days, as soon as I know what the changes do. I'll let @jgm determine whether that should block the PR. Otherwise I can submit a separate PR for tests.

lifeunleaded · 2025-03-06T17:38:00Z

@yanntrividic I think it would be a good idea to just add a role attribute to some intended supported elements in test/docbook-reader.docbook and then run e.g. stack test --test-arguments='-p docbook -j4 --hide-successes' to get a view of what it expects. I just added a quick test role to the sect1 and got unexpected results. Not sure if my setup has issues or if the code needs adjusting. I'll try to look deeper soon.

lifeunleaded · 2025-03-06T19:38:19Z

Building the branch and running ~/.local/bin/pandoc -f docbook -t native ~/sandbox/test.xml with this input:

<?xml version="1.0" encoding="utf-8" ?>
<article xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns:mml="http://www.w3.org/1998/Math/MathML" version="5.0">
    <title>Pandoc Test Suite</title>

    <section id="headers" role="sect1role">
      <title>title1</title>
      <para>Test</para>
      <section><title>title2</title>
      <para>Test2</para>
      </section>
    </section>
</article>

Gives me this:

[ Header
    1
    ( "headers"
    , []
    , [ ( "role" , "sect1role" ) , ( "role" , "sect1role" ) ]
    )
    [ Str "title1" ]
, Div
    ( ""
    , []
    , [ ( "wrapper" , "1" ) , ( "role" , "sect1role" ) ]
    )
    [ Para [ Str "Test" ] ]
, Header
    2
    ( "" , [] , [ ( "role" , "sect1role" ) ] )
    [ Str "title2" ]
, Div
    ( ""
    , []
    , [ ( "wrapper" , "1" ) , ( "role" , "sect1role" ) ]
    )
    [ Para [ Str "Test2" ] ]
]

It seems to put role twice in the actual element, and then on all the child elements. I assume that's not intended?

jgm · 2025-03-06T20:59:23Z

src/Text/Pandoc/Readers/DocBook.hs

@@ -1329,6 +1334,9 @@ parseInline (Elem e) =
        -- <?asciidor-br?> to in handleInstructions, above.
        "pi-asciidoc-br" -> return linebreak
        _          -> skip >> innerInlines id
+  return $ case qName (elName e) of
+    "emphasis" -> parsedInline


why this special case for "emphasis"?

I tried something, which failed, now I need to figure out a way to handle this.

Currently, Pandoc supports role attributes for some Inlines. They are used there to select between Strong, Strikeout, Underline and Emph elements. All of those elements need a wrapper before attributes can be added to them. But if we try to apply this updated code to:

<emphasis role="strong">word</emphasis>

We get:

Span ( "" , [] , [ ( "wrapper" , "1" ) , ( "role" , "strong" ) ] ) [ Strong [ Str "word" ] ]

Which is not what we would want... So my previous attempt in f0827ee was to circumvent this issue, but then no emphasis DocBook element would have attributes, which is not what we want either.

Oh, right. Well, you're just preventing a role attribute from going on something parsed from an emphasis tag, and that's fine given that we already handle the roles in another way.

But then, that would mean that we can't have any role value other than bf, bold, strong, strikethrough, or underline on emphasis elements? Maybe in that case the role attributes should be assigned to phrase elements... That's a compromise I'm willing to make, yes, if you feel it makes sense.

If other roles are used, then in the part of the code that checks for role on emphasis and gives you Strong, Underline or whatever in response, you could also check for other roles and simply add them as attributes.

From what I understand from the documentation, I think that DocBook documents expect only one role attribute per element. To specify more precise roles, it is recommended to parameterize a pattern to do so, but you still only have one role attribute.

So, knowing this, I think that it shouldn't be possible to have a role on a Strong element, because that would imply that there are necessarily two role attributes on the DocBook element, which shouldn't be possible.

And in that case, we would only need to change the last line from this part:

"emphasis" -> case attrValue "role" e of
                 "bf"            -> innerInlines strong
                 "bold"          -> innerInlines strong
                 "strong"        -> innerInlines strong
                 "strikethrough" -> innerInlines strikeout
                 "underline"     -> innerInlines underline
                 _               -> innerInlines emph

Does it make sense like this? What do you think?

@yanntrividic Not strictly true, docbook would allow multiple tokens in a role attribute, so e.g. <emphasis role="bold special"> is entirely valid. However, I think that is an extreme edge case and I've never seen it used. I think it would be perfectly fine to only process the known role tokens for emphasis as you describe.

I see, you are right... Well I guess it would not be so much work to support this feature, but in that case we would support only one pattern, that is to say, space-separated values for role attributes. I think at this point, I would prefer leaving this for a filter to handle.
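To illustrate the space-separated case discussed here, the following is a minimal sketch, not code from the PR. The helper name splitRole is hypothetical; the list of known tokens is the one named in this thread. It only shows how a role value such as "bold special" could be split into the tokens the reader already understands and the remaining ones.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.List (partition)
import Data.Text (Text)
import qualified Data.Text as T

-- Role tokens that, according to this thread, the reader already maps
-- to dedicated inline constructors on <emphasis>.
knownEmphasisRoles :: [Text]
knownEmphasisRoles = ["bf", "bold", "strong", "strikethrough", "underline"]

-- Hypothetical helper: split a role value on whitespace and separate
-- the known styling tokens (first component) from the other tokens
-- (second component).
splitRole :: Text -> ([Text], [Text])
splitRole = partition (`elem` knownEmphasisRoles) . T.words

main :: IO ()
main = do
  print (splitRole "bold special")
  print (splitRole "special")
```

Whether the leftover tokens should be kept as one attribute value or dropped is exactly the question left open above; the sketch takes no position on it.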

On another note, I've been trying to work on something to discriminate the last case from the others... Not sure how to do it properly though... Here is my attempt: 3703757.

I know there is a parsing error in the code, but as I said, I'm quite new to Haskell, and I am really not sure how I could make this kind of type assertion.

Basically, the thought process here is that if the element is an emphasis element, and that has been parsed as an Emph element, then we can add a role attribute. But I'm really not sure what would be the right way to check the second condition. Any ideas?

Edit: Oops, it is better with this line corrected: 06214b6

yanntrividic · 2025-03-09T15:48:25Z

Hey @lifeunleaded, thanks for taking the time to dig into some tests. You're right with what you mention in #10665 (comment), it is not intended... And I'm not sure how this recursion happens.

With the updated code, here is what we get:

<?xml version="1.0" encoding="utf-8" ?>
<article xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns:mml="http://www.w3.org/1998/Math/MathML" version="5.0">
    <title>Pandoc Test Suite</title>
    <section id="headers" role="sect1role">
      <title>title1</title>
      <para role="para1role">Test</para>
      <section role="sect2role"><title>title2</title>
      <para role="para2role">Test2</para>
      </section>
    </section>
</article>

[ Header
    1
    ( "headers" , [] , [ ( "role" , "sect1role" ) ] )
    [ Str "title1" ]
, Div
    ( ""
    , []
    , [ ( "role" , "sect1role" )
      , ( "wrapper" , "1" )
      , ( "role" , "para1role" )
      ]
    )
    [ Para [ Str "Test" ] ]
, Header
    2
    ( ""
    , []
    , [ ( "role" , "sect1role" ) , ( "role" , "sect2role" ) ]
    )
    [ Str "title2" ]
, Div
    ( ""
    , []
    , [ ( "role" , "sect1role" )
      , ( "role" , "sect2role" )
      , ( "wrapper" , "1" )
      , ( "role" , "para2role" )
      ]
    )
    [ Para [ Str "Test2" ] ]
]

I really don't get how the attribute propagates in the child elements, it just doesn't make sense to me. Any hints on your side? Maybe this is something obvious, and if so, I am sorry. I gotta tell you that I'm quite new to this, as those are my very first lines of Haskell 😅

jgm · 2025-03-09T18:46:49Z

Take a look at jgm/commonmark-hs , in particular commonmark-pandoc, which is where the addAttributes used in addPandocAttributes is defined.
You'll see

instance HasAttributes (Cm a B.Blocks) where
  addAttributes attrs b = fmap (addBlockAttrs attrs) <$> b

instance HasAttributes (Cm a B.Inlines) where
  addAttributes attrs il = fmap (addInlineAttrs attrs) <$> il

So addAttributes, applied to a Blocks (which can be a sequence of Block elements), will set the attribute in every Block in the sequence. I think that's what is going on here, does it help?
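The effect can be reproduced in isolation with pandoc-types. This is a simplified sketch under stated assumptions: addRoleToBlock is a hypothetical stand-in for the per-block step, written only to mimic the native output shown earlier in this thread (a Header takes the pair directly, anything else gets a wrapper Div); it is not the commonmark-pandoc implementation.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Text (Text)
import Text.Pandoc.Builder

-- Hypothetical stand-in for the per-block step: a Header takes the
-- key/value pairs directly; any other block is wrapped in a Div.
addRoleToBlock :: [(Text, Text)] -> Block -> Block
addRoleToBlock kvs (Header n (ident, cls, old) ils) =
  Header n (ident, cls, old ++ kvs) ils
addRoleToBlock kvs blk = Div ("", [], ("wrapper", "1") : kvs) [blk]

-- Mapping over the whole sequence touches every block in it.
addRoleToBlocks :: [(Text, Text)] -> Blocks -> Blocks
addRoleToBlocks kvs = fmap (addRoleToBlock kvs)

-- A flattened section: the Header followed by its children as siblings.
section :: Blocks
section = headerWith ("headers", [], []) 1 (str "title1") <> para (str "Test")

main :: IO ()
main = mapM_ print (toList (addRoleToBlocks [("role", "sect1role")] section))
```

Both printed blocks carry the role: the Header directly and the Para through its wrapper Div, because the function is mapped over every element of the sequence.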

yanntrividic · 2025-03-10T10:12:22Z

So, regarding this propagation of role attributes to the children, you think it is fine like this? Wouldn't we want to have only the role attribute on the element that holds it?

jgm · 2025-03-10T15:27:14Z

So, regarding this propagation of role attributes to the children, you think it is fine like this? Wouldn't we want to have only the role attribute on the element that holds it?

No, it isn't okay. It's important to figure out why that is happening and avoid it.

lifeunleaded · 2025-03-10T19:11:59Z

Given that addPandocAttributes applies to Blocks (a list of blocks), I'm not sure it can be applied so generally here (although I don't fully understand how it gets a Blocks in parseBlock).

If it were me, I would go through the list in parsedBlock and try to be a bit more granular. For the ones without attributes, there is divWith (like informalequation already uses), and for some there are other places to apply it. E.g. for sections, one could extend sect n to be:

sect n = sectWith (attrValue "id" e) [] (getRoleAttr e) n

This seems to pass the trivial beginning of test coverage I've tried here.

A more elegant solution would perhaps be polymorphic functions for adding the role attribute, defined separately for inlines and blocks with and without attributes, but that's not something I can whip up an example of here, if indeed it is even possible. That function name could then be used throughout.

lifeunleaded · 2025-03-10T19:28:18Z

Worth noting that phrase and indexterm in parseInline already appear to handle role. I wonder whether applying addPandocAttributes to the result would also cause strange effects there.

jgm · 2025-03-10T21:24:57Z

Here's why you're getting the role applied to all the children of section:

pandoc/src/Text/Pandoc/Readers/DocBook.hs

Line 1107 in 17d3ad2

    
           return $ headerWith (elId, classes, maybeToList titleabbrevElAsAttr++attrs) n' headerText <> b

The sect function returns a Blocks that is a sequence of Block elements starting with the Header and including all the children.

Perhaps it should put all of this in a Div, either always or at least in the case where there is a role attribute.

Then the role will just be attached to the Div.

Pandoc sometimes uses this structure for sections:

Div with identifier and classes "section" and "level1" (or "level2" etc.)
- Header
- other contents

So that would be quite reasonable in this case.
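A minimal sketch of that structure with pandoc-types, assuming a hypothetical helper named sectionDiv (it is not part of the reader): the header and the body are concatenated first, and the role is attached once, to the enclosing Div.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Text (Text)
import Text.Pandoc.Builder

-- Hypothetical helper: enclose a section's header and body in one Div
-- and put the role there, leaving the children untouched.
sectionDiv :: Maybe Text -> Blocks -> Blocks -> Blocks
sectionDiv mbRole header body =
  divWith ("", ["section"], roleAttr) (header <> body)
  where
    roleAttr = maybe [] (\r -> [("role", r)]) mbRole

main :: IO ()
main = mapM_ print . toList $
  sectionDiv (Just "sect1role")
             (headerWith ("headers", [], []) 1 (str "title1"))
             (para (str "Test"))
```

The result is a single Div block, so nothing is mapped over the children and neither the Header nor the Para receives the role.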

lifeunleaded · 2025-03-11T17:49:10Z

If I may be so bold: I think the DocBook writer is based on the role being in the Header attributes, so the roundtrip becomes a bit awkward if the reader creates an enclosing Div instead.

I would suggest moving the getRoleAttr into sect and using addPandocAttributes for the cases that do not use sect.

So sect becomes sect n = sectWith (attrValue "id" e) [] (getRoleAttr e) n and e.g. para uses this pattern in parseBlock:

parseBlock (Elem e) = do
  parsedBlock <- case qName (elName e) of
        "toc"   -> skip -- skip TOC, since in pandoc it's autogenerated
        "index" -> skip -- skip index, since page numbers meaningless
        "para"  -> addPandocAttributes (getRoleAttr e) <$> parseMixed para (elContent e)

This becomes more work, since it needs to be applied for each case in parseBlock not using sect, but the upside is that the behaviour can be more based on which element it is.

jgm · 2025-03-11T20:45:47Z

The docbook writer can handle this structure just fine:

% pandoc -f native -t docbook
[ Div
    ( "" , [ "section" ] , [] )
    [ Header 1 ( "test" , [] , [] ) [ Str "Test" ]
    , Para [ Str "paragraph" ]
    ]
]
<section xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="test">
  <title>Test</title>
  <para>
    paragraph
  </para>
</section>

lifeunleaded · 2025-03-11T20:54:41Z

Oh! Excellent, thank you, I wasn't aware and didn't manage to test before your reply. Then I have nothing to add, enclosing sect in Div with the role sounds like the best way forward.

yanntrividic · 2025-03-12T09:25:43Z

Worth noting that phrase and indexterm in parseInline already appear to handle role. I wonder whether applying addPandocAttributes to the result would also cause strange effects there.

With those elements, the role attributes are parsed as classes. For consistency, wouldn't it make more sense to just remove this so that every role is treated as an attribute?

lifeunleaded · 2025-03-12T11:33:36Z

With those elements, the role attributes are parsed as classes. For consistency, wouldn't it make more sense to just remove this so that every role is treated as an attribute?

Yes, I agree that would be better. Unless we want to be very careful about not breaking existing filters out there that expect it in classes. If we do, I suppose there is no great problem with leaving them in classes and still adding them to attributes. @jgm , any opinion?

yanntrividic · 2025-03-12T13:47:20Z

With those elements, the role attributes are parsed as classes. For consistency, wouldn't it make more sense to just remove this so that every role is treated as an attribute?

Yes, I agree that would be better. Unless we want to be very careful about not breaking existing filters out there that expect it in classes. If we do, I suppose there is no great problem with leaving them in classes and still adding them to attributes. @jgm , any opinion?

I have a working solution for this--I think--but I'll wait for the third opinion on the matter before committing anything!

jgm · 2025-03-12T15:46:29Z

Probably safest to keep them as classes and also add them as roles. But I don't know if anyone is actually relying on this behavior. If the duplication is bad, we could mark it as a potentially breaking change in the changelog.
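As a small illustration of the "both" option, here is a sketch with pandoc-types. The helper name roleSpan is hypothetical and the snippet does not reproduce the reader's actual handling of phrase or indexterm; it only shows a Span that carries the role value as a class and as a key/value attribute at the same time.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Text (Text)
import Text.Pandoc.Builder

-- Hypothetical helper: keep the role as a class (the behaviour existing
-- filters might rely on) and also expose it as a "role" attribute.
roleSpan :: Text -> Inlines -> Inlines
roleSpan roleValue =
  spanWith ("", [roleValue], [("role", roleValue)])

main :: IO ()
main = print (toList (roleSpan "special" (str "word")))
```

A filter matching on the class keeps working, and one matching on the attribute works as well, at the cost of the duplication mentioned above.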

lifeunleaded · 2025-03-12T18:50:29Z

@yanntrividic pulled recent changes to start looking at tests, but it doesn't build right now:

pandoc                           > /home/erik/git/personal/pandoc/src/Text/Pandoc/Readers/DocBook.hs:1342:25: error: [GHC-83865]
pandoc                           >     • Couldn't match expected type ‘Inline’ with actual type ‘Element’
pandoc                           >     • In the expression: e :: Inline
pandoc                           >       In an equation for ‘inlineElement’: inlineElement = e :: Inline
pandoc                           >       In the expression:
pandoc                           >         do let inlineElement = ...
pandoc                           >            if not (isEmphElement inlineElement) then
pandoc                           >                addPandocAttributes (getRoleAttr e) parsedInline
pandoc                           >            else
pandoc                           >                parsedInline
pandoc                           >      |
pandoc                           > 1342 |     let inlineElement = e :: Inline
pandoc                           >      |                         ^

yanntrividic · 2025-03-12T20:19:52Z

@yanntrividic pulled recent changes to start looking at tests, but it doesn't build right now:

Yes sorry! That's what I tried to explain here: #10665 (comment), I think I'm close to a solution (that's why I committed this and made this comment).

lifeunleaded · 2025-03-12T20:51:15Z

@yanntrividic pulled recent changes to start looking at tests, but it doesn't build right now:

Yes sorry! That's what I tried to explain here: #10665 (comment), I think I'm close to a solution (that's why I committed this and made this comment).

No worries. I see the problem and don't have an immediate solution. If you get stuck, maybe it's easier to apply addPandocAttributes (or use spanWith directly) in the case matches in parseInline and just omit it on "emphasis", if that was the idea.

Good luck and thank you for the effort.

lifeunleaded · 2025-03-12T20:53:08Z

Also, if you don't already, I recommend running stack test before pushing to identify issues more quickly than the pipeline does. E.g. stack test --test-arguments='-p docbook -j4 --hide-successes' will run fairly quickly (if you use haskell-stack).

jgm · 2025-03-13T02:00:04Z

The problem is with

    let inlineElement = e :: Inline

e here does not have type Inline, it has type Element.
Inline is a pandoc AST concept, Element is a type from the xml parser.

yanntrividic · 2025-03-13T10:51:44Z

No worries. I see the problem and don't have an immediate solution. If you get stuck, maybe it's easier to apply addPandocAttributes (or use spanWith directly) in the case matches in parseInline and just omit it on "emphasis", if that was the idea.

Okay! So I tried just that, and it seems to do what you proposed. I added a class emphasis on the Span so that this information is not just gone, as this is done with other DocBook elements. What do you think? Should we try to make it into an Emph element or is it fine as such?

yanntrividic · 2025-03-13T10:54:35Z

Also, if you don't already, I recommend running stack test before pushing to identify issues more quickly than the pipeline does. E.g. stack test --test-arguments='-p docbook -j4 --hide-successes' will run fairly quickly (if you use haskell-stack).

Ah yes, thank you! Up to this point, I was mainly doing a full build and trying out different files... It is a lot more efficient! (I'm using cabal, so it is cabal test --test-options='-p docbook -j4 --hide-successes').

If you two feel it is all right like this, I can continue the work further and adapt the unit tests.

lifeunleaded · 2025-03-13T18:48:26Z

No worries. I see the problem and don't have an immediate solution. If you get stuck, maybe it's easier to apply addPandocAttributes (or use spanWith directly) in the case matches in parseInline and just omit it on "emphasis", if that was the idea.

Okay! So I tried just that, and it seems to do what you proposed. I added a class emphasis on the Span so that this information is not just gone, as this is done with other DocBook elements. What do you think? Should we try to make it into an Emph element or is it fine as such?

I think it's important that the existing mappings that create Strong, Emph, and so on are left intact, yes.

lifeunleaded · 2025-03-13T19:25:04Z

After a quick test and a look at the latest change, it seems to work as I would expect, so I think we could start writing tests.

@yanntrividic Would you like me to write up a few and push to your fork, or do you feel comfortable starting them yourself?

yanntrividic · 2025-03-14T10:40:27Z

While trying to write a few tests, I realized that what I did in a232161 wasn't really working as expected, and Headers were getting the role attribute alongside the Div that wrapped them. It is now solved in fa815ce.

And now that I am facing those tests for real, I am not sure where to start. Maybe @lifeunleaded you could show me the way and I will continue if needed?

lifeunleaded · 2025-03-15T13:41:50Z

@yanntrividic I started looking at the failing tests, and it looks like emphasis without role just becomes the Span with role, but the inlines are just the strings, not an Emph. I would expect that emphasis without role becomes Emph, but potentially within a Span with the role attribute. Is this something that could be fixed?

lifeunleaded · 2025-03-15T18:17:39Z

src/Text/Pandoc/Readers/DocBook.hs

@@ -1329,7 +1325,8 @@ parseInline (Elem e) = do
                             "strong"        -> innerInlines strong
                             "strikethrough" -> innerInlines strikeout
                             "underline"     -> innerInlines underline
-                             _               -> innerInlines emph
+                             _               -> innerInlines $


This seems to create a span with the role attribute but then just the string content. I think it needs to become an Emph to not break existing writers too much. If it's difficult to achieve, I think reverting the _ case to innerInlines emph (and not getting the Span for this particular case) is better than not creating an Emph

Yes, this is what I thought you meant in your first comment. As I said, I am not sure how to achieve what we want there... So if nobody has an idea for pulling that off without too much effort, I think it is fine to accept that if you want a specific role on an Emph, it can be put on a phrase element instead. What do you think @jgm?

I don't understand the context; you'd have to explain the issue to me more fully.

@jgm As far as I can discern from testing, <emphasis>emphasis</emphasis> (without a role attribute or with a role attribute that is not bf, strong, bold, strikethrough, or underline) becomes Span ( "" , [ "emphasis" ] , [] ) [ Str "emphasis" ]. I would prefer Span ( "" , [ "emphasis" ] , [] ) [ Emph [ Str "emphasis" ] ] or, failing that, a revert to the current behavior, Emph [ Str "emphasis" ]. I fear that not creating the Emph will break a lot of conversions.

@yanntrividic My apologies if I've made an unclear comment. I may have misunderstood the intentions of a particular change at some point. The overall change is a really useful one and I think we're close to a good implementation.

No problem! Though yes, I tried again to work around those types for a bit, but I just can't figure out a nice way to wrap the Emph in a Span. I'm fine with reverting the changes if nobody has a better proposition :)

I'm a bit lost here: surely, <emphasis>emphasis</emphasis> should be converted as simply Emph [ Str "emphasis" ]. I'm not sure why one would even consider the Span conversion? I may be missing more of the context, though?

Ok, so let's start again from the beginning.

By default, addPandocAttributes is now applied to all the Inline elements to add a role attribute, if present. But for a few DocBook elements, and especially here for emphasis, the role attribute was already taken into account.

For the emphasis element, the role attribute is to discriminate Strong, Strikeout, Underline and Emph. So if we apply the addPandocAttributes function after doing that, we can get outputs such as:

Span ( "" , [] , [ ( "wrapper" , "1" ) , ( "role" , "strong" ) ] ) [ Strong [ Str "word" ] ]

Which is unnecessary.

On the other hand, we would like to be able to get outputs such as:

Span ( "" , [] , [ ( "wrapper" , "1" ) , ( "role" , "special" ) ] ) [ Emph [ Str "word" ] ]

Or even:

Span ( "" , [] , [ ( "wrapper" , "1" ) , ( "role" , "special" ) ] ) [ Strong [ Str "word" ] ]

But not (I think?):

Span ( "" , [] , [ ( "wrapper" , "1" ) , ( "role" , "special" ), ( "role" , "strong" ) ] ) [ Strong [ Str "word" ] ]

But if we want to get this output, we have to modify this bit of code.

But I don't know how to modify those lines (or modify others?) in a good way to achieve this, because the emph function's types work well with the innerInlines function, but not with the addPandocAttributes function.

Is it a bit clearer now @jgm?
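One possible way to express the behaviour described in this comment, written against pandoc-types only. This is a hedged sketch, not the reader's code: emphasisWithRole is a hypothetical helper, it takes already-parsed Inlines instead of going through innerInlines, and it omits the ("wrapper", "1") pair that addPandocAttributes produces in the outputs quoted above.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Text (Text)
import Text.Pandoc.Builder

-- Hypothetical helper: known roles pick the inline constructor and are
-- not repeated as attributes; any other role is kept on a Span that
-- wraps a plain Emph; no role at all gives a bare Emph.
emphasisWithRole :: Maybe Text -> Inlines -> Inlines
emphasisWithRole mbRole ils =
  case mbRole of
    Nothing              -> emph ils
    Just "bf"            -> strong ils
    Just "bold"          -> strong ils
    Just "strong"        -> strong ils
    Just "strikethrough" -> strikeout ils
    Just "underline"     -> underline ils
    Just other           -> spanWith ("", [], [("role", other)]) (emph ils)

main :: IO ()
main = do
  print (toList (emphasisWithRole Nothing (str "word")))
  print (toList (emphasisWithRole (Just "strong") (str "word")))
  print (toList (emphasisWithRole (Just "special") (str "word")))
```

Because spanWith and emph both map Inlines to Inlines, they compose directly once the children have been parsed, which sidesteps the type mismatch between emph and addPandocAttributes mentioned above. A role combining a known and an unknown token (the third desired output) is not covered by this sketch.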

yanntrividic added 4 commits February 19, 2025 23:18

Adding support for "role" attributes for the DocBook reader

f361555

Started working on the support for the "role" attributes, as mentioned in jgm#9089. It is still missing wrapping Para element by Div element with attributes when necessary. It also needs testing.

Making lines shorter than 80 characters.

193fa21

Merge branch 'jgm:main' into main

2ab3c3f

yanntrividic added 4 commits March 6, 2025 15:09

Modifying the approach following the advice in jgm#10665 (comment)

2af15a6

Merge branch 'jgm:main' into main

578c9c1

When inlines are of type "emphasis", don't add "role" attributes

f0827ee

Merge branch 'main' of https://github.com/yanntrividic/pandoc

57d61da