Adding support for "role" attributes for the DocBook reader #10665

Open
wants to merge 16 commits into
base: main
Commits
f361555
Adding support for "role" attributes for the DocBook reader
yanntrividic Feb 19, 2025
193fa21
Making lines shorter than 80 characters.
yanntrividic Feb 26, 2025
48f14ee
Wrapping elements with role attribute in a Div when needed
yanntrividic Mar 4, 2025
2ab3c3f
Merge branch 'jgm:main' into main
yanntrividic Mar 4, 2025
2af15a6
Modifying the approach following the advice in https://github.com/jgm…
yanntrividic Mar 6, 2025
578c9c1
Merge branch 'jgm:main' into main
yanntrividic Mar 6, 2025
f0827ee
When inlines are of type "emphasis", don't add "role" attributes
yanntrividic Mar 6, 2025
57d61da
Merge branch 'main' of https://github.com/yanntrividic/pandoc
yanntrividic Mar 6, 2025
007e0c8
Removing some code to avoid double execution of addPandocAttributes o…
yanntrividic Mar 9, 2025
f4109d6
Restoring the code from 2af15a6c425a16b5b59d117f43a6aab71b99f22e rega…
yanntrividic Mar 9, 2025
17d3ad2
Putting back again the discrimination between `emphasis` and other In…
yanntrividic Mar 10, 2025
a232161
Wrapping section in Div so that role attributes don't get propagated …
yanntrividic Mar 12, 2025
3703757
Attempt at solving parsing for emphasis elements (see https://github.…
yanntrividic Mar 12, 2025
06214b6
Got things mixed up, function addPandocAttributes added to the return…
yanntrividic Mar 12, 2025
e104e39
Removing attempt from 370375728f298a6f00340bb36eb5405214be35bd for a …
yanntrividic Mar 13, 2025
fa815ce
Headers were getting the role attributes, now only sections do!
yanntrividic Mar 14, 2025
29 changes: 21 additions & 8 deletions src/Text/Pandoc/Readers/DocBook.hs
@@ -44,7 +44,7 @@ import Text.Pandoc.Builder
import Text.Pandoc.Class.PandocMonad (PandocMonad, report)
import Text.Pandoc.Options
import Text.Pandoc.Logging (LogMessage(..))
-import Text.Pandoc.Shared (safeRead, extractSpaces)
+import Text.Pandoc.Shared (safeRead, extractSpaces, addPandocAttributes)
import Text.Pandoc.Sources (ToSources(..), sourcesToText)
import Text.Pandoc.Transforms (headerShift)
import Text.TeXMath (readMathML, writeTeX)
@@ -851,15 +851,19 @@ getBlocks :: PandocMonad m => Element -> DB m Blocks
getBlocks e = mconcat <$>
mapM parseBlock (elContent e)

+getRoleAttr :: Element -> [(Text, Text)] -- extract role attribute and add it to the attribute list
+getRoleAttr e = case attrValue "role" e of
+  "" -> []
+  r  -> [("role", r)]

parseBlock :: PandocMonad m => Content -> DB m Blocks
parseBlock (Text (CData CDataRaw _ _)) = return mempty -- DOCTYPE
parseBlock (Text (CData _ s _)) = if T.all isSpace s
then return mempty
else return $ plain $ trimInlines $ text s
parseBlock (CRef x) = return $ plain $ str $ T.toUpper x
-parseBlock (Elem e) =
-  case qName (elName e) of
+parseBlock (Elem e) = do
+  parsedBlock <- case qName (elName e) of
"toc" -> skip -- skip TOC, since in pandoc it's autogenerated
"index" -> skip -- skip index, since page numbers meaningless
"para" -> parseMixed para (elContent e)
@@ -973,6 +977,7 @@ parseBlock (Elem e) =
"title" -> return mempty -- handled in parent element
"subtitle" -> return mempty -- handled in parent element
_ -> skip >> getBlocks e
+  return $ addPandocAttributes (getRoleAttr e) parsedBlock
where skip = do
let qn = qName $ elName e
let name = if "pi-" `T.isPrefixOf` qn
@@ -1099,7 +1104,12 @@ parseBlock (Elem e) =
modify $ \st -> st{ dbSectionLevel = n }
b <- getBlocks e
modify $ \st -> st{ dbSectionLevel = n - 1 }
-  return $ headerWith (elId, classes, maybeToList titleabbrevElAsAttr++attrs) n' headerText <> b
+  let content = headerWith (elId, classes, maybeToList titleabbrevElAsAttr)
+                  n' headerText <> b
+  return $ case attrValue "role" e of
+    "" -> content
+    _  -> divWith ("", ["section"],
+            ("level", T.pack $ show n') : attrs) content
titleabbrevElAsAttr =
case filterChild (named "titleabbrev") e `mplus`
(filterChild (named "info") e >>=
@@ -1124,7 +1134,6 @@ parseBlock (Elem e) =
Nothing -> return b
Just t -> return $ divWith (attrValue "id" e,[],[])
(divWith ("", ["title"], []) (plain t) <> b)

-- Admonitions are parsed into a div. Following other Docbook tools that output HTML,
-- we parse the optional title as a div with the @title@ class, and give the
-- block itself a class corresponding to the admonition name.
@@ -1206,8 +1215,8 @@ parseInline :: PandocMonad m => Content -> DB m Inlines
parseInline (Text (CData _ s _)) = return $ text s
parseInline (CRef ref) =
return $ text $ fromMaybe (T.toUpper ref) $ lookupEntity ref
-parseInline (Elem e) =
-  case qName (elName e) of
+parseInline (Elem e) = do
+  parsedInline <- case qName (elName e) of
"anchor" -> do
return $ spanWith (attrValue "id" e, [], []) mempty
"phrase" -> do
@@ -1320,7 +1329,8 @@ parseInline (Elem e) =
"strong" -> innerInlines strong
"strikethrough" -> innerInlines strikeout
"underline" -> innerInlines underline
-    _ -> innerInlines emph
+    _ -> innerInlines $
Contributor

This seems to create a Span with the role attribute, but one that contains only the string content. I think the content needs to become an Emph so as not to break existing writers too much. If that is difficult to achieve, I think reverting the _ case to innerInlines emph (and not getting the Span for this particular case) is better than not creating an Emph.

Author (yanntrividic, Mar 15, 2025)

Yes, this is what I thought you meant in your first comment. As I said, I am not sure how to achieve what we want there. So if nobody sees a way to pull that off without too much effort, I think it is fine to accept that if you want a specific role on emphasized text, you can put it on a phrase element instead and that will do the job. What do you think, @jgm?

Owner

I don't understand the context; you'd have to explain the issue to me more fully.

Contributor

@jgm As far as I can discern from testing, <emphasis>emphasis</emphasis> (without a role attribute, or with a role attribute that is not bf, strong, bold, strikethrough, or underline) becomes Span ( "" , [ "emphasis" ] , [] ) [ Str "emphasis" ]. I would prefer Span ( "" , [ "emphasis" ] , [] ) [ Emph [ Str "emphasis" ] ] or, failing that, a revert to the current behavior, Emph [ Str "emphasis" ]. I fear that not creating the Emph will break a lot of conversions.

@yanntrividic My apologies if I've made an unclear comment. I may have misunderstood the intentions of a particular change at some point. The overall change is a really useful one and I think we're close to a good implementation.

Author

No problem! Though yes, I tried again to work around those types for a bit, but I just can't figure out a nice way to wrap the Emph in a Span. I'm fine with reverting the changes if nobody has a better proposition :)

Owner

I'm a bit lost here: surely, <emphasis>emphasis</emphasis> should be converted as simply Emph [ Str "emphasis" ]. I'm not sure why one would even consider the Span conversion? I may be missing more of the context, though?

Author

Ok, so let's start again from the beginning.

By default, addPandocAttributes is now applied to all the Inline elements to add a role attribute, if present. But for a few DocBook elements, and especially here for emphasis, the role attribute was already taken into account.

For the emphasis element, the role attribute is used to distinguish between Strong, Strikeout, Underline and Emph. So if we apply the addPandocAttributes function after doing that, we can get outputs such as:

Span
    ( ""
        , []
        , [ ( "wrapper" , "1" ) , ( "role" , "strong" ) ]
    )
    [ Strong [ Str "word" ] ]

Which is unnecessary.

On the other hand, we would like to be able to get outputs such as:

Span
    ( ""
        , []
        , [ ( "wrapper" , "1" ) , ( "role" , "special" ) ]
    )
    [ Emph [ Str "word" ] ]

Or even:

Span
    ( ""
        , []
        , [ ( "wrapper" , "1" ) , ( "role" , "special" ) ]
    )
    [ Strong [ Str "word" ] ]

But not (I think?):

Span
    ( ""
        , []
        , [ ( "wrapper" , "1" ) , ( "role" , "special" ), ( "role" , "strong" ) ]
    )
    [ Strong [ Str "word" ] ]

To get the outputs we do want, we have to modify this bit of code.

However, I don't know how to modify these lines (or others) cleanly to achieve this, because the type of the emph function fits the innerInlines function, but not the addPandocAttributes function.

Is it a bit clearer now @jgm?
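One possible way around the type mismatch described above, offered as a sketch rather than as the change that belongs in the reader: since emph takes inlines and returns inlines, a Span wrapper of the same shape can be composed with it, and the composed function could then be passed wherever emph was passed. The types below are simplified stand-ins for pandoc's Inline and Attr (plain lists and String instead of Inlines and Text), and roleAttr and emphWithRole are hypothetical names. The sketch also does not add the ( "wrapper" , "1" ) attribute that addPandocAttributes produces.

```haskell
-- Simplified stand-ins for pandoc's types; not the real definitions.
type Attr = (String, [String], [(String, String)])

data Inline
  = Str String
  | Emph [Inline]
  | Span Attr [Inline]
  deriving (Show, Eq)

-- Hypothetical helper: the role as a key-value list, empty when absent.
roleAttr :: String -> [(String, String)]
roleAttr "" = []
roleAttr r  = [("role", r)]

-- Both wrappers have the shape [Inline] -> [Inline], so they compose.
emph :: [Inline] -> [Inline]
emph xs = [Emph xs]

spanWith :: Attr -> [Inline] -> [Inline]
spanWith attr xs = [Span attr xs]

-- Emphasize the content, and wrap it in a Span only when a role is present.
emphWithRole :: String -> [Inline] -> [Inline]
emphWithRole role = case roleAttr role of
  []    -> emph
  attrs -> spanWith ("", [], attrs) . emph

main :: IO ()
main = do
  print (emphWithRole "" [Str "word"])
  print (emphWithRole "special" [Str "word"])
```

With no role the result is a bare Emph, so existing conversions would be unaffected; with a role, the Emph is kept inside the Span.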

+      spanWith ("", ["emphasis"], getRoleAttr e)
"footnote" -> note . mconcat <$>
mapM parseBlock (elContent e)
"title" -> return mempty
@@ -1329,6 +1339,9 @@ parseInline (Elem e) =
-- <?asciidor-br?> to in handleInstructions, above.
"pi-asciidoc-br" -> return linebreak
_ -> skip >> innerInlines id
+  return $ case qName (elName e) of
+    "emphasis" -> parsedInline
Owner

why this special case for "emphasis"?

Author (yanntrividic, Mar 9, 2025)

I tried something that failed, and now I need to figure out a way to handle this.

Currently, Pandoc supports role attributes for some Inlines. On emphasis, the role selects between Strong, Strikeout, Underline and Emph. All of those elements need a wrapper before attributes can be added to them. But if we apply this updated code to:

<emphasis role="strong">word</emphasis>

We get:

Span
    ( ""
        , []
        , [ ( "wrapper" , "1" ) , ( "role" , "strong" ) ]
    )
    [ Strong [ Str "word" ] ]

Which is not what we would want... So my previous attempt in f0827ee was to circumvent this issue, but then no emphasis DocBook element would have attributes, which is not what we want either.

Owner

Oh, right. Well, you're just preventing a role attribute from going on something parsed from an emphasis tag, and that's fine given that we already handle the roles in another way.

Author

But then, that would mean that we can't have any role value other than bf, bold, strong, strikethrough, or underline on emphasis elements? Maybe in that case other role attributes should be assigned to phrase elements... That's a compromise I'm willing to make, if you feel it makes sense.

Owner

If other roles are used, then in the part of the code that checks for role on emphasis and gives you Strong, Underline or whatever in response, you could also check for other roles and simply add them as attributes.
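The suggestion above could be sketched roughly as follows. This is an illustration under assumptions, not the reader's actual code: the Inline type is a simplified stand-in for pandoc's, knownRoles and emphasisFor are hypothetical names, and wrapping an unknown role as a Span around an Emph is only one of the outcomes discussed in this thread.

```haskell
-- Simplified stand-ins for pandoc's types; not the real definitions.
type Attr = (String, [String], [(String, String)])

data Inline
  = Str String
  | Emph [Inline]
  | Strong [Inline]
  | Strikeout [Inline]
  | Underline [Inline]
  | Span Attr [Inline]
  deriving (Show, Eq)

-- Role values that are already interpreted on emphasis elements.
knownRoles :: [(String, [Inline] -> Inline)]
knownRoles =
  [ ("bf", Strong), ("bold", Strong), ("strong", Strong)
  , ("strikethrough", Strikeout), ("underline", Underline)
  ]

-- Hypothetical dispatcher: a known role picks the constructor, no role
-- gives a plain Emph, and any other role is kept as an attribute on a
-- Span around the Emph (an assumption about the desired output).
emphasisFor :: String -> [Inline] -> [Inline]
emphasisFor role xs = case lookup role knownRoles of
  Just wrap -> [wrap xs]
  Nothing
    | null role -> [Emph xs]
    | otherwise -> [Span ("", [], [("role", role)]) [Emph xs]]

main :: IO ()
main = mapM_ (\r -> print (emphasisFor r [Str "word"]))
             ["strong", "", "special"]
```

Because the known roles are consumed by the lookup, they never reach the attribute list, which avoids the redundant ( "role" , "strong" ) wrapper shown earlier.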

Author

From what I understand from the documentation, I think that DocBook documents expect only one role attribute per element. To specify more precise roles, it is recommended to parameterize a pattern to do so, but you still only have one role attribute.

So, knowing this, I think that it shouldn't be possible to have a role on a Strong element, because that would imply that there are necessarily two role attributes on the DocBook element, which shouldn't be possible.

And in that case, we would only need to change the last line from this part:

"emphasis" -> case attrValue "role" e of
                             "bf"            -> innerInlines strong
                             "bold"          -> innerInlines strong
                             "strong"        -> innerInlines strong
                             "strikethrough" -> innerInlines strikeout
                             "underline"     -> innerInlines underline
                             _               -> innerInlines emph

Does it make sense like this? What do you think?

Contributor

@yanntrividic Not strictly true, docbook would allow multiple tokens in a role attribute, so e.g. <emphasis role="bold special"> is entirely valid. However, I think that is an extreme edge case and I've never seen it used. I think it would be perfectly fine to only process the known role tokens for emphasis as you describe.
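For illustration only, here is how a space-separated role value such as "bold special" might be split into the tokens the reader already understands and the remaining ones. The names knownTokens and splitRole are hypothetical, and nothing in this PR implements this.

```haskell
import Data.List (partition)

-- Role tokens that are already interpreted on emphasis elements.
knownTokens :: [String]
knownTokens = ["bf", "bold", "strong", "strikethrough", "underline"]

-- Split a role value on whitespace and separate the known tokens from
-- the rest; the order within each group is preserved.
splitRole :: String -> ([String], [String])
splitRole = partition (`elem` knownTokens) . words

main :: IO ()
main = do
  print (splitRole "bold special")
  print (splitRole "special")
```

The first component could drive the choice of Strong, Strikeout or Underline, and the second could be kept as an attribute value.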

Author

I see, you are right... Well I guess it would not be so much work to support this feature, but in that case we would support only one pattern, that is to say, space-separated values for role attributes. I think at this point, I would prefer leaving this for a filter to handle.

Author (yanntrividic, Mar 12, 2025)

On another note, I've been trying to work on something to discriminate the last case from the others... Not sure how to do it properly though... Here is my attempt: 3703757.

I know there is a parsing error in the code, but as I said, I'm quite new to Haskell, and I am really not sure how I could make this kind of type assertion.

Basically, the thought process here is that if the element is an emphasis element and it has been parsed as an Emph element, then we can add a role attribute. But I'm really not sure what would be the right way to check the second condition. Any ideas?

Edit: Oops, it is better with this line corrected: 06214b6
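The second condition (checking that the result was parsed as an Emph) does not need a type assertion; in Haskell it is usually a pattern match on the constructor. The sketch below shows the idea on simplified stand-in types, with hypothetical names isSingleEmph and addRoleIfEmph. With pandoc's real Inlines type, the value would presumably have to be converted to a list first (for example with toList) before it can be matched like this.

```haskell
-- Simplified stand-ins for pandoc's types; not the real definitions.
type Attr = (String, [String], [(String, String)])

data Inline
  = Str String
  | Emph [Inline]
  | Strong [Inline]
  | Span Attr [Inline]
  deriving (Show, Eq)

-- True only when the parsed result is exactly one Emph node.
isSingleEmph :: [Inline] -> Bool
isSingleEmph [Emph _] = True
isSingleEmph _        = False

-- Wrap the result in a Span carrying the role, but only when a role is
-- present and the element was parsed as an Emph.
addRoleIfEmph :: String -> [Inline] -> [Inline]
addRoleIfEmph role xs
  | not (null role) && isSingleEmph xs = [Span ("", [], [("role", role)]) xs]
  | otherwise                          = xs

main :: IO ()
main = do
  print (addRoleIfEmph "special" [Emph [Str "word"]])
  print (addRoleIfEmph "strong" [Strong [Str "word"]])
```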

+    _ -> addPandocAttributes (getRoleAttr e) parsedInline
where skip = do
let qn = qName $ elName e
let name = if "pi-" `T.isPrefixOf` qn