-
-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding support for "role" attributes for the DocBook reader #10665
base: main
Are you sure you want to change the base?
Conversation
Started working on the support for the "role" attributes, as mentioned in jgm#9089. It is still missing wrapping Para element by Div element with attributes when necessary. It also needs testing.
Added a wrapper function parseMixedWithRole around the parseMixed function, that checks if a role attribute is on the parsed element. If so, the element is wrapped in a Div with this attribute, and if not, only parseMixed is applied.
Thanks for the PR! I wonder if we couldn't simplify things considerably by using the function |
Excellent addition, thank you! 61ff730 Should point you to DocBook reader tests, but I could take a stab at adding some if you don’t feel comfortable doing it. |
Hello, thank you both for your help! @jgm, thank you for pointing me in this direction. Indeed it seems to be exactly what we need. I updated the code to work in this direction. Is it what you expected? @lifeunleaded, I don't know how I missed those files... I'm not sure how we should tackle this though. Should we add in each "zone" of the code a test with a new node that has a |
@yanntrividic I can take a look within a few days, as soon as I know what the changes do. I'll let @jgm determine whether that should block the PR. Otherwise I can submit a separate PR for tests. |
@yanntrividic I think it would be a good idea to just add a role attribute to some intended supported elements in |
Building the branch and running
Gives me this:
It seems to put role twice in the actual element, and then on all the child elements. I assume that's not intended? |
@@ -1329,6 +1334,9 @@ parseInline (Elem e) = | |||
-- <?asciidor-br?> to in handleInstructions, above. | |||
"pi-asciidoc-br" -> return linebreak | |||
_ -> skip >> innerInlines id | |||
return $ case qName (elName e) of | |||
"emphasis" -> parsedInline |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this special case for "emphasis"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried something, which failed, now I need to figure out a way to handle this.
Currently, Pandoc supports role
attributes for some Inlines
. It helps specifying Strong
, Strikeout
, Underline
and Emph
elements there. All those elements need a wrapper to add attributes to them. But if we try to apply this updated code to:
<emphasis role="strong">word</emphasis>
We get:
Span
( ""
, []
, [ ( "wrapper" , "1" ) , ( "role" , "strong" ) ]
)
[ Strong [ Str "word" ] ]
Which is not what we would want... So my previous attempt in f0827ee was to circumvent this issue, but then no emphasis
DocBook element would have attributes, which is not what we want either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right. Well, you're just preventing a role attribute from going on something parsed from an emphasis
tag, and that's fine given that we already handle the roles in another way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then, that would mean that we can't have any other role
value than bf
, bold
, strong
, strikethrough
, or underline
, on emphasis
elements ? Maybe in that case the role attributes should be assigned to phrase
elements... That's a compromise I'm willing to make yes, if you feel it makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If other roles are used, then in the part of the code that checks for role on emphasis and gives you Strong, Underline or whatever in response, you could also check for other roles and simply add them as attributes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I understand from the documentation, I think that DocBook documents expect only one role
attribute per element. To specify more precise roles, it is recommended to parameterize a pattern to do so, but you still only have one role
attribute.
So, knowing this, I think that it shouldn't be possible to have a role
on a Strong
element, because that would imply that there are necessarily two role
attributes on the DocBook element, which shouldn't be possible.
And in that case, we would only need to change the last line from this part:
"emphasis" -> case attrValue "role" e of
"bf" -> innerInlines strong
"bold" -> innerInlines strong
"strong" -> innerInlines strong
"strikethrough" -> innerInlines strikeout
"underline" -> innerInlines underline
_ -> innerInlines emph
Does it make sense like this? What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yanntrividic Not strictly true, docbook would allow multiple tokens in a role attribute, so e.g. <emphasis role="bold special">
is entirely valid. However, I think that is an extreme edge case and I've never seen it used. I think it would be perfectly fine to only process the known role tokens for emphasis as you describe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, you are right... Well I guess it would not be so much work to support this feature, but in that case we would support only one pattern, that is to say, space-separated values for role
attributes. I think at this point, I would prefer leaving this for a filter to handle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On another note, I've been trying to work on something to discriminate the last case from the others... Not sure how to do it properly though... Here is my attempt: 3703757.
I know there is a parsing error in the code, but as I said, I'm quite new to Haskell, and I am really not sure how I could make this kind of type assertion.
Basically, the thought process here is that if the element is an emphasis
element, and that has been parsed as an Emph
element, then we can add a role
attribute. But I'm really not sure what would be the right way to check the second condition. Any ideas?
Edit: Oops, it is better with this line corrected: 06214b6
Hey @lifeunleaded, thanks for taking the time to dig into some tests. You're right with what you mention in #10665 (comment), it is not intended... And I'm not sure how this recursion happens. With the updated code, here is what we get: <?xml version="1.0" encoding="utf-8" ?>
<article xmlns="http://docbook.org/ns/docbook"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:mml="http://www.w3.org/1998/Math/MathML" version="5.0">
<title>Pandoc Test Suite</title>
<section id="headers" role="sect1role">
<title>title1</title>
<para role="para1role">Test</para>
<section role="sect2role"><title>title2</title>
<para role="para2role">Test2</para>
</section>
</section>
</article>
I really don't get how the attribute propagates in the child elements, it just doesn't make sense to me. Any hints on your side? Maybe this is something obvious, and if so, I am sorry. I gotta tell you that I'm quite new to this, as those are my very first lines of Haskell 😅 |
Take a look at jgm/commonmark-hs , in particular commonmark-pandoc, which is where the instance HasAttributes (Cm a B.Blocks) where
addAttributes attrs b = fmap (addBlockAttrs attrs) <$> b
instance HasAttributes (Cm a B.Inlines) where
addAttributes attrs il = fmap (addInlineAttrs attrs) <$> il So |
So, regarding this propagation of |
No, it isn't okay. It's important to figure out why that is happening and avoid it. |
Given that If it were me, I would go through the list in
This seems to pass the trivial beginning of test coverage I've tried here. A more elegant solution would perhaps be polymorphic functions for adding the role attribute, defined separately for inlines and blocks with and without attributes, but that's not something I can whip up an example of here, if indeed it is even possible. That function name could then be used throughout. |
Worth noting that it looks like |
Here's why you're getting the role applied to all the children of section: pandoc/src/Text/Pandoc/Readers/DocBook.hs Line 1107 in 17d3ad2
The Perhaps it should put all of this in a Div, either always or at least in the case where there is a role attribute. Then the role will just be attached to the Div. Pandoc sometimes uses this structure for sections:
So that would be quite reasonable in this case. |
If I may be so bold: I think the DocBook writer is based on the role being in the Header attributes, so the roundtrip becomes a bit awkward if the reader creates an enclosing Div instead. I would suggest moving the So
This becomes more work, since it needs to be applied for each case in |
The docbook writer can handle this structure just fine:
|
Oh! Excellent, thank you, I wasn't aware and didn't manage to test before your reply. Then I have nothing to add, enclosing sect in Div with the role sounds like the best way forward. |
With those elements, the |
Yes, I agree that would be better. Unless we want to be very careful about not breaking existing filters out there that expect it in |
…to the children (see jgm#10665 (comment))
I have a working solution for this--I think--but I'll wait for the third opinion on the matter before committing anything! |
Probably safest to keep them as classes and also add them as roles. But I don't know if anyone is actually relying on this behavior. If the duplication is bad, we could mark it as a potentially breaking change in the changelog. |
@yanntrividic pulled recent changes to start looking at tests, but it doesn't build right now:
|
Yes sorry! That's what I tried to explain here: #10665 (comment), I think I'm close to a solution (that's why I committed this and made this comment). |
No worries. I see the problem and don't have an immediate solution. If you get stuck, maybe it's easier to apply Good luck and thank you for the effort. |
Also, if you don't yet, I recommend using |
The problem is with let inlineElement = e :: Inline
|
Okay! So I tried just that, and it seems to do what you proposed. I added a class |
Ah yes, thank you! Up to this point, I was mainly doing a full build and trying out different files... It is a lot more efficient! (I'm using cabal, so it is If you two feel it is all right like this, I can continue the work further and adapt the unit tests. |
I think it's important that the existing mappings that create Strong, Emph, and so on are left intact, yes. |
A quick test and a look at the latest change looks like it works as I would expect, so I think we could start writing tests. @yanntrividic Would you like me to write up a few and push to your fork, or do you feel comfortable starting them yourself? |
While trying to write a few tests, I realized that what I did in a232161 wasn't really working as expected, and And now that I am facing those tests for real, I am not sure were to start. Maybe @lifeunleaded you could show me the way and I will continue if needed? |
@yanntrividic I started looking at the failing tests, and it looks like |
@@ -1329,7 +1325,8 @@ parseInline (Elem e) = do | |||
"strong" -> innerInlines strong | |||
"strikethrough" -> innerInlines strikeout | |||
"underline" -> innerInlines underline | |||
_ -> innerInlines emph | |||
_ -> innerInlines $ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to create a span with the role attribute but then just the string content. I think it needs to become an Emph to not break existing writers too much. If it's difficult to achieve, I think reverting the _ case to innerInlines emph
(and not getting the Span for this particular case) is better than not creating an Emph
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is what I thought you meant in your first comment. As I said, I am not sure how to achieve what we want there... So if nobody has an idea to pull that without too much effort, I think it is fine to admit that if you want a specific element on an Emph
, it could be put on a phrase
element and do the work. What do you think @jgm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the context; you'd have to explain the issue to me more fully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jgm As far as I can discern from testing, <emphasis>emphasis</emphasis>
(without a role
attribute or with a role attribute that is not bf
, strong
, bold
, strikethrough
, or underline
) becomes Span ( "" , [ "emphasis" ] , [] ) [ Str "emphasis" ]
, I would prefer Span ( "" , [ "emphasis" ] , [] ) [ Emph [ Str "emphasis" ] ]
or, failing that, revert back to the current behavior Emph [ Str "emphasis" ]
. I fear not creating the Emph
will be breaking for a lot of conversions.
@yanntrividic My apologies if I've made an unclear comment. I may have misunderstood the intentions of a particular change at some point. The overall change is a really useful one and I think we're close to a good implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem! Though yes, I tried again to work around those types for a bit, but I just can't figure out a nice way to do wrap the Emph
around a Span
. I'm fine with reverting the changes if nobody has a better proposition :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit lost here: surely, <emphasis>emphasis</emphasis>
should be converted as simply Emph [ Str "emphasis" ]
. I'm not sure why one would even consider the Span conversion? I may be missing more of the context, though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so let's start again from the beginning.
By default, addPandocAttributes
is now applied to all the Inline
elements to add a role
attribute, if present. But for a few DocBook elements, and especially here for emphasis
, the role
attribute was already taken into account.
For the emphasis
element, the role
attribute is to discriminate Strong
, Strikeout
, Underline
and Emph
. So if we apply the addPandocAttributes
function after doing that, we can get outputs such as:
Span
( ""
, []
, [ ( "wrapper" , "1" ) , ( "role" , "strong" ) ]
)
[ Strong [ Str "word" ] ]
Which is unnecessary.
On the other end, we would like to be able to get outputs such as:
Span
( ""
, []
, [ ( "wrapper" , "1" ) , ( "role" , "special" ) ]
)
[ Emph [ Str "word" ] ]
Or even:
Span
( ""
, []
, [ ( "wrapper" , "1" ) , ( "role" , "special" ) ]
)
[ Strong [ Str "word" ] ]
But not (I think?):
Span
( ""
, []
, [ ( "wrapper" , "1" ) , ( "role" , "special" ), ( "role" , "strong" ) ]
)
[ Strong [ Str "word" ] ]
But if we want to get this output, we have to modify this bit of code.
But I don't know how to modify those lines (or modify others?) in a good way to achieve this, because the emph
function's types work well with the innerInlines
function, but not with the addPandocAttributes
function.
Is it a bit clearer now @jgm?
As the title says, this PR adds support for the
role
attributes to be parsed by the DocBook reader. This enhancement has been outlined in #9089. The PR is implementing a solution that was discussed in the issue by @jgm and @lifeunleaded in particular.I haven't written any unit test for this, as I couldn't find a unit test file for the DocBook reader, and I wasn't ready to start one... I hope it won't be blocking for this PR to be merged!
This change didn't feel like it required a modification to the
MANUAL.txt
file.Don't hesitate to tell me if I should adapt something. From my understanding, it is ready to be integrated to the main branch :)