Building Media Overlay support on top of Readium #377

smoores-dev · 2024-01-12T19:05:31Z

Hi folks!

This library is outstanding, and I've been using it to re-write the synced narration EPUB reading system in the Storyteller mobile apps.

I know that synced narration support is planned/being worked on already, but in the meantime, Storyteller needs something, so I've been working on hacking it together on top of what already exists in Readium!

After much tinkering, I have a working system that can sync between the audio position and EPUB reading position on this branch: https://gitlab.com/smoores/storyteller-mobile/-/tree/readium?ref_type=heads. The Swift code that interacts with Readium is in modules/readium/ios. The plan is to add back in sentence-level narration highlighting as well, with Decorations.

In order to get there, I had to customize a few different pieces of the EPUBParser (which makes sense, since the current version doesn't have anything Media Overlay-specific in there!). What I'm a little less happy about in the current approach is the hacks I had to add in order to find a location/locator based on a Media Overlay "fragment". Essentially, the Media Overlay SMIL files describe a set of entries mapping "clips" (an audio resource + a start and end time) to and from "fragments" (a text resource + a URL fragment pointing to the ID of a specific element in the text), but I had a lot of trouble working out a good way to actually utilize these fragments with the current swift-toolkit, and I was hoping that someone here might have some ideas/things that I missed.
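As a rough illustration of the clip/fragment mapping described above, the entries of a Media Overlay could be modeled along these lines. This is only a sketch: `Clip`, `TextFragment` and `OverlayEntry` are hypothetical names invented here, not types from the Readium toolkit or from the Storyteller code.

```swift
import Foundation

// Hypothetical stand-ins for illustration; these are not Readium types.
struct Clip: Equatable {
    let audioHref: String        // audio resource, e.g. "audio/ch1.mp3"
    let start: TimeInterval      // clip start, in seconds
    let end: TimeInterval        // clip end, in seconds
}

struct TextFragment: Hashable {
    let textHref: String         // text resource, e.g. "text/ch1.xhtml"
    let elementID: String        // URL fragment, i.e. the target element's id

    /// Parses a reference such as "text/ch1.xhtml#sentence3".
    /// Returns nil when there is no "#" or either side is empty.
    init?(src: String) {
        let parts = src.split(separator: "#", maxSplits: 1, omittingEmptySubsequences: false)
        guard parts.count == 2, !parts[0].isEmpty, !parts[1].isEmpty else { return nil }
        textHref = String(parts[0])
        elementID = String(parts[1])
    }
}

// One row of the overlay: a text fragment paired with the clip that narrates it.
struct OverlayEntry {
    let fragment: TextFragment
    let clip: Clip
}
```

Keeping the text side as a resource plus element id mirrors the way the SMIL entries point into the text, which is what makes ID-based lookups possible later on.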

What I would like to be able to do:

1. Given a Locator, identify the corresponding Clip.
   At the moment, since Media Overlays only know about clips and fragments, this looks like:
   - Injecting some JavaScript when the Navigator's location is updated that detects the first visible span with an id starting with the string sentence. This is a hack that only works for Storyteller-generated books, because I know that those elements are the ones referenced in the Media Overlays.
   - Adding the detected span's ID to the locator as its location's cssSelector. I think I would rather use partialCfis here, but in the moment I didn't want to build a whole CFI system into the HTMLResourceContentIterator.
   - Using that locator's cssSelector as the fragment, searching through the corresponding Media Overlay for the correct Clip.
2. Given an audio file and time position, identify the corresponding locator.
   This currently looks like:
   - Searching through the Media Overlay to find the fragment corresponding to this clip.
   - Using publication.content(), starting at the link for the resource identified in the Media Overlay fragment, searching through the content until I find a Text Segment with an id starting with sentence whose sentence count is higher than the one in the fragment. Again, this relies on Storyteller internals rather than the Media Overlay spec. To make it work, I also had to customize the HTMLResourceContentIterator to break up Text Segments more frequently, so that there would always be at least one Segment per sentence span (the default only flushes Segments when the language changes).
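The two lookups listed above (text position to clip, and audio position to text fragment) can be sketched as a small index over the overlay entries. Everything here is an assumption made for illustration: `Entry` and `OverlayIndex` are hypothetical types, not part of Readium, and the sketch presumes the element id for the current text position is already known.

```swift
import Foundation

// Hypothetical flattened Media Overlay entry; not a Readium type.
struct Entry {
    let textHref: String      // e.g. "text/ch1.xhtml"
    let elementID: String     // id of the element the clip narrates
    let audioHref: String     // e.g. "audio/ch1.mp3"
    let start: TimeInterval   // seconds
    let end: TimeInterval     // seconds
}

struct OverlayIndex {
    private let byFragment: [String: Entry]
    private let byAudio: [String: [Entry]]   // per audio file, sorted by start time

    init(entries: [Entry]) {
        var fragments: [String: Entry] = [:]
        for entry in entries {
            fragments["\(entry.textHref)#\(entry.elementID)"] = entry
        }
        byFragment = fragments
        byAudio = Dictionary(grouping: entries, by: { $0.audioHref })
            .mapValues { $0.sorted { $0.start < $1.start } }
    }

    /// Text position -> clip.
    func entry(textHref: String, elementID: String) -> Entry? {
        byFragment["\(textHref)#\(elementID)"]
    }

    /// Audio position -> text fragment. A linear scan, kept simple for clarity.
    func entry(audioHref: String, time: TimeInterval) -> Entry? {
        byAudio[audioHref]?.first { $0.start <= time && time < $0.end }
    }
}
```

Clip ranges are treated as half-open (`start <= time < end`) so that a time on the boundary between two adjacent clips resolves to the later one.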

I think that I could clean this up a bit if, as mentioned, I generated CFIs from the Navigator and HTML iterator, rather than relying purely on cssSelectors/element IDs. The CFI spec supports adding IDs to the CFI paths, so it would be possible to go from CFI -> fragment, though going from fragment -> CFI would require iterating the HTML as well, which currently isn't the case.

Anyway, I felt it was possible that I was missing something obvious here that would make this easier, so I figured I would ask, in case that was true! If not, this does literally work for now, and I think the CFI approach (though hairy) would allow me to generalize this to all valid Media Overlay EPUBs in the future.

mickael-menu · 2024-01-24T10:40:33Z

Maintainer

Hi Shane,

Apologies for the delay, I overlooked this notification.

We've been working on a JSON and in-memory model to represent a format-agnostic guided navigation compatible with Media Overlays. Although it's currently paused (for lack of time), the goal is to eventually implement Media Overlays with it in the Readium toolkits. You can take a look at the draft and discussions here: readium/architecture#181. I think it would be useful to bring up your hurdles there and see if we can come up with a solution in the model.

So far, we have avoided using CFIs in the mobile toolkits. If you have a compelling case that cannot be addressed by any native Web technologies, such as CSS selectors, it is worth reconsidering. Maybe opening a dedicated issue on https://github.com/readium/architecture/discussions?

I'm not super familiar with Media Overlays myself, so it's a bit unclear to me why CSS selectors don't work in your case, or why you need to check explicitly for sentence IDs. But here are a few rough ideas that could help to work with any Media Overlays:

- Keep a set of HTML IDs known to be mapped in the SMIL file. When you look for a match in the HTML, you can search for any element with a known ID in the set.
- Pre-process the HTML files with a TransformingFetcher to add a custom class or data- attribute to tag the elements that match Media Overlays clips.
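The first idea can be sketched in a few lines. The helper below is hypothetical: it assumes the reading view can report the ids of the currently visible elements in document order (`visibleIDs`), and that `overlayIDs` has been collected from the SMIL file beforehand. Neither input is something the toolkit is known to provide in this shape.

```swift
// Hypothetical helper, not a Readium API.
// `visibleIDs`: ids of visible elements, assumed to be in document order.
// `overlayIDs`: every element id referenced by the Media Overlay.
func firstNarratedID(in visibleIDs: [String], overlayIDs: Set<String>) -> String? {
    visibleIDs.first { overlayIDs.contains($0) }
}

// Usage: only ids known to the overlay are considered, whatever their naming scheme.
let match = firstNarratedID(in: ["title", "p4", "s12", "s13"], overlayIDs: ["s12", "s13"])
```

Because membership is checked against the overlay's own ids, this does not depend on any particular id prefix.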

Given a Locator, identify the corresponding Clip.

What's your use case here? Where is the locator coming from?

Given an audio file and time position, identify the corresponding locator.

Using the Media Overlays mapping, I guess you can figure out a locator that looks like: Locator(href: ..., cssSelector: "#id"). Why do you need to search for it with publication.content()? What do you use the locator for?

I had to also customize the HTMLResourceContentIterator to break up Text Segments more frequently so that there would always be at least one Segment per sentence span (the default only flushes Segments when the language changes).

You might want to use a ContentTokenizer for that, applied on the output of the HTMLResourceContentIterator. This is what is used in the PublicationSpeechSynthesizer to break the content into sentences:

swift-toolkit/Sources/Navigator/TTS/PublicationSpeechSynthesizer.swift

Lines 145 to 151 in ba378e8

    
    makeTextContentTokenizer(
        defaultLanguage: defaultLanguage,
        contextSnippetLength: 50,
        textTokenizerFactory: { language in
            makeDefaultTextTokenizer(unit: .sentence, language: language)
        }
    )

Side note: if your CSS selectors always contain only a single HTML ID, you can also use Locator(href: "href", fragments: ["id"]). I'm not sure it matters much, but it can be converted into a valid URL as Link(href: "href#id").

4 replies

smoores-dev Jan 29, 2024
Author

Thank you so much for your response @mickael-menu! I really appreciate it; there were definitely some things I was misunderstanding in how I was approaching this. Most of the parts of what I had been doing that seemed confusing to you amounted to unnecessary work to retrieve locators from the pre-computed position locators, rather than constructing new locators as needed (because I didn't realize I could do that!)

Just addressing these questions first since they're sort of the most fundamental:

Given a Locator, identify the corresponding Clip.

What's your use case here? Where is the locator coming from?

I have a Locator from Readium, specifically from the navigator(_ navigator: Navigator, locationDidChange locator: Locator) delegate method. So every time the user turns the page, I update the Locator in state, and then I need to be able to determine what the correct starting point in the correct audio resource is, given that new Locator. This doesn't seem trivial to do without the cssSelector (or fragment, fragment makes more sense for sure!) location provided in the Locator, but the mobile toolkits don't seem to provide either of these by default.

Given an audio file and time position, identify the corresponding locator.

Using the Media Overlays mapping, I guess you can figure out a locator that looks like: Locator(href: ..., cssSelector: "#id"). Why do you need to search for it with publication.content()? What do you use the locator for?

Essentially inverting the above use case, if a user is playing the audiobook, when the position in the audio resource changes, I need to be able to determine the Locator that corresponds to their new position in the text. I actually didn't realize that I'd just be able to construct an ad-hoc Locator here; I had (apparently incorrectly) been assuming that it would be necessary to also provide the position, though I do see that those are listed as optional in the spec.

...

Ok, I took a break to actually try this out, and it totally works (with one caveat), and is much, much nicer. The process is now:

1. In the navigator delegate, when the location changes, inject some JavaScript that produces a fragments array pointing to the first visible text node's parent element.
2. In getClip, simply grab the fragment from the locator and find the clip that matches.
3. In getFragment, simply create a Locator like Locator(href: ..., type: "application/xhtml+xml", locations: Locator.locations(fragments: [foundFragment])).

The one caveat is in that last step. Simply constructing a new Locator works, in that the Navigator renders the correct content when I call navigator.go(newLocator), but that Locator doesn't have any progression/totalProgression, and I'm not sure how to find them! I rely on totalProgression for a progress bar on the home page of the app, so if your last interaction with a given book was just listening to the audiobook, the progress bar on the home page will be empty until you open the text again, causing the navigator to compute the correct progression.

I'm not sure how best to get that totalProgression back. It'd be nice to avoid parsing the HTML content when the text isn't open if possible; my current best thought would be to do a dumb string match (indexOf("id=\"\(foundFragment)\"")) against the raw HTML, treat the match offset relative to the document length as the progression in the chapter (close enough?), and use that to calculate the totalProgression in the book. Again, I'm assuming I'm missing a nicer way to do this haha. But if we can figure out this last thing, I think this would be 100% generic/Media Overlay compliant, and I think it would be compatible with the Guided Navigation proposal!
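A minimal sketch of that string-match estimate, under stated assumptions: both functions are hypothetical helpers, the id attribute is assumed to be written with double quotes, and chapter lengths are assumed to be available in some consistent unit such as bytes. The result is only an approximation suitable for a progress bar.

```swift
import Foundation

/// Rough progression (0...1) of the element with `id` inside `html`, taken as
/// the UTF-8 offset of its id attribute divided by the document length.
/// Note: a plain substring match can also hit attributes ending in `id="..."`.
func estimatedProgression(ofID id: String, in html: String) -> Double? {
    guard !html.isEmpty, let range = html.range(of: "id=\"\(id)\"") else { return nil }
    let offset = html.utf8.distance(from: html.utf8.startIndex, to: range.lowerBound)
    return Double(offset) / Double(html.utf8.count)
}

/// Combines a chapter-local progression with per-chapter lengths to
/// approximate the progression through the whole book.
func estimatedTotalProgression(chapterIndex: Int, progression: Double, chapterLengths: [Int]) -> Double? {
    let total = chapterLengths.reduce(0, +)
    guard chapterLengths.indices.contains(chapterIndex), total > 0 else { return nil }
    let before = chapterLengths[..<chapterIndex].reduce(0, +)
    return (Double(before) + progression * Double(chapterLengths[chapterIndex])) / Double(total)
}
```

For example, an element halfway through the second of two chapters with lengths 100 and 300 comes out at (100 + 150) / 400 of the book.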

mickael-menu Jan 29, 2024
Maintainer

I have a Locator from Readium, specifically from the navigator(_ navigator: Navigator, locationDidChange locator: Locator) delegate method. So every time the user turns the page, I update the Locator in state, and then I need to be able to determine what the correct starting point in the correct audio resource is, given that new Locator. This doesn't seem trivial to do without the cssSelector (or fragment, fragment makes more sense for sure!) location provided in the Locator, but the mobile toolkits don't seem to provide either of these by default.

You can use EPUBNavigatorViewController.firstVisibleElementLocator() to retrieve a Locator for the current position containing the cssSelector as well as some text context. It is not done by default with locationDidChange as it is too CPU-intensive to do regularly. I would call this only when you really need it, for example when the user requests to start the audio.

I actually didn't realize that I'd just be able to construct an ad-hoc Locator here; I had (apparently incorrectly) been assuming that it would be necessary to also provide the position, though I do see that those are listed as optional in the spec.

You identified a shortcoming of the current API. We use Locator objects everywhere as an exchange type, but each API supports or returns different properties for locator.locations. It's something we want to fix in the future with more type-safe APIs, keeping Locator objects only for the JSON serialization.

Simply constructing a new Locator works, in that the Navigator renders the correct content when I call navigator.go(newLocator), but that Locator doesn't have any progression/totalProgression, and I'm not sure how to find them! I rely on totalProgression for a progress bar on the home page of the app, so if your last interaction with a given book was just listening to the audiobook, the progress bar on the home page will be empty until you open the text again, causing the navigator to compute the correct progression.

Yeah that's annoying, and an issue we faced with the TTS as well. The problem is that progression and totalProgression are fuzzy and depend on the rendering modality. That's why they are mostly meant for UX feedback and used as a navigation location as a fallback only.

- In the navigator it is actually the webview scroll progression.
- In publication.positions(), it is computed from position / positionCount.
- In HTMLResourceContentIterator, it is computed from element / elementCount.

In your specific use case, and assuming that you have an ID fragment or CSS selector (note CSS selectors don't seem implemented yet in navigator.go(to: Locator)) and the progression is only for UX feedback, I would compute the progression and totalProgression using the current clip time and duration, although this might fail if the book is using media overlays only for some chapters. Maybe a hybrid approach could work, taking into account the text in the HTML resources as well.
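A time-based estimate of totalProgression could look something like the sketch below. `Track` and `totalProgression(tracks:currentHref:time:)` are hypothetical names, not toolkit APIs, and the sketch assumes every chapter is narrated, which is exactly the case where this approach holds up.

```swift
import Foundation

// Hypothetical description of one audio track; not a Readium type.
struct Track {
    let href: String
    let duration: TimeInterval   // seconds
}

/// Approximates totalProgression as elapsed narration time divided by total
/// narration time. Assumes `tracks` is in playback order and covers the book.
func totalProgression(tracks: [Track], currentHref: String, time: TimeInterval) -> Double? {
    guard let index = tracks.firstIndex(where: { $0.href == currentHref }) else { return nil }
    let total = tracks.reduce(0) { $0 + $1.duration }
    guard total > 0 else { return nil }
    // Time in earlier tracks plus the (clamped) position in the current one.
    let elapsed = tracks[..<index].reduce(0) { $0 + $1.duration }
        + min(max(time, 0), tracks[index].duration)
    return elapsed / total
}
```

With tracks of 60 and 40 seconds, being 10 seconds into the second track gives 70 / 100 of the book.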

smoores-dev Jan 30, 2024
Author

Thanks again, @mickael-menu. I ended up injecting some JavaScript to calculate the fragments array on each locationDidChange: https://gitlab.com/smoores/storyteller-mobile/-/blob/main/modules/readium/ios/EPUBView.swift#L152-187. It seems more than fast enough for my use case, but we'll see how it goes; for very long chapters it could end up calling getBoundingClientRect quite a lot. I'm also just quickly searching for the location of the fragment in the HTML document to estimate progress during audio playback; the fact that audio files don't always map 1:1 to chapters made using the progress in the track fairly fraught. I really appreciate the tips!

I spent some time over the past two days dramatically cleaning up my usage of both the Swift and Kotlin toolkits; I think I'm using them much much more correctly now. One remaining note that I have is that in order to add mediaOverlay info to links, I've had to parse the OPF document myself; this is sort of fine, but because the OPF parsers in both toolkits are internal, it was a little cumbersome (I basically had to copy their implementations into my package). I imagine that this would dramatically expand the surface area for breaking changes/API maintenance for the libraries, but I wonder if there's a future where the toolkits expose some more of the underlying tools for parsing EPUBs to make it easier for consumers to customize the publication. As it is, there's a really nice Transformer interface, but it's hard to make use of it without re-implementing a lot of parsing work!

mickael-menu Feb 2, 2024
Maintainer

I think it should be handled on a case-by-case basis. What specific API extension do you see for the EPUBParser? As you mentioned, we should avoid exposing too much in order to maintain control over the API surface area.
