Skip to content

Conversation

@TransZAllen
Copy link
Contributor

@TransZAllen TransZAllen commented Aug 27, 2025

What is it?

  • Bugfix (user facing)
  • Feature (user facing)
  • Codebase improvement (dev facing)
  • Meta improvement to the project (dev facing)

Description of the changes in your PR

Problem

This issue persists in the latest version (0.28.0) and the most recent commit on the dev branch.
Downloaded SRT subtitles for some videos are empty, containing only timestamps and sequence numbers, as reported in #10030. This affects videos with styled subtitles, such as:

Problem Analysis

The issue occurs because some YouTube subtitles use TTML format with nested tags, which the original parser did not handle correctly.

Problematic TTML Example
<p begin="00:00:01.000" end="00:00:03.000">
  <span style="s4">Hello World!</span>
</p>

The text ("Hello World!") is nested inside a tag for styling (e.g., colors or karaoke effects). The original parser only processed direct child nodes, missing the text inside , resulting in empty SRT output.
Non-Problematic TTML Example

<p begin="00:00:01.000" end="00:00:03.000" style="s2">
  Hello World!
</p>

This TTML has text directly under<p>, which was parsed correctly by the original code.

Root Cause

The original SrtFromTtmlWriter.build method used a non-recursive loop, failing to extract text from nested tags like <span> in styled subtitles (e.g., rainbow or karaoke captions).

Solution

Added a new extractText() method to recursively extract text from all nodes, including TextNode and <br> tags, handling nested tags like <span>.
Replaced the non-recursive loop in SrtFromTtmlWriter.build with a call to extractText().

Before/After Screenshots/Screen Record

  • Before: None
  • After: None

Fixes the following issue(s)

APK testing

The APK can be found by going to the "Checks" tab below the title. On the left pane, click on "CI", scroll down to "artifacts" and click "app" to download the zip file which contains the debug APK of this PR. You can find more info and a video demonstration on this wiki page.

  • Fixed Cases (previously empty SRT files)

https://youtu.be/mtb-qa8xvFU (【HimeHina MV】Roki (Cover))
https://www.youtube.com/watch?v=zbQRY8KSVbU (Ousama Ranking Opening 2 Full『Hadaka no Yuusha』by Vaundy)
https://youtu.be/-eDYT_20YhM (炜WARD ROMANCE ft. Feng Yi)
https://www.youtube.com/watch?v=L-BgxLtMxh0 (Styled subtitles for YouTube demonstration)
https://www.youtube.com/watch?v=Cc2nkx77U24 (Test: Rainbow Captions In Youtube)
https://www.youtube.com/watch?v=lUDPjyfmJrs (【original anime MV】III【hololive/宝鐘マリン&こぼ・かなえる】)

  • Regression Testing

Tested a video with simple subtitles that downloaded correctly in the original NewPipe: https://www.youtube.com/watch?v=BVAIImxcv4g .
Confirmed SRT output remains correct after the fix, ensuring no regression.

All tested videos now produce correct SRT files with subtitle text.

Due diligence

…issue TeamNewPipe#10030)

- Previously, *.SRT files only contained timestamps and sequence numbers, without the actual text content.
- Added recursive text extraction to handle nested tags in TTML
  files.(e.g.: <span> tags)
@github-actions github-actions bot added the size/small PRs with less than 50 changed lines label Aug 27, 2025
@TobiGr TobiGr added bug Issue is related to a bug downloader Issue is related to the downloader labels Aug 27, 2025
@TobiGr
Copy link
Contributor

TobiGr commented Aug 27, 2025

Thank you for the bug fix. I applied a few JDoc and codestyle fixes. You already compiled a good list of subtitles which can be used for testing. Do you mind creating a few unit tests to test this?

@TobiGr TobiGr added this to v0.28.x Aug 27, 2025
@github-project-automation github-project-automation bot moved this to Todo in v0.28.x Aug 27, 2025
@TobiGr TobiGr moved this from Todo to In Progress in v0.28.x Aug 27, 2025
@TransZAllen
Copy link
Contributor Author

TransZAllen commented Aug 27, 2025

Thank you for the bug fix. I applied a few JDoc and codestyle fixes. You already compiled a good list of subtitles which can be used for testing. Do you mind creating a few unit tests to test this?

I'm a newcomer to the NewPipe project and this is my first code submission. When I heard about adding unit tests, I got a bit confused 😄. I quickly started preparing them, but then I saw that your review had already been approved!
So, I’ll skip the unit tests this time. Thanks for your review and feedback! 👍
And by the way, I noticed the changes were merged into the v0.28.x branch.

Comment on lines 71 to 83
if (node instanceof TextNode textNode) {
text.append((textNode).text());
} else if (node instanceof Element element) {
// <br> is a self-closing HTML tag used to insert a line break.
if (element.tagName().equalsIgnoreCase("br")) {
// Add a newline for <br> tags
text.append(NEW_LINE);
}
}
// Recursively process child nodes
for (final Node child : node.childNodes()) {
extractText(child, text);
}
Copy link
Member

@Isira-Seneviratne Isira-Seneviratne Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build method could be rewritten like this:

Suggested change
if (node instanceof TextNode textNode) {
text.append((textNode).text());
} else if (node instanceof Element element) {
// <br> is a self-closing HTML tag used to insert a line break.
if (element.tagName().equalsIgnoreCase("br")) {
// Add a newline for <br> tags
text.append(NEW_LINE);
}
}
// Recursively process child nodes
for (final Node child : node.childNodes()) {
extractText(child, text);
}
final List<Pair<Element, String>> pairList = doc.selectStream("body > div > p")
.map(paragraph -> {
// Element.text extracts from child nodes as well
return new Pair<>(paragraph, paragraph.text());
})
.filter(pair -> !ignoreEmptyFrames || !pair.second.isEmpty())
.toList();
for (final var pair : pairList) {
final var paragraph = pair.first;
final var text = pair.second;
final String begin = getTimestamp(paragraph, "begin");
final String end = getTimestamp(paragraph, "end");
writeFrame(begin, end, text);
}

Copy link
Contributor

@TobiGr TobiGr Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Element.text extracts from child nodes as well

But does it convert <br> tags to a new line? At least not from what I could find in the docs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Isira-Seneviratne

Thanks for the suggestion!

I gave it a try locally, but it failed to build (built with dev branch) because:

  1. Pair requires the commons-lang3 dependency, which is currently not part of NewPipe.
  2. selectStream(...) is only available in newer Jsoup versions. The current setup doesn’t support it.

In addition, the recursive extractText() method is dependency-free, has been tested extensively, and keeps the logic clearer for future modifications (e.g. handling special tags).

For this reason, I’d prefer to keep the current approach extractText() instead of introducing new dependencies or relying on newer APIs.

Copy link
Member

@Isira-Seneviratne Isira-Seneviratne Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Android has a built-in Pair type. Also, the current Jsoup version in the project has the method.

@TobiGr You're right, but it works fine for the subtitles from the linked videos. Alternatively, the map operation could be changed to use the existing logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly don't see a good reason to use streams here. It makes the whole code harder to read and I am not sure if we achieve an increased performance by using a stream here. If stream is significantly faster here, we could make use of it though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I changed the parameter to String as well, forgot to add that.

I accidentally edited your reply, sorry about that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching to paragraph.text() without verifying whether all edge cases (like special characters or tags) are handled properly could introduce new issues.

The extractText() function consolidates NewPipe’s scattered code into a single, recursive function that processes all text content from various tags. This code has been tested for a long time and is well-suited to NewPipe’s design requirements. It guarantees the handling of special characters, especially line breaks (<br>), which are explicitly replaced with \r\n—something that paragraph.text() may not do correctly. Some software, for instance, defines <br> as \n or even as spaces, and we can't be sure that paragraph.text() handles these cases the same way as NewPipe does. And any other special characters, we don't know.

Switching to paragraph.text() introduces the risk of losing control over special character handling. If we need to add special handling later, paragraph.text() could lock us out of such flexibility and we can't change its code. Essentially, it becomes a black box, and we’re unsure of how it’ll behave with our specific use cases.

For these reasons, I recommend continuing to use extractText().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the paragraph.text() definition from https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/nodes/Element.java#L1552

    public String text() {
        final StringBuilder accum = StringUtil.borrowBuilder();
        new TextAccumulator(accum).traverse(this);
        return StringUtil.releaseBuilder(accum).trim();
    }

And it uses the trim() function.
After researching it on Google, I found that trim() is a standard Java String method used to remove leading and trailing whitespace from a string.

Since it removes whitespace, it is not suitable for subtitles.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the implementation in this PR is already good enough. @Isira-Seneviratne If you do not have any objections, I'd proceed with merging this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I am also in favour of keeping the current code

@TobiGr
Copy link
Contributor

TobiGr commented Aug 28, 2025

I'm a newcomer to the NewPipe project and this is my first code submission.

Hi and welcome!

. When I heard about adding unit tests, I got a bit confused 😄. I quickly started preparing them, but then I saw that your review had already been approved!

If you wrote some, feel free to add them to the PR. Our test coverage is dramatically low.

And by the way, I noticed the changes were merged into the v0.28.x branch.

This PR has not been merged yet. I just added it to the project to ensure it is merged before the 0.28.1 release.

@Stypox
Copy link
Member

Stypox commented Sep 17, 2025

@TransZAllen it would indeed help if you have some time to write a couple of tests that take some short HTML strings, parse them, and check whether the extracted text is correct. E.g. one test for the old format without styles, and one for the new format. Let us know if you can make it, otherwise this PR is good to go even like this. Thanks!

@TransZAllen
Copy link
Contributor Author

@Stypox
Okay, I will definitely add the test cases, at least two.
I will start working on them tomorrow, but I'm not sure exactly how long it will take. I'm a bit slow at coding, so please bear with me and I’ll keep you updated.

@TransZAllen
Copy link
Contributor Author

Hi all, just a quick update:

  1. About the test target: currently the unit test is verifying SrtFromTtmlWriter.extractText() instead of SrtFromTtmlWriter.build(). The reason is that I wanted to check whether the plain text content is correctly extracted, without involving paragraph numbering or timestamps. Do you think this approach makes sense, or should we rather test build()?

  2. The test case is not yet complete, but during testing I discovered an issue:

    • Leading and trailing spaces in subtitle text are removed.
    • Multiple consecutive spaces inside the text are collapsed into a single space.
      Because of this, I had to modify extractText() to handle spaces properly (TextNode.text()TextNode.getWholeText()).
      The fix compiles and tests pass now, but I’m still verifying edge cases.
      Current work is in branch debug_issue10030_add_test_case.
  3. Another problem I noticed: the generated *.srt files start paragraph numbering from 0 instead of 1. According to the SubRip (.srt) specification, numbering should start from 1. Starting at 0 may still work in some players, but it can NOT with strict parsers. Here is the beginning of a *.SRT file generated by NewPipe:

a51x:/storage/emulated/0/Newpipe $ cat Bear\ Family\ Searches\ for\ Water\ _\ Wild\ Mexico\ _\ BBC\ Earth-en.srt               
0
00:00:00,640 --> 00:00:07,160
This mother has three young cubs just 8

1
00:00:03,840 --> 00:00:07,160
months old.

2
00:00:08,520 --> 00:00:22,899
[Music]

3
00:00:24,560 --> 00:00:29,039
It's autumn and the family needs to

4
00:00:27,119 --> 00:00:32,039
flatten up before the winter

@Stypox
Copy link
Member

Stypox commented Sep 23, 2025

@TransZAllen thanks for the thorough analysis!

Do you think this approach makes sense, or should we rather test build()?

Yeah it makes sense.

Current work is in branch debug_issue10030_add_test_case.

Okok thanks for handling the edge cases. Feel free to push here once you are done.

numbering should start from 1

I guess you could replace writeString(String.valueOf(frameIndex++)); with writeString(String.valueOf(++frameIndex)); here. And make sure to add a comment explaining that the first frame should have index 1.

@TransZAllen
Copy link
Contributor Author

@Stypox Thanks! As for the numbering in *.srt:

You’re right, the issue is probably right there. Every time I see ++, I get a bit confused. They’re not very clear to read, so I’d rather avoid them. I prefer using something more explicit like frameIndex += 1 or frameIndex = frameIndex + 1, and I’ll add a comment to clarify that numbering starts from 1.

@TransZAllen
Copy link
Contributor Author

@Stypox

Regarding the numbering in *.srt:

Instead of fixing it directly in this PR, I’ll first open a separate issue to track this problem, since it’s a distinct concern from the text extraction tests. Is this Okay?

This way we keep the scope of the current PR focused on extractText() and its tests, while the numbering fix can be reviewed independently.

@TransZAllen
Copy link
Contributor Author

TransZAllen commented Sep 29, 2025

Hi all, just a quick update:

1.Unit Tests

I've been working on the unit tests over the past few days. They’re not yet complete, but I’m continuing to check what’s still missing or needs adjustment.

2.SRT Numbering

  • On Ubuntu, I tested SRT files with numbering starting from 0 on multiple players (mpv 0.37.0, VLC 3.0.20, MPlayer 1.5, Totem 43.0).

    • All of them displayed the subtitles correctly, even when numbering started at 0.
    • In fact, VLC was even able to display subtitles with numbering starting at -1.
  • This is not a playback-breaking bug. However, it doesn’t follow the SRT specification, which requires numbering to start at 1. To avoid potential issues with stricter software, I think it’s best to fix this.

    My plan:

  • I’ll open a separate issue to track the numbering problem today, and then follow up with a PR.

  • Roughly, the change will be to split the current line writeString(String.valueOf(frameIndex++)); into two lines: one to use frameIndex, and another to increment it.

     I'll get started on it today.

@Stypox
Copy link
Member

Stypox commented Sep 29, 2025

Sure, it's fine if you make a separate PR. Thanks!

@TransZAllen
Copy link
Contributor Author

TransZAllen commented Sep 29, 2025

Hi everyone,

Here is the issue and PR for SRT numbering:
issue: #12670
PR: #12671

@TransZAllen
Copy link
Contributor Author

TransZAllen commented Oct 10, 2025

hi all, just a quick update —

the Unit Test is basically done now. I don’t think there’s much left to change, maybe just some tiny comment tweaks or small clean-ups.
overall, it’s pretty much ready.
if nothing new comes up these days, I’ll mark it ready for your review soon.

@ale5000-git
Copy link

ale5000-git commented Oct 10, 2025

  • Leading and trailing spaces in subtitle text are removed.

Many players do trim every line, I think there is NOT any useful use case for these spaces
but consecutive spaces inside the text are usually NOT collapsed.

  • All of them displayed the subtitles correctly, even when numbering started at 0.
  • In fact, VLC was even able to display subtitles with numbering starting at -1.

I think most players (if not all) just ignore indexes completely, you can also write 100000 in the index.
Many player also support subtitles with indexes removed.

@TransZAllen
Copy link
Contributor Author

Thanks for the extra input, @ale5000-git!
Yes, those earlier points about trimming and SRT numbering were already reviewed before —

I’m currently finalizing the Unit Test part and preparing the content to present during the review. Will ping again once it’s ready. 😊

…characters

- Replaced `text()` with `getWholeText()`:
  - avoids losing whitespaces at the beginning, end, or within the text;
  - avoids merging two or more consecutive spaces into a single space ' ';
  - avoids converting '\r', '\n', and '\r\n' within the text into a single space ' ';
  For subtitle conversion, the goal is to preserve every character exactly as intended by the subtitle author.
- Normalized tabs, line breaks, and other special characters for SRT-safe output.
- Added comprehensive unit tests in `SrtFromTtmlWriterTest.java`, including cases for simple and nested tags.
@github-actions github-actions bot added the size/large PRs with less than 750 changed lines label Oct 16, 2025
@github-actions github-actions bot removed the size/small PRs with less than 50 changed lines label Oct 16, 2025
@TransZAllen
Copy link
Contributor Author

TransZAllen commented Oct 16, 2025

Hi all,

I just pushed a commit that includes the Unit Test (SrtFromTtmlWriterTest.java).
I also spent some time writing clear code comments, hoping it helps explain the thought process behind the changes.

Here's the log message for this commit (22ee01b):

refactor(ttml): improve extractText() to preserve spaces and special characters

- Replaced `text()` with `getWholeText()`:
  - avoids losing whitespaces at the beginning, end, or within the text;
  - avoids merging two or more consecutive spaces into a single space ' ';
  - avoids converting '\r', '\n', and '\r\n' within the text into a single space ' ';
  For subtitle conversion, the goal is to preserve every character exactly as intended by the subtitle author.
- Normalized tabs, line breaks, and other special characters for SRT-safe output.
- Added comprehensive unit tests in `SrtFromTtmlWriterTest.java`, including cases for simple and nested tags.

…gs()`

- Extracted child-node traversal logic from `extractText()`
  into a helper method `traverseChildNodesForNestedTags()`.
- No functional change.
@TransZAllen
Copy link
Contributor Author

TransZAllen commented Oct 17, 2025

Hi everyone,

I've just pushed a new commit to optimize the code. Now ,there are four commits. Here is the log message for this commit(3516667):

refactor(ttml): extract recursion into `traverseChildNodesForNestedTags()`

- Extracted child-node traversal logic from `extractText()`
  into a helper method `traverseChildNodesForNestedTags()`.
- No functional change.

Please review it, 😃 there are no plans to modify the code for now.

Copy link
Contributor

@TobiGr TobiGr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you

@Stypox
Copy link
Member

Stypox commented Oct 28, 2025

From my side I don't have more comments than what Tobi already pointed out. Thanks for the tests! After those comments are solved, this is ready to be merged imo. We'll see if any user complains about misformatted subtitles, but your processing looks reasonable to me.

TransZAllen and others added 2 commits October 29, 2025 19:25
- update class header with proper technical references and remove author tag.
- update comments of replacing NBSP('\u00A0'), especially adding examples
  of rendering incorrectly.
Copy link
Member

@Stypox Stypox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Copy link
Contributor

@TobiGr TobiGr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the effort and your patience!

@TobiGr TobiGr merged commit 0a89276 into TeamNewPipe:dev Oct 30, 2025
5 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in v0.28.x Oct 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Issue is related to a bug downloader Issue is related to the downloader size/large PRs with less than 750 changed lines

Projects

Status: Done

5 participants