
Conversation

@ArnyminerZ ArnyminerZ commented Sep 28, 2025

Context

Right now we are loading the whole iCalendar into memory when applying the preprocessors, which defeats the purpose of using Readers in the first place. This can lead to Out Of Memory exceptions on very large iCalendar documents or on devices with limited memory.

More info and reproduction in #90

Changes

  • Added a new test (ICalPreprocessorInstrumentedTest) that generates a very large iCalendar file with Int.MAX_VALUE events.
  • Changed the way ICalPreprocessor.preprocessStream works:
    • Instead of applying the StreamPreprocessors to the whole document at once, they are applied line-wise.
      Note: to avoid applying regex conditions (which can get expensive) hundreds of thousands of times, the lines are chunked in groups of 1000 (an arbitrary value that can be adjusted).
    • If the Reader given to ICalPreprocessor supports reset, the result of this function will be a SequenceReader. This is a new class that converts sequences into a Reader. Note that this Reader does not itself support reset, due to the way sequences work in Kotlin.
    • If the Reader doesn't support reset, it still has to be loaded fully into memory.
  • Since StreamPreprocessor is now simpler (most logic has been moved into ICalPreprocessor), it is now an interface.
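
The chunked, line-wise flow described above could be sketched roughly like this. This is a simplified illustration, not the actual PR code: applyPreprocessors is stubbed out, and the real SequenceReader differs in detail.

```kotlin
import java.io.Reader
import java.io.StringReader

// Minimal sketch of a Reader backed by a Sequence<String>. Chunks are pulled
// from the sequence one at a time, so only one chunk is in memory at once.
class SequenceReader(lines: Sequence<String>) : Reader() {
    private val iterator = lines.iterator()
    private var current: StringReader? = null

    override fun read(cbuf: CharArray, off: Int, len: Int): Int {
        while (true) {
            val reader = current
                ?: if (iterator.hasNext()) StringReader(iterator.next()).also { current = it }
                else return -1  // whole sequence exhausted
            val read = reader.read(cbuf, off, len)
            if (read != -1) return read
            current = null  // current chunk drained, advance to the next one
        }
    }

    override fun close() { current?.close() }
}

// Stub standing in for the real per-chunk preprocessors.
fun applyPreprocessors(lines: String): String = lines

fun preprocessStream(original: Reader, chunkSize: Int = 1000): Reader {
    // Lazy pipeline: only one chunk of `chunkSize` lines is materialized at a time.
    val chunkedFixedLines = original.buffered()
        .lineSequence()
        .chunked(chunkSize)
        .map { chunk -> applyPreprocessors(chunk.joinToString("\n")) + "\n" }
    return SequenceReader(chunkedFixedLines)
}

fun main() {
    val fixed = preprocessStream(StringReader("LINE1\nLINE2\nLINE3"), chunkSize = 2)
    check(fixed.readText() == "LINE1\nLINE2\nLINE3\n")
    println("chunked preprocessing round-trips the input")
}
```

The trailing "\n" appended per chunk is a simplification; a real implementation has to be careful to preserve the original line endings exactly.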

Note

Running the pre-processors on a per-line basis is only possible because each of them operates on individual lines. If at some point we require a preprocessor that needs to fix multiple lines at once (maybe description fixes, since descriptions allow multi-line values?), we might have to redo this.

@ArnyminerZ ArnyminerZ self-assigned this Sep 28, 2025
@ArnyminerZ ArnyminerZ added the refactoring Quality improvement of existing functions label Sep 28, 2025
@ArnyminerZ ArnyminerZ requested a review from a team as a code owner September 28, 2025 13:05
@ArnyminerZ ArnyminerZ linked an issue Sep 28, 2025 that may be closed by this pull request
@ArnyminerZ ArnyminerZ requested a review from Copilot September 28, 2025 13:07
Contributor

Copilot AI left a comment

Pull Request Overview

This PR addresses Out Of Memory (OOM) issues when processing super large iCalendar files by implementing chunked processing instead of loading entire files into memory. The changes move from a Reader-based preprocessing approach to a line-by-line chunked processing system that processes iCalendar data in groups of 1000 lines to maintain memory efficiency while preserving functionality.

Key changes:

  • Refactored stream preprocessing to process data in configurable chunks rather than loading entire files
  • Converted StreamPreprocessor from abstract class to interface with simplified API
  • Added SequenceReader utility class to convert processed line sequences back to Reader interface

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Summary per file:

  • SequenceReader.kt: New utility class that converts String sequences into a Reader
  • StreamPreprocessor.kt: Simplified from an abstract class to an interface, removing preprocessing logic
  • ICalPreprocessor.kt: Refactored to implement chunked processing with configurable chunk sizes
  • FixInvalidUtcOffsetPreprocessor.kt: Updated to implement the StreamPreprocessor interface
  • FixInvalidDayOffsetPreprocessor.kt: Updated to implement the StreamPreprocessor interface
  • ICalPreprocessorInstrumentedTest.kt: Added a test with a large iCalendar file generator to verify memory efficiency


@ArnyminerZ ArnyminerZ marked this pull request as draft September 28, 2025 13:08
@ArnyminerZ
Member Author

@bitfireAT/app-dev should be ready :)

A bit ugly, but I don't know how to simplify it, to be honest. Mainly, SequenceReader is a bit verbose.

@ArnyminerZ ArnyminerZ marked this pull request as ready for review September 28, 2025 13:46
Member

@sunkup sunkup left a comment

Pretty cool, but unfortunately not very necessary and might introduce new problems. The current focus for synctools is the refactoring for stability, which is already breaking things a bit ... So it's not really a good time to be adding this. I will do an actual review if @rfc2822 thinks we should add it to synctools now anyways.

Member

@rfc2822 rfc2822 left a comment

Because the large file / OOM was a real problem, I think it's a good idea to make the preprocessors more stable, too.

Some comments.

.lineSequence()
.chunked(chunkSize)
.map { chunk -> applyPreprocessors(chunk.joinToString("\n")) }
return SequenceReader(chunkedFixedLines)
Member

I think we can use Guava's CharSource like this:

BufferedReader(original).use { reader ->
    val chunkedFixedLines = reader.lineSequence()
        .chunked(chunkSize)
        .map { chunk ->
            val fixed = applyPreprocessors(chunk.joinToString("\n"))
            CharSource.wrap(fixed)  // String to CharSource
        }
        .asIterable()
    return CharSource.concat(chunkedFixedLines).openStream()
}

Then we could avoid creating a custom reader.

Because we need an Iterable for CharSource, it may be advantageous to use Stream/Iterable instead of Sequence, but I don't know whether there's a chunked() then. Maybe also in Guava.

But please double-check @ArnyminerZ, also that not all strings are kept in memory at once (but only one String at a time).
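
The "only one String at a time" concern can be sanity-checked with stdlib sequences alone (no Guava needed). The counter below is purely illustrative and not code from the PR; it only demonstrates that the chunked/map pipeline is evaluated lazily.

```kotlin
import java.io.StringReader

// Counts how many chunks have been transformed before and after the first
// chunk of the result is pulled. Because sequences are lazy, the count should
// only grow as the result is actually consumed.
fun chunksProcessedBeforeAndAfterFirstPull(): Pair<Int, Int> {
    var processed = 0
    val input = (1..10).joinToString("\n") { "LINE$it" }

    val chunks = StringReader(input).buffered()
        .lineSequence()
        .chunked(2)
        .map { chunk ->
            processed++  // side effect to observe lazy evaluation
            chunk.joinToString("\n")
        }

    val before = processed    // nothing consumed yet
    chunks.iterator().next()  // pull just the first chunk
    return before to processed
}

fun main() {
    check(chunksProcessedBeforeAndAfterFirstPull() == (0 to 1))
    println("sequence chunks are processed lazily, one at a time")
}
```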

Member Author

I'd say this works, because the test that failed with huge iCalendars is passing.

Member

See my comment in the test: it only tests whether the input is processed as a stream, but not whether it's actually processed (correctly).

Member

@rfc2822 rfc2822 left a comment

Some more comments :)

}

@Test
fun testParse_SuperLargeFiles() {
Member

What does this test (class) do?

As I understand it, this test basically tests whether the input is put into the memory as one chunk (which would fail with OOM) or whether it's processed in streaming mode so that it doesn't load too big files into the memory.

So I think there are two things that can be tested:

  1. Whether the preprocessor does anything when its resulting Reader is not consumed / being read. I think this is what testParse_SuperLargeFiles currently does: it verifies that preprocessStream doesn't actually load the input as long as the resulting Reader is not consumed.
     However, if the input is not read at all, why generate an iCalendar at all? We could just as well pass an infinite stream of the same character.

  2. Whether the resulting Reader returns a correct result when consumed. I think we can also use a fake input string and then check with mockk that applyPreprocessors is called. We can return another fake value and verify that it's what we want.
     This is currently not done by the tests. I noticed it because processing Int.MAX_VALUE (2147483647) events should take quite some time – however, the test returns immediately.


And everything for the two cases that

  1. the input stream supports reset(), and
  2. that it doesn't support reset().
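
The "infinite stream of the same character" idea above could be sketched with a plain Reader that never runs out of input. This fixture is illustrative, not code from the PR; the demo only shows that line-wise lazy consumption terminates on endless input, which is what a streaming preprocessStream would rely on.

```kotlin
import java.io.Reader

// A Reader that repeats the given pattern forever. Feeding this into a
// streaming preprocessor and taking only a few lines from the result should
// return immediately; an eager implementation would hang or run out of memory.
class InfiniteReader(private val pattern: String = "X\n") : Reader() {
    private var pos = 0

    override fun read(cbuf: CharArray, off: Int, len: Int): Int {
        for (i in off until off + len) {
            cbuf[i] = pattern[pos]
            pos = (pos + 1) % pattern.length
        }
        return len  // never signals end of stream
    }

    override fun close() {}
}

fun main() {
    // Lazily take three lines from an endless input: terminates immediately.
    val lines = InfiniteReader().buffered().lineSequence().take(3).toList()
    check(lines == listOf("X", "X", "X"))
    println("lazy line consumption terminates on infinite input")
}
```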

Member

As there's no Android dependency, it should be possible to make this test a unit test (faster and easier to run than an instrumentation test).

Comment on lines +17 to +18
@VisibleForTesting
val regexpForProblem = Regex(
Member

Would it be enough to use internal? I usually combine @VisibleForTesting with internal so that tests can access the field, but other packages can't.
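
The suggested combination might look like this. It is only a sketch: the annotation is declared locally so the snippet is self-contained (in the project it would come from androidx.annotation or Guava), and the regex is a placeholder, not the real pattern.

```kotlin
// Stand-in declaration; use the real androidx.annotation.VisibleForTesting
// (or Guava's) in the actual code.
@Target(AnnotationTarget.PROPERTY, AnnotationTarget.FUNCTION)
annotation class VisibleForTesting

class FixInvalidDayOffsetPreprocessor {
    // `internal` keeps the field inaccessible outside the module, while
    // @VisibleForTesting documents that it is non-private only for tests.
    @VisibleForTesting
    internal val regexpForProblem = Regex("""PT\d+D""")  // placeholder pattern
}
```

Tests in the same module can then reference regexpForProblem directly, while consumers of the library cannot.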

*
* @param original The complete iCalendar string
* @return The complete iCalendar string, but fixed
* @param lines The iCalendar lines to fix. Those may be the full iCalendar file, or just a part of it.
Member

We should explicitly mention that lines must only contain complete lines. When we have to add another StreamPreprocessor some day, we won't remember what we did now, but we will still need to know whether our input (lines) can contain partial lines or only complete lines.



Development

Successfully merging this pull request may close these issues.

OOM Exception with large icalendar files
