Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] ConcurrentModificationException is thrown when calling Roaring64NavigableMap.getLongCardinality #1782

Open
2 of 3 tasks
rickyma opened this issue Jun 13, 2024 · 4 comments · May be fixed by #2100
Open
2 of 3 tasks
Labels
flaky test a flaky test help wanted Extra attention is needed

Comments

@rickyma
Copy link
Contributor

rickyma commented Jun 13, 2024

Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the bug

https://github.com/apache/incubator-uniffle/actions/runs/9381222136/job/25830065788?pr=1763

Affects Version(s)

master

Uniffle Server Log Output

[INFO] Running org.apache.uniffle.server.buffer.ShuffleBufferManagerTest
Error:  Tests run: 13, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.347 s <<< FAILURE! - in org.apache.uniffle.server.buffer.ShuffleBufferManagerTest
Error:  bufferSizeTest  Time elapsed: 0.646 s  <<< ERROR!
java.util.ConcurrentModificationException
	at java.util.TreeMap$NavigableSubMap$SubMapIterator.nextEntry(TreeMap.java:1703)
	at java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1751)
	at java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1745)
	at org.roaringbitmap.longlong.Roaring64NavigableMap.ensureCumulatives(Roaring64NavigableMap.java:553)
	at org.roaringbitmap.longlong.Roaring64NavigableMap.getLongCardinality(Roaring64NavigableMap.java:278)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:637)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.bufferSizeTest(ShuffleBufferManagerTest.java:450)

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@rickyma rickyma added help wanted Extra attention is needed flaky test a flaky test labels Jun 18, 2024
@maobaolong
Copy link
Member

maobaolong commented Sep 7, 2024

@rickyma After thinking about this issue and study Roaring64NavigableMap, I found that Roaring64NavigableMap is not thread-safe.

I decreased the sleep time and increased the retry max count to increase the race condition change like following changes.

 private void waitForFlush(
      ShuffleFlushManager shuffleFlushManager,
      String appId,
      int shuffleId,
      int expectedBlockNum,
      long expectedUsedMemory)
      throws Exception {
    int retry = 0;
    long committedCount = 0;
    do {
      committedCount =
          shuffleFlushManager.getCommittedBlockIds(appId, shuffleId).getLongCardinality();
      if (committedCount < expectedBlockNum) {
        Thread.sleep(5);
      }
      retry++;
      if (retry > 1000) {
        fail("Flush data time out");
      }
    } while (committedCount < expectedBlockNum);

    // Need to wait for `event.doCleanup` to be executed
    // to ensure the correctness of subsequent checks of
    // `shuffleBufferManager.getUsedMemory()` and `shuffleBufferManager.getInFlushSize()`.
    Awaitility.await()
        .atMost(Duration.ofSeconds(5))
        .until(() -> shuffleBufferManager.getUsedMemory() == expectedUsedMemory);
  }

I encountered ConcurrentModificationException, ArrayIndexOutOfBoundsException and AssertError after 20 or more time executed test.

java.lang.ArrayIndexOutOfBoundsException: 3

	at org.roaringbitmap.longlong.Roaring64NavigableMap.getLongCardinality(Roaring64NavigableMap.java:286)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:657)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.bufferSizeTest(ShuffleBufferManagerTest.java:451)
image
java.lang.AssertionError: Computed 1 differs from dummy binary-search index: -1

	at org.roaringbitmap.longlong.Roaring64NavigableMap.ensureOne(Roaring64NavigableMap.java:642)
	at org.roaringbitmap.longlong.Roaring64NavigableMap.ensureCumulatives(Roaring64NavigableMap.java:567)
	at org.roaringbitmap.longlong.Roaring64NavigableMap.getLongCardinality(Roaring64NavigableMap.java:278)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:664)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:648)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.bufferSizeTest(ShuffleBufferManagerTest.java:443)
[2024-09-07 09:50:26.752] [main] [ERROR] ShuffleBufferManagerTest.waitForFlush - Ignored exception.
java.util.ConcurrentModificationException
	at java.util.TreeMap$NavigableSubMap$SubMapIterator.nextEntry(TreeMap.java:1703)
	at java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1751)
	at java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1745)
	at org.roaringbitmap.longlong.Roaring64NavigableMap.ensureCumulatives(Roaring64NavigableMap.java:553)
	at org.roaringbitmap.longlong.Roaring64NavigableMap.getLongCardinality(Roaring64NavigableMap.java:278)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:664)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.bufferSizeTest(ShuffleBufferManagerTest.java:456)

In conclusion, I suggest to use lock or clone to avoid race condition, and we can use try catch to ignore this kind exception for testing.

@rickyma
Copy link
Contributor Author

rickyma commented Sep 8, 2024

I suggest to use lock or clone to avoid race condition

Seems fine.

we can use try catch to ignore this kind exception for testing

I think this is more of a hack than a fix.

@maobaolong
Copy link
Member

I think this is more of a hack than a fix.

I'm not sure how many days and how difficult to solve this issue. As it is for testing purpose, do you think a fail and retry could be a simple way to make the test passed?

As motioned, the Roaring64NavigableMap is not thread-safe, and it is not worth to add lock for production code only to make test code avoid race.

@rickyma
Copy link
Contributor Author

rickyma commented Sep 9, 2024

You can refactor the test cases, or maybe just disable the tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
flaky test a flaky test help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants