
[#2060]fix(server): Fix memory leak when reach memory limit #2058

Open · wants to merge 1 commit into master

Conversation

@lwllvyb (Contributor) commented Aug 19, 2024

What changes were proposed in this pull request?

Fix a shuffle server memory leak that occurs when the memory limit is reached.

Why are the changes needed?

With Netty enabled, the shuffle server allocates memory for each SEND_SHUFFLE_DATA_REQUEST. However, when the memory limit is reached, an OutOfDirectMemoryError is thrown and decoding of the message fails. The ByteBufs that were already allocated successfully for earlier blocks of the same message are then never released, resulting in a memory leak.
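A minimal sketch of the failure pattern and the intended cleanup, assuming a hypothetical decode loop (the class, method, and parameter names below are illustrative, not the actual Uniffle code):

```java
import java.util.ArrayList;
import java.util.List;

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;
import io.netty.util.ReferenceCountUtil;

// Hypothetical sketch of the decode path described above; it only illustrates
// the leak and the intended cleanup, it is not the real Uniffle decoder.
public final class SendShuffleDataDecodeSketch {

  static List<ByteBuf> decodeBlocks(ByteBuf in, int[] blockLengths, ByteBufAllocator allocator) {
    List<ByteBuf> decoded = new ArrayList<>();
    try {
      for (int length : blockLengths) {
        // Allocating a direct buffer for each block is where Netty throws
        // OutOfDirectMemoryError once the direct-memory limit is reached.
        ByteBuf block = allocator.directBuffer(length);
        in.readBytes(block, length);
        decoded.add(block);
      }
      return decoded;
    } catch (Throwable t) {
      // Without this cleanup, buffers already allocated for earlier blocks of
      // the same request are never released when a later allocation fails,
      // which is the leak described in #2060.
      decoded.forEach(ReferenceCountUtil::safeRelease);
      throw t;
    }
  }
}
```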

Fix: #2060

Does this PR introduce any user-facing change?

No.

How was this patch tested?

(Please test your changes, and provide instructions on how to test it:

If you add a feature or fix a bug, add a test to cover your changes.
If you fix a flaky test, repeat it for many times to prove it works.)

@lwllvyb (Contributor, Author) commented Aug 19, 2024

The error stack:

[2024-08-08 20:49:31.754] [epollEventLoopGroup-3-2] [ERROR] ShuffleServerNettyHandler - exception caught /*:38381
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 65536 byte(s) of direct memory (used: 17179815063, max: 17179869184)
at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:843)
at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:772)
at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledUnsafeNoCleanerDirectByteBuf.java:30)
at io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeNoCleanerDirectByteBuf.allocateDirect(UnpooledByteBufAllocator.java:186)
at io.netty.buffer.UnpooledDirectByteBuf.&lt;init&gt;(UnpooledDirectByteBuf.java:64)
at io.netty.buffer.UnpooledUnsafeDirectByteBuf.&lt;init&gt;(UnpooledUnsafeDirectByteBuf.java:41)
at io.netty.buffer.UnpooledUnsafeNoCleanerDirectByteBuf.&lt;init&gt;(UnpooledUnsafeNoCleanerDirectByteBuf.java:25)
at io.netty.buffer.UnpooledByteBufAllocator$InstrumentedUnpooledUnsafeNoCleanerDirectByteBuf.&lt;init&gt;(UnpooledByteBufAllocator.java:181)
at io.netty.buffer.UnpooledByteBufAllocator.newDirectBuffer(UnpooledByteBufAllocator.java:91)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
at io.netty.channel.unix.PreferredDirectByteBufAllocator.ioBuffer(PreferredDirectByteBufAllocator.java:53)
at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:120)
at io.netty.channel.epoll.EpollRecvByteAllocatorHandle.allocate(EpollRecvByteAllocatorHandle.java:75)
at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:786)
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:501)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:399)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:750)


Test Results

 2 792 files (±0)   2 792 suites (±0)   5h 51m 21s duration (-5s)
   988 tests (±0):     987 passed (±0)   1 skipped (±0)   0 failed (±0)
12 403 runs  (±0):  12 388 passed (±0)  15 skipped (±0)   0 failed (±0)

Results for commit 03ab441. Comparison against base commit ba2302c.

@lwllvyb changed the title from "fix(server): Fix memory leak when reach memory limit" to "[#2060]fix(server): Fix memory leak when reach memory limit" on Aug 19, 2024
@jerqi requested a review from rickyma on August 19, 2024 10:47
@rickyma (Contributor) commented Aug 19, 2024

I think we need to first avoid the occurrence of OOM. An OOM is caused either by unreasonable memory configurations or by bugs in the code. As for how the code should behave after an OOM, I don't think it's very important, because the server has already malfunctioned at that point. Even if there is a memory leak, it's actually not important anymore.
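One way to investigate whether such a leak comes from a code bug, assuming only the Netty dependency the server already uses, is to raise Netty's built-in leak detector while testing (equivalently, pass -Dio.netty.leakDetection.level=PARANOID). A diagnostic-only sketch, not part of this PR:

```java
import io.netty.util.ResourceLeakDetector;

// Diagnostic-only sketch: raise Netty's leak-detection level so that
// unreleased ByteBufs are reported with allocation records in the server log.
public final class LeakDetectionSetup {
  public static void main(String[] args) {
    ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
  }
}
```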

@lwllvyb (Contributor, Author) commented Aug 28, 2024

> I think we need to first avoid the occurrence of OOM. An OOM is caused either by unreasonable memory configurations or by bugs in the code. As for how the code should behave after an OOM, I don't think it's very important, because the server has already malfunctioned at that point. Even if there is a memory leak, it's actually not important anymore.

So can this PR be accepted, or should I just close it? @rickyma

@rickyma (Contributor) commented Aug 28, 2024

I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

@lwllvyb (Contributor, Author) commented Aug 28, 2024

> I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally.

@jerqi (Contributor) commented Aug 28, 2024

> I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

> This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally.

You shouldn't catch an OOM exception. If it throws an OOM, more errors may follow. You can't recover from it by just catching it.

@lwllvyb (Contributor, Author) commented Aug 28, 2024

> I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

> This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally.

> You shouldn't catch an OOM exception. If it throws an OOM, more errors may follow. You can't recover from it by just catching it.

I don't understand what you mean.
If the OOM exception happens, how should we deal with the pre-allocated ByteBuf? Reconfigure and restart the server, or something else?

@jerqi (Contributor) commented Aug 28, 2024

> I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

> This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally.

> You shouldn't catch an OOM exception. If it throws an OOM, more errors may follow. You can't recover from it by just catching it.

> I don't understand what you mean. If the OOM exception happens, how should we deal with the pre-allocated ByteBuf? Reconfigure and restart the server, or something else?

If it OOMs, the Java process should exit.

@lwllvyb (Contributor, Author) commented Aug 30, 2024

> I'm OK with this PR. But it's meaningless. When an OOM error occurs, this PR will not help much.

> This PR is not to prevent the OOM exception, but to ensure that the pre-allocated ByteBuf can be released normally.

> You shouldn't catch an OOM exception. If it throws an OOM, more errors may follow. You can't recover from it by just catching it.

> I don't understand what you mean. If the OOM exception happens, how should we deal with the pre-allocated ByteBuf? Reconfigure and restart the server, or something else?

> If it OOMs, the Java process should exit.

I need to clarify that this is not a JVM OOM but Netty's OutOfDirectMemoryError. From the stack trace, we can see that the server did not exit.
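A minimal sketch of this distinction, assuming only the Netty dependency: OutOfDirectMemoryError is thrown by Netty's own direct-memory accounting (PlatformDependent.incrementMemoryCounter in the stack above) rather than raised by the JVM, so the event loop logs it and the process keeps running, even though the class itself extends java.lang.OutOfMemoryError.

```java
import io.netty.util.internal.OutOfDirectMemoryError;

// Illustrative check only: Netty's OutOfDirectMemoryError is a subclass of
// java.lang.OutOfMemoryError, but it is thrown by Netty's direct-memory
// accounting, not raised by the JVM, so the server process keeps running.
public final class DirectOomNote {
  public static void main(String[] args) {
    System.out.println(OutOfMemoryError.class.isAssignableFrom(OutOfDirectMemoryError.class)); // true
  }
}
```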

@rickyma (Contributor) commented Aug 30, 2024

Yeah, it is Netty's internal OOM error. It's meaningless to catch this exception. On the other hand, this PR is harmless.

So I choose to remain neutral.
