Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alluxio.client.file.cache.LocalCacheManager Failed to get existing page due to java.io.IOException: java.util.concurrent.RejectedExecutionException #24663

Open
fgwang7w opened this issue Mar 3, 2025 · 1 comment
Assignees
Labels

Comments

@fgwang7w
Copy link
Member

fgwang7w commented Mar 3, 2025

The cache hit ratio has been around 70%+ in our prod environment. The log indicates massive error message showing RejectedExecutionException.

JStack trace:

2025-01-30T15:25:56.973Z	ERROR	20250130_152200_10236_cup4a.21.0.6.0-90-78	alluxio.client.file.cache.LocalCacheManager	Failed to get existing page PageId{FileId=2150900da0ee9376cca0e746ddb91f9e, PageIndex=0} from pageStore
java.io.IOException: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@5a7a510[Not completed, task = alluxio.client.file.cache.TimeBoundPageStore$$Lambda$4165/0x0000ffc1f2cff568@7821a7e7] rejected from java.util.concurrent.ThreadPoolExecutor@18d371d3[Running, pool size = 64, active threads = 64, queued tasks = 0, completed tasks = 1144697]
	at alluxio.client.file.cache.TimeBoundPageStore.get(TimeBoundPageStore.java:100)
	at alluxio.client.file.cache.LocalCacheManager.getPage(LocalCacheManager.java:641)
	at alluxio.client.file.cache.LocalCacheManager.get(LocalCacheManager.java:481)
	at alluxio.client.file.cache.NoExceptionCacheManager.get(NoExceptionCacheManager.java:87)
	at alluxio.client.file.cache.LocalCacheFileInStream.readInternal(LocalCacheFileInStream.java:150)
	at alluxio.client.file.cache.LocalCacheFileInStream.positionedRead(LocalCacheFileInStream.java:228)
	at alluxio.hadoop.HdfsFileInputStream.read(HdfsFileInputStream.java:153)
	at alluxio.hadoop.HdfsFileInputStream.readFully(HdfsFileInputStream.java:170)
	at org.apache.hadoop.fs.FSDataInputStream.readFully(FSDataInputStream.java:107)
	at com.facebook.presto.hive.parquet.HdfsParquetDataSource.readInternal(HdfsParquetDataSource.java:65)
	at com.facebook.presto.parquet.AbstractParquetDataSource.readFully(AbstractParquetDataSource.java:60)
	at com.facebook.presto.parquet.reader.ParquetReader$1.read(ParquetReader.java:445)
	at java.base/java.io.BufferedInputStream.fill(BufferedInputStream.java:252)
	at java.base/java.io.BufferedInputStream.read1(BufferedInputStream.java:292)
	at java.base/java.io.BufferedInputStream.read(BufferedInputStream.java:351)
	at com.facebook.presto.hive.$internal.parquet.org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:177)
	at com.facebook.presto.hive.$internal.parquet.org.apache.thrift.transport.TTransport.readAll(TTransport.java:109)
	at com.facebook.presto.hive.$internal.parquet.org.apache.thrift.protocol.TCompactProtocol.readByte(TCompactProtocol.java:636)
	at com.facebook.presto.hive.$internal.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:539)
	at org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:155)
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1026)
	at org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
	at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
	at org.apache.parquet.format.Util.read(Util.java:362)
	at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
	at com.facebook.presto.parquet.reader.ParquetColumnChunk.readPageHeader(ParquetColumnChunk.java:90)
	at com.facebook.presto.parquet.reader.ParquetColumnChunk.buildPageReader(ParquetColumnChunk.java:119)
	at com.facebook.presto.parquet.reader.ParquetReader.createPageReaderInternal(ParquetReader.java:502)
	at com.facebook.presto.parquet.reader.ParquetReader.createPageReader(ParquetReader.java:495)
	at com.facebook.presto.parquet.reader.ParquetReader.readPrimitive(ParquetReader.java:368)
	at com.facebook.presto.parquet.reader.ParquetReader.readColumnChunk(ParquetReader.java:558)
	at com.facebook.presto.parquet.reader.ParquetReader.readBlock(ParquetReader.java:541)
	at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:230)
	at com.facebook.presto.hive.parquet.ParquetPageSource$ParquetBlockLoader.load(ParquetPageSource.java:208)
	at com.facebook.presto.common.block.LazyBlock.assureLoaded(LazyBlock.java:313)
	at com.facebook.presto.common.block.LazyBlock.getLoadedBlock(LazyBlock.java:304)
	at com.facebook.presto.operator.ScanFilterAndProjectOperator$RecordingLazyBlockLoader.load(ScanFilterAndProjectOperator.java:335)
	at com.facebook.presto.operator.ScanFilterAndProjectOperator$RecordingLazyBlockLoader.load(ScanFilterAndProjectOperator.java:321)
	at com.facebook.presto.common.block.LazyBlock.assureLoaded(LazyBlock.java:313)
	at com.facebook.presto.common.block.LazyBlock.getLoadedBlock(LazyBlock.java:304)
	at com.facebook.presto.operator.project.InputPageProjection.project(InputPageProjection.java:69)
	at com.facebook.presto.operator.project.PageProjectionWithOutputs.project(PageProjectionWithOutputs.java:56)
	at com.facebook.presto.operator.project.PageProcessor$ProjectSelectedPositions.processBatch(PageProcessor.java:327)
	at com.facebook.presto.operator.project.PageProcessor$ProjectSelectedPositions.process(PageProcessor.java:201)
	at com.facebook.presto.operator.WorkProcessorUtils$ProcessWorkProcessor.process(WorkProcessorUtils.java:315)
	at com.facebook.presto.operator.WorkProcessorUtils$YieldingIterator.computeNext(WorkProcessorUtils.java:79)
	at com.facebook.presto.operator.WorkProcessorUtils$YieldingIterator.computeNext(WorkProcessorUtils.java:65)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:141)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:136)
	at com.facebook.presto.operator.project.MergingPageOutput.getOutput(MergingPageOutput.java:128)
	at com.facebook.presto.operator.ScanFilterAndProjectOperator.processPageSource(ScanFilterAndProjectOperator.java:316)
	at com.facebook.presto.operator.ScanFilterAndProjectOperator.getOutput(ScanFilterAndProjectOperator.java:260)
	at com.facebook.presto.operator.Driver.processInternal(Driver.java:428)
	at com.facebook.presto.operator.Driver.lambda$processFor$9(Driver.java:311)
	at com.facebook.presto.operator.Driver.tryWithLock(Driver.java:732)
	at com.facebook.presto.operator.Driver.processFor(Driver.java:304)
	at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1079)
	at com.facebook.presto.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:165)
	at com.facebook.presto.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:603)
	at com.facebook.presto.$gen.Presto_0_282____20250130_134124_1.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@5a7a510[Not completed, task = alluxio.client.file.cache.TimeBoundPageStore$$Lambda$4165/0x0000ffc1f2cff568@7821a7e7] rejected from java.util.concurrent.ThreadPoolExecutor@18d371d3[Running, pool size = 64, active threads = 64, queued tasks = 0, completed tasks = 1144697]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2055)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
	at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
	at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:140)
	at alluxio.shaded.client.com.google.common.util.concurrent.SimpleTimeLimiter.callWithTimeout(SimpleTimeLimiter.java:159)
	at alluxio.client.file.cache.TimeBoundPageStore.get(TimeBoundPageStore.java:91)
	... 62 more

Some system metrics indications:

Image

Some cache related metrics indications:

Image

JMX metric shows CachePutErrors for which majority of them are RejectedExecutionExceptions.

Fetching metrics for table: com.facebook.alluxio:name=Client.CachePutInsufficientSpaceErrors.presto,type=counters

Columns: ['Count', 'node', 'object_name']
[1078, '7b41106c-9595-42de-b8c3-11237bb50acb', 'com.facebook.alluxio:name=Client.CachePutInsufficientSpaceErrors.presto,type=counters']
[840, '0ab2bafb-c997-4244-9f52-47717b91b928', 'com.facebook.alluxio:name=Client.CachePutInsufficientSpaceErrors.presto,type=counters']
[822, '0eda3ed8-f3c7-4446-b931-986091d31506', 'com.facebook.alluxio:name=Client.CachePutInsufficientSpaceErrors.presto,type=counters']
[917, '254c927f-31ea-4197-b949-971bec6f25d4', 'com.facebook.alluxio:name=Client.CachePutInsufficientSpaceErrors.presto,type=counters']
[892, '88710cc9-3b2f-4df1-b5e3-026a9882e816', 'com.facebook.alluxio:name=Client.CachePutInsufficientSpaceErrors.presto,type=counters']

Fetching metrics for table: com.facebook.alluxio:name=Client.CachePutErrors.presto,type=counters

Columns: ['Count', 'node', 'object_name']
[1129923, '7b41106c-9595-42de-b8c3-11237bb50acb', 'com.facebook.alluxio:name=Client.CachePutErrors.presto,type=counters']
[1257635, '0ab2bafb-c997-4244-9f52-47717b91b928', 'com.facebook.alluxio:name=Client.CachePutErrors.presto,type=counters']
[898066, '0eda3ed8-f3c7-4446-b931-986091d31506', 'com.facebook.alluxio:name=Client.CachePutErrors.presto,type=counters']
[1323588, '254c927f-31ea-4197-b949-971bec6f25d4', 'com.facebook.alluxio:name=Client.CachePutErrors.presto,type=counters']
[1198130, '88710cc9-3b2f-4df1-b5e3-026a9882e816', 'com.facebook.alluxio:name=Client.CachePutErrors.presto,type=counters']

Not sure where to look for next, @beinan @apc999 can you help from alluxio point of view, any suggestions? This was very similar to #15972. However we are currently using alluxio-273 release. cc @nmahadevuni @tdcmeehan

@nmahadevuni
Copy link
Member

All PUT calls are in this stack trace when the issue occurs

"pool-62-thread-1" #291 prio=5 os_prio=0 cpu=251.61ms elapsed=89.81s tid=0x0000ffc7180099a0 nid=0x1bf runnable  [0x0000ffc4ff3fe000]
   java.lang.Thread.State: RUNNABLE
	at java.io.FileDescriptor.close0([email protected]/Native Method)
	at java.io.FileDescriptor.close([email protected]/FileDescriptor.java:297)
	- locked <0x0000fff929144df0> (a java.io.FileDescriptor)
	at java.io.FileOutputStream$1.close([email protected]/FileOutputStream.java:398)
	at java.io.FileDescriptor.closeAll([email protected]/FileDescriptor.java:355)
	- locked <0x0000fff929144df0> (a java.io.FileDescriptor)
	at java.io.FileOutputStream.close([email protected]/FileOutputStream.java:396)
	at alluxio.client.file.cache.store.LocalPageStore.put(LocalPageStore.java:82)
	at alluxio.client.file.cache.TimeBoundPageStore.lambda$put$0(TimeBoundPageStore.java:61)
	at alluxio.client.file.cache.TimeBoundPageStore$$Lambda$3479/0x0000ffc4fff7f500.call(Unknown Source)
	at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1128)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:628)
	at java.lang.Thread.run([email protected]/Thread.java:829)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: 🆕 Unprioritized
Development

No branches or pull requests

4 participants