@mattkduran mattkduran commented Aug 3, 2025

Description of PR

The ABFS driver's auto-throttling feature (fs.azure.enable.autothrottling=true) creates Timer threads in AbfsClientThrottlingAnalyzer that are never properly cleaned up, leading to a memory leak that eventually causes OutOfMemoryError in long-running applications like Hive Metastore.

Impact:

  • Thread count grows indefinitely (observed >100,000 timer threads)
  • Affects any long-running service that creates multiple ABFS filesystem instances

Root Cause:

AbfsClientThrottlingAnalyzer creates Timer objects in its constructor but provides no mechanism to cancel them. When AbfsClient instances are closed, the associated timer threads continue running indefinitely.
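The underlying JDK behaviour can be reproduced without any Hadoop classes. The sketch below is illustrative only: the class name `TimerLeakDemo`, the thread-name prefix, and the 10-second period are assumptions made for the demo, not values taken from the ABFS driver. It shows that each `java.util.Timer` owns a thread that stays alive until `cancel()` is called.

```java
import java.util.Timer;
import java.util.TimerTask;

public class TimerLeakDemo {
    // Hypothetical thread-name prefix, used only by this demo.
    static final String PREFIX = "timer-leak-demo-";

    /** Creates a timer the way a constructor might: scheduled, never cancelled. */
    static Timer startTimer(int id) {
        Timer timer = new Timer(PREFIX + id, true); // daemon thread starts immediately
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                // Periodic analysis work would go here.
            }
        }, 10_000L, 10_000L);
        return timer;
    }

    /** Counts live threads whose name starts with the demo prefix. */
    static long liveTimerThreads() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && t.getName().startsWith(PREFIX))
                .count();
    }

    /** Polls until no demo timer threads remain, or the timeout elapses. */
    static boolean awaitNoTimerThreads(long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (liveTimerThreads() > 0) {
            if (System.currentTimeMillis() > deadline) {
                return false;
            }
            try {
                Thread.sleep(10L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Timer[] timers = new Timer[5];
        for (int i = 0; i < timers.length; i++) {
            timers[i] = startTimer(i);
        }
        System.out.println("live timer threads before cancel: " + liveTimerThreads());
        for (Timer t : timers) {
            t.cancel(); // without this call the threads never exit
        }
        System.out.println("all timer threads exited: " + awaitNoTimerThreads(5_000L));
    }
}
```

Thread exit after `cancel()` is asynchronous, which is why the demo polls instead of checking the count immediately.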

Solution

Implement proper resource cleanup by making the throttling components implement Closeable and ensuring timers are cancelled when ABFS clients are closed.

Changes Made

  1. AbfsClientThrottlingAnalyzer.java
     • Added: implements Closeable
     • Added: close() method that calls timer.cancel() and timer.purge()
     • Purpose: Ensures timer threads are properly terminated when analyzer is no longer needed
  2. AbfsThrottlingIntercept.java (Interface)
     • Added: extends Closeable
     • Added: close() method signature
     • Purpose: Establishes cleanup contract for all throttling intercept implementations
  3. AbfsClientThrottlingIntercept.java
     • Added: close() method that closes both readThrottler and writeThrottler
     • Purpose: Coordinates cleanup of both read and write throttling analyzers
  4. AbfsNoOpThrottlingIntercept.java
     • Added: No-op close() method
     • Purpose: Satisfies interface contract for no-op implementation
  5. AbfsClient.java
     • Added: IOUtils.cleanupWithLogger(LOG, intercept) in existing close() method
     • Purpose: Integrates throttling cleanup into existing client resource management
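The shape of the change can be sketched in a self-contained form. This is not the patch itself: `SketchAnalyzer` and `SketchIntercept` are hypothetical stand-ins for the analyzer and intercept classes, the timer period is arbitrary, and the real client delegates to `IOUtils.cleanupWithLogger` rather than calling `close()` directly as done here.

```java
import java.io.Closeable;
import java.util.Timer;
import java.util.TimerTask;

public class ThrottlingCloseSketch {

    /** Hypothetical stand-in for an analyzer that owns a Timer. */
    static final class SketchAnalyzer implements Closeable {
        private final Timer timer;
        private volatile boolean closed;

        SketchAnalyzer(String name) {
            timer = new Timer("sketch-analyzer-" + name, true);
            timer.schedule(new TimerTask() {
                @Override
                public void run() {
                    // Periodic metrics analysis would go here.
                }
            }, 10_000L, 10_000L);
        }

        /** Cancels the timer so its thread can exit; safe to call repeatedly. */
        @Override
        public void close() {
            timer.cancel();
            timer.purge();
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    /** Hypothetical stand-in for an intercept that owns two analyzers. */
    static final class SketchIntercept implements Closeable {
        final SketchAnalyzer readThrottler = new SketchAnalyzer("read");
        final SketchAnalyzer writeThrottler = new SketchAnalyzer("write");

        /** Closes both analyzers. */
        @Override
        public void close() {
            readThrottler.close();
            writeThrottler.close();
        }
    }

    public static void main(String[] args) {
        // try-with-resources plays the role of the owning client's close().
        SketchIntercept intercept = new SketchIntercept();
        try (SketchIntercept managed = intercept) {
            System.out.println("closed inside block: " + managed.readThrottler.isClosed());
        }
        System.out.println("read closed: " + intercept.readThrottler.isClosed());
        System.out.println("write closed: " + intercept.writeThrottler.isClosed());
    }
}
```

Declaring `close()` without `throws IOException` is allowed when overriding `Closeable.close()`, which keeps call sites free of checked-exception handling in this sketch.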

https://github.com/mattkduran/ABFSleaktest
https://www.mail-archive.com/[email protected]/msg43483.html

How was this patch tested?

Standalone Validation Tool

This fix was validated using a standalone reproduction and testing tool that directly exercises the ABFS auto-throttling components outside of a full Hadoop deployment.
Repository: ABFSLeakTest

Testing Scope

  • Problem reproduction confirmed - demonstrates the timer thread leak
  • Fix validation confirmed - proves close() method resolves the leak
  • Resource cleanup verified - shows proper timer cancellation
  • Limited integration testing - standalone tool, not full Hadoop test suite

Test Results

Leak Reproduction Evidence

# Without fix: Timer threads accumulate over filesystem creation cycles
Cycle    Total Threads    ABFS Timer Threads    Status
1        50->52          0->2                   LEAK DETECTED
50       150->152        98->100               LEAK GROWING  
200      250->252        398->400              LEAK CONFIRMED

Final Analysis: 400 leaked timer threads named "abfs-timer-client-throttling-analyzer-*"
Memory Impact: ~90MB additional heap usage

# Direct analyzer testing:
🔴 Without close(): +3 timer threads (LEAKED)
✅ With close():    +0 timer threads (NO LEAK)

Test Environment

  • Java Version: OpenJDK 11.0.x
  • Hadoop Version: 3.3.6/3.4.1 (both affected)
  • Test Duration: 200 filesystem creation/destruction cycles
  • Thread Monitoring: JMX ThreadMXBean
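For readers who want to run a similar check, a thread count by name prefix can be obtained from `ThreadMXBean` roughly as follows. This is a minimal sketch rather than the code used in the test tool: the class and method names are hypothetical, and the prefix is the one quoted in the results above.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class TimerThreadCounter {
    // Prefix taken from the thread names reported in the test results.
    static final String ABFS_TIMER_PREFIX = "abfs-timer-client-throttling-analyzer-";

    /** Counts live threads whose name starts with the given prefix. */
    static int countThreadsWithPrefix(String prefix) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int count = 0;
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            // Entries are null for threads that exited between the two calls.
            if (info != null && info.getThreadName().startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("total live threads: "
                + ManagementFactory.getThreadMXBean().getThreadCount());
        System.out.println("ABFS timer threads: "
                + countThreadsWithPrefix(ABFS_TIMER_PREFIX));
    }
}
```

Sampling this count before and after each filesystem create/close cycle gives the kind of per-cycle numbers shown in the leak reproduction table.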

For code changes:

  • [ X ] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

@anujmodi2021 anujmodi2021 left a comment

Thanks for the patch and thorough testing of the issue @mattkduran
I have a few suggestions and comments. Please do take a look at them.

Also, we need at least one test (unit or integration) to be included in this patch. Can you plan for one? The idea is to have coverage of the code impacted here.

Also, I see a few PR checks failing. If you click the link in each -1 comment from hadoop-yetus, you should be able to see the reported issue and fix it.

Once all of this is done, we can wait for a few more reviews and get this checked in.

Thanks again for all the efforts.

@InterfaceAudience.Private
@InterfaceStability.Unstable
public interface AbfsThrottlingIntercept {
public interface AbfsThrottlingIntercept extends Closable {

Shouldn't this be extends Closeable?

Fixed -- thank you for catching the typo!

}

/**
* No-op implementation of close method.

Nit: the javadoc should include an @throws tag.

Updated!

@anujmodi2021

Spotbug issue might be due to this: https://issues.apache.org/jira/browse/HADOOP-19731
We can wait for the fix and merge with trunk once it gets resolved.

@matt-duran-starburst

@anujmodi2021 I saw that this got merged, is there a way that I can manually trigger the CI job to confirm if this fixed the spotbugs issues?

@anujmodi2021

A dummy commit will help.
Some minor javadoc change can be pushed.

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|:----------|--------:|:--------|:--------|
| +0 🆗 | reexec | 1m 31s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 46m 16s | | trunk passed |
| +1 💚 | compile | 0m 53s | | trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 |
| +1 💚 | compile | 0m 50s | | trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 |
| +1 💚 | checkstyle | 0m 40s | | trunk passed |
| +1 💚 | mvnsite | 0m 57s | | trunk passed |
| +1 💚 | javadoc | 0m 47s | | trunk passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 |
| +1 💚 | javadoc | 0m 41s | | trunk passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 |
| -1 ❌ | spotbugs | 1m 36s | /branch-spotbugs-hadoop-tools_hadoop-azure-warnings.html | hadoop-tools/hadoop-azure in trunk has 1 extant spotbugs warnings. |
| +1 💚 | shadedclient | 34m 58s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 48s | | the patch passed |
| +1 💚 | compile | 0m 42s | | the patch passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 |
| +1 💚 | javac | 0m 42s | | the patch passed |
| +1 💚 | compile | 0m 44s | | the patch passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 |
| +1 💚 | javac | 0m 44s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 ⚠️ | checkstyle | 0m 26s | /results-checkstyle-hadoop-tools_hadoop-azure.txt | hadoop-tools/hadoop-azure: The patch generated 6 new + 0 unchanged - 0 fixed = 6 total (was 0) |
| +1 💚 | mvnsite | 0m 48s | | the patch passed |
| +1 💚 | javadoc | 0m 30s | | the patch passed with JDK Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 |
| +1 💚 | javadoc | 0m 31s | | the patch passed with JDK Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 |
| +1 💚 | spotbugs | 1m 26s | | the patch passed |
| +1 💚 | shadedclient | 32m 29s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 💚 | unit | 3m 25s | | hadoop-azure in the patch passed. |
| +1 💚 | asflicense | 0m 35s | | The patch does not generate ASF License warnings. |
| | | 133m 30s | | |

| Subsystem | Report/Notes |
|:----------|:-------------|
| Docker | ClientAPI=1.52 ServerAPI=1.52 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7852/5/artifact/out/Dockerfile |
| GITHUB PR | #7852 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 9cf785906c25 5.15.0-164-generic #174-Ubuntu SMP Fri Nov 14 20:25:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / cd88518 |
| Default Java | Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 |
| Multi-JDK versions | /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.7+6-Ubuntu-0ubuntu120.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.15+6-Ubuntu-0ubuntu120.04 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7852/5/testReport/ |
| Max. process+thread count | 586 (vs. ulimit of 5500) |
| modules | C: hadoop-tools/hadoop-azure U: hadoop-tools/hadoop-azure |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7852/5/console |
| versions | git=2.25.1 maven=3.9.11 spotbugs=4.9.7 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

@matt-duran-starburst

Spotbug failed here -- not something introduced in this patch as far as I'm aware:


[Bug type VO_VOLATILE_INCREMENT (click for details)](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7852/5/artifact/out/branch-spotbugs-hadoop-tools_hadoop-azure-warnings.html#VO_VOLATILE_INCREMENT)
In class org.apache.hadoop.fs.azurebfs.services.AbfsLease$1
In method org.apache.hadoop.fs.azurebfs.services.AbfsLease$1.onFailure(Throwable)
Field org.apache.hadoop.fs.azurebfs.services.AbfsLease.acquireRetryCount
At AbfsLease.java:[line 200]
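
For context, VO_VOLATILE_INCREMENT is the SpotBugs pattern for `++` applied to a `volatile` field: the increment is a separate read, add, and write, so concurrent callers can lose updates even though each individual access is visible. The sketch below is a generic illustration with hypothetical names. It is not the `AbfsLease` code, and using `AtomicInteger` is only one common remedy, not necessarily what HADOOP-19731 does.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class VolatileIncrementSketch {
    // The flagged pattern: volatile makes reads and writes visible,
    // but ++ is still a non-atomic read-modify-write.
    private volatile int unsafeRetryCount;

    void unsafeIncrement() {
        unsafeRetryCount++; // SpotBugs reports VO_VOLATILE_INCREMENT here
    }

    /** Increments a shared AtomicInteger from several threads and returns the total. */
    static int countWithAtomic(int threads, int perThread) {
        AtomicInteger counter = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.incrementAndGet(); // atomic read-modify-write
                }
            });
            workers[i].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println("atomic total: " + countWithAtomic(4, 10_000));
    }
}
```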

}
return count;
}
}

Extra line needed at end of file

@anujmodi2021

Thanks @matt-duran-starburst for the patch and taking care of everything.

@anujmodi2021 anujmodi2021 left a comment

+1

@anujmodi2021 anujmodi2021 changed the title HADOOP-19624 Thread leak in ABFS AbfsClientThrottlingAnalyzer HADOOP-19624. Thread leak in ABFS AbfsClientThrottlingAnalyzer Jan 8, 2026
@anujmodi2021 anujmodi2021 changed the title HADOOP-19624. Thread leak in ABFS AbfsClientThrottlingAnalyzer HADOOP-19624. [ABFS] Fixing Thread leak in AbfsClientThrottlingAnalyzer Jan 8, 2026
@anujmodi2021 anujmodi2021 merged commit 4190e98 into apache:trunk Jan 8, 2026
2 of 5 checks passed