Skip to content

Commit

Permalink
[feat](lock)add deadlock detection tool and monitored lock implementa…
Browse files Browse the repository at this point in the history
…tions #39015  (#39099)

This issue proposes the addition of new features to the project,
including a deadlock detection tool and monitored lock implementations.
These features will help in identifying and debugging potential
deadlocks and monitoring lock usage. Features:

A monitored version of Lock that tracks and logs lock acquisition and
release times. Functionality:
Overrides lock(), unlock(), tryLock(), and tryLock(long timeout,
TimeUnit unit) methods. Logs information about lock acquisition time,
release time, and any failure to acquire the lock within the specified
timeout. ##### eg
```log
2024-08-07 12:02:59  [ Thread-2:2006 ] - [ WARN ]  Thread ID: 12, Thread Name: Thread-2 - Lock held for 1912 ms, exceeding hold timeout of 1000 ms
Thread stack trace:
	at java.lang.Thread.getStackTrace(Thread.java:1564)
	at org.example.lock.AbstractMonitoredLock.afterUnlock(AbstractMonitoredLock.java:49)
	at org.example.lock.MonitoredReentrantLock.unlock(MonitoredReentrantLock.java:32)
	at org.example.ExampleService.timeout(ExampleService.java:17)
	at org.example.Main.lambda$test2$1(Main.java:39)
	at java.lang.Thread.run(Thread.java:750)
```

Uses ScheduledExecutorService for periodic deadlock checks. Logs
deadlock information including thread names, states, lock info, and
stack traces.

**ThreadMXBean accesses thread information in the local JVM, which is
already in memory, so accessing it is less expensive than fetching data
from external resources such as disk or network. Thread state cache: The
JVM typically maintains a cache of thread states, reducing the need for
real-time calculations or additional data processing.** ##### eg
```log
Thread Name: Thread-0
Thread State: WAITING
Lock Name: java.util.concurrent.locks.ReentrantLock$NonfairSync@1d653213
Lock Owner Name: Thread-1
Lock Owner Id: 12
Waited Time: -1
Blocked Time: -1
Lock Info: java.util.concurrent.locks.ReentrantLock$NonfairSync@1d653213
Blocked by: java.util.concurrent.locks.ReentrantLock$NonfairSync@1d653213
Stack Trace:
	at sun.misc.Unsafe.park(Native Method)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	at org.example.lock.MonitoredReentrantLock.lock(MonitoredReentrantLock.java:22)
	at org.example.Main.lambda$testDeadLock$3(Main.java:79)
	at org.example.Main$$Lambda$1/1221555852.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:750)

2024-08-07 14:11:28  [ pool-1-thread-1:2001 ] - [ WARN ]  Deadlocks detected:
Thread Name: Thread-1
Thread State: WAITING
Lock Name: java.util.concurrent.locks.ReentrantLock$NonfairSync@13a2dfcf
Lock Owner Name: Thread-0
Lock Owner Id: 11
Waited Time: -1
Blocked Time: -1
Lock Info: java.util.concurrent.locks.ReentrantLock$NonfairSync@13a2dfcf
Blocked by: java.util.concurrent.locks.ReentrantLock$NonfairSync@13a2dfcf
Stack Trace:
	at sun.misc.Unsafe.park(Native Method)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
	at org.example.lock.MonitoredReentrantLock.lock(MonitoredReentrantLock.java:22)
	at org.example.Main.lambda$testDeadLock$4(Main.java:93)
	at org.example.Main$$Lambda$2/1556956098.run(Unknown Source)
	at java.lang.Thread.run(Thread.java:750)

```
```
    @WarmUp(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
    @measurement(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
    @threads(1)

Benchmark                                                          Mode  Cnt       Score   Error   Units
LockBenchmark.testMonitoredLock                                   thrpt    2   15889.407          ops/ms
LockBenchmark.testMonitoredLock:·gc.alloc.rate                    thrpt    2     678.061          MB/sec
LockBenchmark.testMonitoredLock:·gc.alloc.rate.norm               thrpt    2      56.000            B/op
LockBenchmark.testMonitoredLock:·gc.churn.PS_Eden_Space           thrpt    2     668.249          MB/sec
LockBenchmark.testMonitoredLock:·gc.churn.PS_Eden_Space.norm      thrpt    2      55.080            B/op
LockBenchmark.testMonitoredLock:·gc.churn.PS_Survivor_Space       thrpt    2       0.075          MB/sec
LockBenchmark.testMonitoredLock:·gc.churn.PS_Survivor_Space.norm  thrpt    2       0.006            B/op
LockBenchmark.testMonitoredLock:·gc.count                         thrpt    2      20.000          counts
LockBenchmark.testMonitoredLock:·gc.time                          thrpt    2       6.000              ms
LockBenchmark.testNativeLock                                      thrpt    2  103130.635          ops/ms
LockBenchmark.testNativeLock:·gc.alloc.rate                       thrpt    2      ≈ 10⁻⁴          MB/sec
LockBenchmark.testNativeLock:·gc.alloc.rate.norm                  thrpt    2      ≈ 10⁻⁶            B/op
LockBenchmark.testNativeLock:·gc.count                            thrpt    2         ≈ 0          counts

    @WarmUp(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
    @measurement(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
    @threads(100)

Benchmark                                                          Mode  Cnt       Score   Error   Units
LockBenchmark.testMonitoredLock                                   thrpt    2   10994.606          ops/ms
LockBenchmark.testMonitoredLock:·gc.alloc.rate                    thrpt    2     488.508          MB/sec
LockBenchmark.testMonitoredLock:·gc.alloc.rate.norm               thrpt    2      56.002            B/op
LockBenchmark.testMonitoredLock:·gc.churn.PS_Eden_Space           thrpt    2     481.390          MB/sec
LockBenchmark.testMonitoredLock:·gc.churn.PS_Eden_Space.norm      thrpt    2      55.163            B/op
LockBenchmark.testMonitoredLock:·gc.churn.PS_Survivor_Space       thrpt    2       0.020          MB/sec
LockBenchmark.testMonitoredLock:·gc.churn.PS_Survivor_Space.norm  thrpt    2       0.002            B/op
LockBenchmark.testMonitoredLock:·gc.count                         thrpt    2      18.000          counts
LockBenchmark.testMonitoredLock:·gc.time                          thrpt    2       9.000              ms
LockBenchmark.testNativeLock                                      thrpt    2  558652.036          ops/ms
LockBenchmark.testNativeLock:·gc.alloc.rate                       thrpt    2       0.016          MB/sec
LockBenchmark.testNativeLock:·gc.alloc.rate.norm                  thrpt    2      ≈ 10⁻⁴            B/op
LockBenchmark.testNativeLock:·gc.count                            thrpt    2         ≈ 0          counts
```

Issue Number: close #xxx

<!--Describe your changes.-->
  • Loading branch information
CalvinKirs authored and liulijia committed Sep 14, 2024
1 parent 4818ba0 commit f7e43ed
Show file tree
Hide file tree
Showing 21 changed files with 485 additions and 114 deletions.
17 changes: 17 additions & 0 deletions fe/fe-common/src/main/java/org/apache/doris/common/Config.java
Original file line number Diff line number Diff line change
Expand Up @@ -2453,4 +2453,21 @@ public class Config extends ConfigBase {

@ConfField(mutable = true)
public static boolean enable_cooldown_replica_affinity = true;

//==========================================================================
// end of cloud config
//==========================================================================
//==========================================================================
// start of lock config
@ConfField(description = {"是否开启死锁检测",
"Whether to enable deadlock detection"})
public static boolean enable_deadlock_detection = false;

@ConfField(description = {"死锁检测间隔时间,单位分钟",
"Deadlock detection interval time, unit minute"})
public static long deadlock_detection_interval_minute = 5;

@ConfField(mutable = true, description = {"表示最大锁持有时间,超过该时间会打印告警日志,单位秒",
"Maximum lock hold time; logs a warning if exceeded"})
public static long max_lock_hold_threshold_seconds = 10;
}
11 changes: 10 additions & 1 deletion fe/fe-core/src/main/java/org/apache/doris/DorisFE.java
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
import org.apache.doris.common.Log4jConfig;
import org.apache.doris.common.ThreadPoolManager;
import org.apache.doris.common.Version;
import org.apache.doris.common.lock.DeadlockMonitor;
import org.apache.doris.common.telemetry.Telemetry;
import org.apache.doris.common.util.JdkUtils;
import org.apache.doris.common.util.NetUtils;
Expand Down Expand Up @@ -59,6 +60,7 @@
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.TimeUnit;

public class DorisFE {
private static final Logger LOG = LogManager.getLogger(DorisFE.class);
Expand All @@ -84,6 +86,13 @@ public static void main(String[] args) {
start(DORIS_HOME_DIR, PID_DIR, args, options);
}

private static void startMonitor() {
if (Config.enable_deadlock_detection) {
DeadlockMonitor deadlockMonitor = new DeadlockMonitor();
deadlockMonitor.startMonitoring(Config.deadlock_detection_interval_minute, TimeUnit.MINUTES);
}
}

// entrance for doris frontend
public static void start(String dorisHomeDir, String pidDir, String[] args, StartupOptions options) {
if (System.getenv("DORIS_LOG_TO_STDERR") != null) {
Expand Down Expand Up @@ -202,7 +211,7 @@ public static void start(String dorisHomeDir, String pidDir, String[] args, Star
}

ThreadPoolManager.registerAllThreadPoolMetric();

startMonitor();
while (true) {
Thread.sleep(2000);
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@
import org.apache.doris.common.MetaNotFoundException;
import org.apache.doris.common.io.Text;
import org.apache.doris.common.io.Writable;
import org.apache.doris.common.lock.MonitoredReentrantReadWriteLock;
import org.apache.doris.persist.ColocatePersistInfo;
import org.apache.doris.persist.gson.GsonPostProcessable;
import org.apache.doris.persist.gson.GsonUtils;
Expand Down Expand Up @@ -50,7 +51,6 @@
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.stream.Collectors;

/**
Expand Down Expand Up @@ -176,7 +176,7 @@ public static boolean isGlobalGroupName(String groupName) {
// save some error msg of the group for show. no need to persist
private Map<GroupId, String> group2ErrMsgs = Maps.newHashMap();

private transient ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
private transient MonitoredReentrantReadWriteLock lock = new MonitoredReentrantReadWriteLock();

public ColocateTableIndex() {

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -27,9 +27,9 @@
import org.apache.doris.common.UserException;
import org.apache.doris.common.io.Text;
import org.apache.doris.common.io.Writable;
import org.apache.doris.common.lock.MonitoredReentrantReadWriteLock;
import org.apache.doris.common.util.DebugUtil;
import org.apache.doris.common.util.PropertyAnalyzer;
import org.apache.doris.common.util.QueryableReentrantReadWriteLock;
import org.apache.doris.common.util.Util;
import org.apache.doris.datasource.CatalogIf;
import org.apache.doris.persist.CreateTableInfo;
Expand Down Expand Up @@ -85,7 +85,8 @@ public class Database extends MetaObject implements Writable, DatabaseIf<Table>
private volatile String fullQualifiedName;
@SerializedName(value = "clusterName")
private String clusterName;
private QueryableReentrantReadWriteLock rwLock;

private MonitoredReentrantReadWriteLock rwLock;

// table family group map
private Map<Long, Table> idToTable;
Expand Down Expand Up @@ -135,7 +136,7 @@ public Database(long id, String name) {
if (this.fullQualifiedName == null) {
this.fullQualifiedName = "";
}
this.rwLock = new QueryableReentrantReadWriteLock(true);
this.rwLock = new MonitoredReentrantReadWriteLock(true);
this.idToTable = Maps.newConcurrentMap();
this.nameToTable = Maps.newConcurrentMap();
this.lowerCaseToTableName = Maps.newConcurrentMap();
Expand Down
6 changes: 3 additions & 3 deletions fe/fe-core/src/main/java/org/apache/doris/catalog/Env.java
Original file line number Diff line number Diff line change
Expand Up @@ -110,6 +110,7 @@
import org.apache.doris.common.UserException;
import org.apache.doris.common.io.CountingDataOutputStream;
import org.apache.doris.common.io.Text;
import org.apache.doris.common.lock.MonitoredReentrantLock;
import org.apache.doris.common.util.Daemon;
import org.apache.doris.common.util.DynamicPartitionUtil;
import org.apache.doris.common.util.HttpURLUtil;
Expand All @@ -118,7 +119,6 @@
import org.apache.doris.common.util.NetUtils;
import org.apache.doris.common.util.PrintableMap;
import org.apache.doris.common.util.PropertyAnalyzer;
import org.apache.doris.common.util.QueryableReentrantLock;
import org.apache.doris.common.util.SmallFileMgr;
import org.apache.doris.common.util.TimeUtils;
import org.apache.doris.common.util.Util;
Expand Down Expand Up @@ -313,7 +313,7 @@ public class Env {
// We use fair ReentrantLock to avoid starvation. Do not use this lock in critical code pass
// because fair lock has poor performance.
// Using QueryableReentrantLock to print owner thread in debug mode.
private QueryableReentrantLock lock;
private MonitoredReentrantLock lock;

private CatalogMgr catalogMgr;
private GlobalFunctionMgr globalFunctionMgr;
Expand Down Expand Up @@ -586,7 +586,7 @@ private Env(boolean isCheckpointCatalog) {
this.syncJobManager = new SyncJobManager();
this.alter = new Alter();
this.consistencyChecker = new ConsistencyChecker();
this.lock = new QueryableReentrantLock(true);
this.lock = new MonitoredReentrantLock(true);
this.backupHandler = new BackupHandler(this);
this.metaDir = Config.meta_dir;
this.publishVersionDaemon = new PublishVersionDaemon();
Expand Down
8 changes: 4 additions & 4 deletions fe/fe-core/src/main/java/org/apache/doris/catalog/Table.java
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@
import org.apache.doris.common.MetaNotFoundException;
import org.apache.doris.common.io.Text;
import org.apache.doris.common.io.Writable;
import org.apache.doris.common.util.QueryableReentrantReadWriteLock;
import org.apache.doris.common.lock.MonitoredReentrantReadWriteLock;
import org.apache.doris.common.util.SqlUtils;
import org.apache.doris.common.util.Util;
import org.apache.doris.external.hudi.HudiTable;
Expand Down Expand Up @@ -72,7 +72,7 @@ public abstract class Table extends MetaObject implements Writable, TableIf {
protected volatile String qualifiedDbName;
protected TableType type;
protected long createTime;
protected QueryableReentrantReadWriteLock rwLock;
protected MonitoredReentrantReadWriteLock rwLock;

/*
* fullSchema and nameToColumn should contains all columns, both visible and shadow.
Expand Down Expand Up @@ -112,7 +112,7 @@ public Table(TableType type) {
this.type = type;
this.fullSchema = Lists.newArrayList();
this.nameToColumn = Maps.newTreeMap(String.CASE_INSENSITIVE_ORDER);
this.rwLock = new QueryableReentrantReadWriteLock(true);
this.rwLock = new MonitoredReentrantReadWriteLock(true);
}

public Table(long id, String tableName, TableType type, List<Column> fullSchema) {
Expand All @@ -132,7 +132,7 @@ public Table(long id, String tableName, TableType type, List<Column> fullSchema)
// Only view in with-clause have null base
Preconditions.checkArgument(type == TableType.VIEW, "Table has no columns");
}
this.rwLock = new QueryableReentrantReadWriteLock(true);
this.rwLock = new MonitoredReentrantReadWriteLock(true);
this.createTime = Instant.now().getEpochSecond();
}

Expand Down
4 changes: 2 additions & 2 deletions fe/fe-core/src/main/java/org/apache/doris/catalog/Tablet.java
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
import org.apache.doris.common.Pair;
import org.apache.doris.common.io.Text;
import org.apache.doris.common.io.Writable;
import org.apache.doris.common.lock.MonitoredReentrantReadWriteLock;
import org.apache.doris.persist.gson.GsonUtils;
import org.apache.doris.resource.Tag;
import org.apache.doris.system.Backend;
Expand All @@ -50,7 +51,6 @@
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

Expand Down Expand Up @@ -97,7 +97,7 @@ public enum TabletStatus {
private long cooldownReplicaId = -1;
@SerializedName(value = "cooldownTerm")
private long cooldownTerm = -1;
private ReentrantReadWriteLock cooldownConfLock = new ReentrantReadWriteLock();
private MonitoredReentrantReadWriteLock cooldownConfLock = new MonitoredReentrantReadWriteLock();

// last time that the tablet checker checks this tablet.
// no need to persist
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
import org.apache.doris.common.MetaNotFoundException;
import org.apache.doris.common.io.Text;
import org.apache.doris.common.io.Writable;
import org.apache.doris.common.lock.MonitoredReentrantReadWriteLock;
import org.apache.doris.common.util.Util;
import org.apache.doris.datasource.CatalogIf;
import org.apache.doris.datasource.ExternalCatalog;
Expand All @@ -50,7 +51,6 @@
import java.util.Map;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
* Base class of external database.
Expand All @@ -61,7 +61,7 @@ public abstract class ExternalDatabase<T extends ExternalTable>
implements DatabaseIf<T>, Writable, GsonPostProcessable {
private static final Logger LOG = LogManager.getLogger(ExternalDatabase.class);

protected ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock(true);
protected MonitoredReentrantReadWriteLock rwLock = new MonitoredReentrantReadWriteLock(true);

@SerializedName(value = "id")
protected long id;
Expand Down Expand Up @@ -344,7 +344,7 @@ public void gsonPostProcess() throws IOException {
for (T tbl : idToTbl.values()) {
tableNameToId.put(tbl.getName(), tbl.getId());
}
rwLock = new ReentrantReadWriteLock(true);
rwLock = new MonitoredReentrantReadWriteLock(true);
}

@Override
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

package org.apache.doris.common.lock;

import org.apache.doris.common.Config;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
* Abstract base class for a monitored lock that tracks lock acquisition,
* release, and attempt times. It provides mechanisms for monitoring the
* duration for which a lock is held and logging any instances where locks
* are held longer than a specified timeout or fail to be acquired within
* a specified timeout.
*/
public abstract class AbstractMonitoredLock {

private static final Logger LOG = LoggerFactory.getLogger(AbstractMonitoredLock.class);

// Thread-local variable to store the lock start time
private final ThreadLocal<Long> lockStartTime = new ThreadLocal<>();


/**
* Method to be called after successfully acquiring the lock.
* Sets the start time for the lock.
*/
protected void afterLock() {
lockStartTime.set(System.nanoTime());
}

/**
* Method to be called after releasing the lock.
* Calculates the lock hold time and logs a warning if it exceeds the hold timeout.
*/
protected void afterUnlock() {
Long startTime = lockStartTime.get();
if (startTime != null) {
long lockHoldTimeNanos = System.nanoTime() - startTime;
long lockHoldTimeMs = lockHoldTimeNanos >> 20;
if (lockHoldTimeMs > Config.max_lock_hold_threshold_seconds * 1000) {
Thread currentThread = Thread.currentThread();
String stackTrace = getThreadStackTrace(currentThread.getStackTrace());
LOG.warn("Thread ID: {}, Thread Name: {} - Lock held for {} ms, exceeding hold timeout of {} ms "
+ "Thread stack trace:{}",
currentThread.getId(), currentThread.getName(), lockHoldTimeMs, lockHoldTimeMs, stackTrace);
}
lockStartTime.remove();
}
}

/**
* Method to be called after attempting to acquire the lock using tryLock.
* Logs a warning if the lock was not acquired within a reasonable time.
*
* @param acquired Whether the lock was successfully acquired
* @param startTime The start time of the lock attempt
*/
protected void afterTryLock(boolean acquired, long startTime) {
if (acquired) {
afterLock();
return;
}
if (LOG.isDebugEnabled()) {
long elapsedTime = (System.nanoTime() - startTime) >> 20;
Thread currentThread = Thread.currentThread();
String stackTrace = getThreadStackTrace(currentThread.getStackTrace());
LOG.debug("Thread ID: {}, Thread Name: {} - Failed to acquire the lock within {} ms"
+ "\nThread blocking info:\n{}",
currentThread.getId(), currentThread.getName(), elapsedTime, stackTrace);
}
}

/**
* Utility method to format the stack trace of a thread.
*
* @param stackTrace The stack trace elements of the thread
* @return A formatted string of the stack trace
*/
private String getThreadStackTrace(StackTraceElement[] stackTrace) {
StringBuilder sb = new StringBuilder();
for (StackTraceElement element : stackTrace) {
sb.append("\tat ").append(element).append("\n");
}
return sb.toString().replace("\n", "\\n");
}
}


Loading

0 comments on commit f7e43ed

Please sign in to comment.