
Deadlock in a Loop During High-Concurrency File Updates #5695

Open
qidunfundtim opened this issue Feb 25, 2025 · 1 comment
Labels
kind/bug Something isn't working

Comments

@qidunfundtim

What happened:
We have encountered several deadlock issues with JuiceFS in our environment, which consists of AWS RDS (PostgreSQL) as the metadata engine and AWS S3 as the object storage. Specifically, 500 nodes write logs to JuiceFS simultaneously, and these logs are updated in real time.
We also have quota enabled in JuiceFS.
What you expected to happen:
No deadlocks. We would like to understand why they occur and how to fix them.
How to reproduce it (as minimally and precisely as possible):
Launch 500 AWS EC2 instances with JuiceFS mounted, and start a Python application on each of them that continuously writes logs to JuiceFS.
EC2 instances may be destroyed and new ones launched from time to time.
Anything else we need to know?
Here is some monitoring data (three monitoring screenshots attached in the original issue), along with the deadlock report from PostgreSQL:

Process 25448 waits for ExclusiveLock on tuple (2,32) of relation 24708 of database 5; blocked by process 31328.
Process 31328 waits for ShareLock on transaction 685022274; blocked by process 31374.
Process 31374 waits for ExclusiveLock on tuple (5,57) of relation 24708 of database 5; blocked by process 31902.
Process 31902 waits for ShareLock on transaction 685023299; blocked by process 25448.

Environment:

  • JuiceFS version (use juicefs --version) or Hadoop Java SDK version: juicefs version 1.1.1+2023-11-28.437f4e6
  • Cloud provider or hardware configuration running JuiceFS: AWS RDS (PostgreSQL) + AWS S3
  • OS (e.g. cat /etc/os-release): VERSION="22.04.4 LTS (Jammy Jellyfish)"
  • Kernel (e.g. uname -a): 6.5.0-1024-aws
  • Object storage (cloud provider and region, or self maintained): AWS S3, us-east-1
  • Metadata engine info (version, cloud provider managed or self maintained): AWS RDS (PostgreSQL)
  • Network connectivity (JuiceFS to metadata engine, JuiceFS to object storage): AWS internal network
  • Others:
@qidunfundtim qidunfundtim added the kind/bug Something isn't working label Feb 25, 2025
@jiefenghuang
Contributor

mark: need to sort and limit the quotas in doFlushQuotas.
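The note above points at the quota-flushing path: if each client flushes its pending directory quotas in map-iteration order (which is randomized in Go), two clients updating the same quota rows can acquire them in opposite orders and deadlock. A minimal sketch of the suggested direction, using hypothetical names (pendingQuotas, sortedInodes) rather than JuiceFS's actual doFlushQuotas internals:

```go
package main

import (
	"fmt"
	"sort"
)

// sortedInodes returns the inodes of pending quota updates in ascending
// order, so that every flushing transaction touches quota rows in the
// same global order and cannot form a lock-order cycle with another.
func sortedInodes(pending map[uint64]int64) []uint64 {
	inodes := make([]uint64, 0, len(pending))
	for ino := range pending {
		inodes = append(inodes, ino)
	}
	sort.Slice(inodes, func(i, j int) bool { return inodes[i] < inodes[j] })
	return inodes
}

func main() {
	// Hypothetical pending quota deltas keyed by directory inode.
	pendingQuotas := map[uint64]int64{42: 100, 7: -50, 19: 25}
	for _, ino := range sortedInodes(pendingQuotas) {
		// Illustrative SQL; the real schema and statements differ.
		fmt.Printf("UPDATE jfs_quota SET used = used + %d WHERE inode = %d;\n",
			pendingQuotas[ino], ino)
	}
}
```

Updating rows in one globally consistent order (ascending inode here) turns any conflict between two transactions into a simple blocking chain rather than a cycle; limiting the batch size, as the comment also suggests, additionally shortens how long each transaction holds its row locks.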
