Improve the performance of Juicefs gc #5671

Open
SonglinLife opened this issue Feb 17, 2025 · 2 comments
Labels
kind/feature New feature or request

Comments


SonglinLife commented Feb 17, 2025

When a volume has a large number of files pending deletion, we typically run juicefs gc to remove them. However, juicefs gc calls the scanPendingFiles function, which processes the soft-deleted files one at a time.

juicefs/pkg/meta/tkv.go

Lines 2584 to 2594 in 2dd3897

for key, value := range pairs {
	if len(key) != klen {
		return fmt.Errorf("invalid key %x", key)
	}
	ino := m.decodeInode([]byte(key)[1:9])
	size := binary.BigEndian.Uint64([]byte(key)[9:])
	ts := m.parseInt64(value)
	clean, err := scan(ino, size, ts)
	if err != nil {
		return err
	}

The problem is that when there are many files that are each relatively small, deletion throughput drops significantly. gc does provide a --threads parameter, but in practice it only controls the parallelism of deletion within a single file.
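To make the bottleneck concrete, here is a toy model (my own simplification, not JuiceFS code): treat each file as a number of objects to delete and count how many sequential "rounds" are needed. The function names and the notion of a "round" are hypothetical and only meant to illustrate why per-file parallelism does not help with small files.

```go
package main

import "fmt"

// roundsPerFile models the current behaviour: files are handled one after
// another, and at most `threads` objects of the same file are deleted at once.
func roundsPerFile(objectsPerFile []int, threads int) int {
	rounds := 0
	for _, n := range objectsPerFile {
		rounds += (n + threads - 1) / threads // ceil(n / threads)
	}
	return rounds
}

// roundsAcrossFiles models an idealised alternative in which objects from all
// files share the same pool of `threads` workers.
func roundsAcrossFiles(objectsPerFile []int, threads int) int {
	total := 0
	for _, n := range objectsPerFile {
		total += n
	}
	return (total + threads - 1) / threads
}

func main() {
	small := make([]int, 1000) // 1000 small files, one object each
	for i := range small {
		small[i] = 1
	}
	fmt.Println(roundsPerFile(small, 10))     // 1000
	fmt.Println(roundsAcrossFiles(small, 10)) // 100
}
```

With one object per file, extra threads inside a file have nothing to do, so the number of rounds equals the number of files. Real timings also depend on metadata and object storage latency, which this model ignores.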

I think we can solve this in tkv.go by processing the files pending deletion in parallel. My proposed code is below. Please help me review it. :)

	batchSize := 1000000

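	// Use at least one worker. min(1, ...) would cap the pool at a single
	// goroutine, or at zero, which would block the producer forever.
	threads := max(1, m.conf.MaxDeletes/3)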
	deleteFileChan := make(chan pair, threads)
	var wg sync.WaitGroup

	for i := 0; i < threads; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range deleteFileChan { // p avoids shadowing the pair type
				key, value := p.key, p.value
				if len(key) != klen {
					logger.Errorf("invalid key %x", key)
					continue
				}
				ino := m.decodeInode([]byte(key)[1:9])
				size := binary.BigEndian.Uint64([]byte(key)[9:])
				ts := m.parseInt64(value)
				clean, err := scan(ino, size, ts)
				if err != nil {
					logger.Errorf("scan pending deleted files: %s", err)
					continue
				}
				if clean {
					m.doDeleteFileData(ino, size)
				}
			}
		}()
	}

	prefixKey := m.fmtKey("D")
	endKey := nextKey(prefixKey)
	for {
		keys, values, err := m.scan(prefixKey, endKey, batchSize, func(k, v []byte) bool {
			return len(k) == klen
		})
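		// Check the error first so a failed scan that returns no keys is not
		// silently treated as "nothing left to do".
		if err != nil {
			close(deleteFileChan)
			wg.Wait()
			return err
		}
		if len(keys) == 0 {
			break
		}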
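		// NOTE: this assumes m.scan excludes its start key. If the start key is
		// inclusive, the last key of each batch is processed twice.
		prefixKey = keys[len(keys)-1]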

		for index, key := range keys {
			deleteFileChan <- pair{key, values[index]}
		}

		if len(keys) < batchSize {
			break
		}
	}

	close(deleteFileChan)
	wg.Wait()
	return nil

In my tests, this improved gc performance by 10x when deleting small files.

@SonglinLife SonglinLife added the kind/feature New feature or request label Feb 17, 2025
@jiefenghuang
Contributor

PRs are very welcome

@SonglinLife
Author

Yes, I'd like to open a PR to address this issue.

SonglinLife added a commit to ctripcloud/juicefs that referenced this issue Feb 19, 2025
Improve the file deletion performance by processing multiple files in parallel

ref: juicedata#5671