
Mountpoint won't start after a hard reboot; cache dir must be cleared manually #1251

Open
eschernau-fmi opened this issue Jan 30, 2025 · 5 comments
Labels: bug

Comments

@eschernau-fmi

Mountpoint for Amazon S3 version

mount-s3 1.13.0

AWS Region

No response

Describe the running environment

EC2 instance running Rocky Linux 8.10

Mountpoint options

/usr/bin/mount-s3 --read-only --allow-other --file-mode 0555 --dir-mode 0555 --part-size 134217728 --metadata-ttl 300 --cache /opt/mountpoint/cache/$BUCKET --max-cache-size 1024 $BUCKET $MOUNT

What happened?

If an instance hangs (typically due to out-of-memory) and we must hard-reboot it from the AWS console, Mountpoint frequently fails to start when the instance comes back up until we manually remove all the files in the cache directory.

Relevant log output

Jan 28 22:48:24 hostname mount-s3[1529]: [ERROR] mountpoint_s3::cli: timeout after 30 seconds waiting for message from child process
Jan 28 22:48:24 hostname systemd[1]: mountpoint.service: Control process exited, code=exited status=1
Jan 28 22:48:24 hostname systemd[1]: mountpoint.service: Failed with result 'exit-code'.
Jan 28 22:48:24 hostname systemd[1]: Failed to start Service to mount the xxxx bucket, $bucket, at /mount using aws mountpoint.
eschernau-fmi added the bug label on Jan 30, 2025
@passaro (Contributor) commented Jan 31, 2025

Hi @eschernau-fmi, could you collect the logs for Mountpoint with the --debug flag? That could help us confirm whether the mount is stuck while cleaning the cache directory.

How are you manually removing the files in the cache directory? Is it taking long (e.g. longer than 30s)? Does the user have different permissions than the mount-s3 process?

@eschernau-fmi (Author)

Unfortunately I can't run the machine in debug mode; the logs are too massive and grow without bound.

Everything is running as root.

On boot, I get alerts from our system monitoring tool for mountpoint errors in syslog, so I manually log into the host and run:

rm -rf /opt/mountpoint/cache/$bucket/*

This takes a short time, maybe 5-8 seconds. Then I run 'systemctl start $myservice' and the mount comes up.

For reference, the systemd service file is:

[Unit]
Description=Service to mount the bucket, $bucket, at $mount using aws mountpoint
After=network-online.target
AssertPathIsDirectory=$mount

[Service]
Type=forking
User=root
ExecStart=/usr/bin/mount-s3 --read-only --allow-other --file-mode 0555 --dir-mode 0555 --part-size 134217728 --metadata-ttl 300 --cache /opt/mountpoint/cache/$bucket --max-cache-size 1024 $bucket $mount
ExecStop=/usr/bin/fusermount -u $mount
OOMScoreAdjust=-1000

[Install]
WantedBy=multi-user.target
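
A possible stopgap (an untested sketch on our side; it assumes it is always safe to discard the cache contents at startup, which should hold here since the cache holds only re-fetchable S3 blocks) would be to clear the cache before mount-s3 starts, via an ExecStartPre= step in the same [Service] section:

[Service]
# Untested sketch: discard any stale cache blocks left over from a hard
# reboot before mount-s3 starts. A shell is needed for the * glob, since
# systemd does not expand globs itself; $bucket is substituted the same
# way as elsewhere in this file.
ExecStartPre=/bin/sh -c 'rm -rf /opt/mountpoint/cache/$bucket/*'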

@passaro (Contributor) commented Jan 31, 2025

[..] This takes a short time, maybe 5-8 seconds

How many files are in the cache directory? From your --max-cache-size 1024, I'd expect around 1024. It is surprising that clearing them takes that long.

Unfortunately I can't run the machine in debug mode; the logs are too massive and grow without bound.

As an alternative, could you try running mount-s3 manually when you log into the host after a failure, but before running rm? You could use the same arguments as in the service file plus --debug. That would hopefully tell us why it is failing or getting stuck.
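
For example, something like this (a sketch reusing the arguments from your unit file; --foreground keeps the process attached so the debug output stays on the terminal):

/usr/bin/mount-s3 --debug --foreground --read-only --allow-other --file-mode 0555 --dir-mode 0555 --part-size 134217728 --metadata-ttl 300 --cache /opt/mountpoint/cache/$bucket --max-cache-size 1024 $bucket $mount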

@eschernau-fmi (Author)

Good idea; I hadn't thought of that. I'll make sure we run it with --debug the next time this happens.

@passaro (Contributor) commented Feb 3, 2025

@eschernau-fmi, on second thought, depending on the size of the files, the cache could contain many more small blocks than that. Could you report how many there are in one of the cases where you have to manually clear the cache?
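
For example, something like the following (hypothetical commands using standard findutils/coreutils, not anything Mountpoint-specific) would report the block count and the total on-disk size:

find /opt/mountpoint/cache/$bucket -type f | wc -l
du -sh /opt/mountpoint/cache/$bucket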
