Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TBS: set sampling.tail.storage_limit to 70% of available disk space by default #15450

Closed
wants to merge 6 commits into from

Conversation

carsonip
Copy link
Member

Motivation/summary

Set sampling.tail.storage_limit to a 70% of available disk space by default when storage limit is unset. No behavior change if storage limit is set explicitly, either to 0 or non-0.

Rebase after merging #15235

Checklist

For functional changes, consider:

  • Is it observable through the addition of either logging or metrics?
  • Is its use being published in telemetry to enable product improvement?
  • Have system tests been added to avoid regression?

How to test these changes

Start apm-server with sampling.tail.storage_limit unset, set to 0, and set to >0. In case of unset, check that storage limit is automatically set to a percentage of disk space by looking at the logs.

Related issues

Fixes #14933

@carsonip carsonip requested a review from a team as a code owner January 28, 2025 23:15
Copy link
Contributor

mergify bot commented Jan 28, 2025

This pull request does not have a backport label. Could you fix it @carsonip? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • backport-8.x is the label to automatically backport to the 8.x branch.

Copy link
Contributor

mergify bot commented Jan 28, 2025

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Jan 28, 2025
@carsonip carsonip added backport-skip Skip notification from the automated backport with mergify and removed backport-8.x Automated backport to the 8.x branch with mergify labels Jan 28, 2025
@carsonip
Copy link
Member Author

@simitt I've set it to backport-skip as I'm under the impression that it is 9.0 specific.

if tailSamplingConfig.StorageLimitAuto {
const (
// fallbackDefault is the default TBS storage limit if disk space detection fails.
fallbackDefault = 3 << 30
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[to reviewer] It is still going to be an arbitrary number, but do we want to increase this to something like 5GB or 10GB?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the discussed solution, WDYT about changing the default to the % of total disk size, e.g. 70%.

1pkg
1pkg previously approved these changes Jan 29, 2025
Copy link
Member

@1pkg 1pkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any chance that we would want to make "70%" configurable in the future too?

@carsonip
Copy link
Member Author

carsonip commented Jan 29, 2025

Is there any chance that we would want to make "70%" configurable in the future too?

I doubt we need the configuration, as user can already configure the absolute number of storage e.g. "3GB", but I could be proven wrong in the long run.

Copy link
Contributor

mergify bot commented Jan 29, 2025

This pull request is now in conflicts. Could you fix it @carsonip? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b tbs-storage-limit-detection upstream/tbs-storage-limit-detection
git merge upstream/main
git push upstream tbs-storage-limit-detection

StorageLimitParsed: 3000000000,
StorageLimit: "",
StorageLimitParsed: 0,
StorageLimitAuto: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a huge fan of having two config options for the same concern.
Would it make sense to allow configuring either a fixed limit or a percentage in the StorageLimit field? We could keep the 3GB in 8.x and then switch the default to 70% in 9.0.
This way the percentage would be configurable without introducing a new config and I assume some customers would prefer the percentage over the fixed size.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea. We can make StorageLimit accept a % suffix value. And the default for the StorageLimit is 70%. We will still need 2 fields: StorageLimitParsed for an absolute value, StorageLimitPercentage for the percentage. I'll proceed to implement this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The complexity with obtaining an absolute value from a passed in percentage is that ideally we'll add the current DB size to the available disk space, to avoid drifting as the db size increases. We either ignore the drift, or we use % of total size instead of available size. But we still have to determine beats data path here to calculate the storage limit, which may be a circular dependency.

Questions:

  • When should we calculate the absolute value of storage limit - at config parse time, or at processor start time? Processor start time has the benefit of that beats data path is definitely correct and we'll have access to the db size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Played around with it a little bit. What about doing the following:

  • Move the logic for identifying the disk size limit tointernal/beater/config/sampling.go and adding a config option for the storage dir:
// TailSamplingConfig holds configuration related to tail-sampling.
type TailSamplingConfig struct {
        ...
	TTL                   time.Duration         `config:"ttl" validate:"min=1s"`
	StorageLimit          string                `config:"storage_limit"`
	StorageLimitParsed    uint64
	StorageDir            string
	DiscardOnWriteFailure bool `config:"discard_on_write_failure"`

	esConfigured bool
}
  • The diskspace calculation (including the dependency to libbeat paths) be moved to the TailSamplingConfig Unpack function to calculate the available disk based on
    (1) user specified GB limit
    (2) total disk size * user configured percentage
    (3) default to total disk size * 70%
    Afaics this doesn't creaeete a circular dependency.

  • Then only cfg.StorageLimitParsed and cfg.StorageDir would need to be passed to the TBS implementation, everything else is isolated in the beater/config package.

@carsonip
Copy link
Member Author

Superseded by #15467

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-skip Skip notification from the automated backport with mergify
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Revisit default TBS storage size limit sampling.tail.storage_limit and storage limit handling
3 participants