Use vtbackup in scheduled backups #658

Merged into main on Mar 6, 2025 (18 commits)
Conversation

@frouioui (Member) commented Jan 28, 2025

Context

This Pull Request modifies the scheduled backup feature introduced in #553. The implementation in the original PR was a deliberately simple "v1". It introduced the VitessBackupSchedule CRD, which gives end users the option to configure different VitessBackupScheduleStrategy entries that tell the vitess-operator which keyspace/shard to back up. All the strategies in a VitessBackupSchedule were merged together to form a single vtctldclient command, executed in a Kubernetes Job following the schedule set in the VitessBackupSchedule CRD.

While this approach works, it has a limitation: it uses vtctldclient Backup, which temporarily takes a serving tablet out of the serving pool and uses it to take the backup. While this limitation is documented in the release notes, we want to encourage people to use vtbackup as an alternative.

Proposed Change

This Pull Request modifies the VitessBackupSchedule controller to create one Kubernetes Job per VitessBackupScheduleStrategy defined in the VitessBackupSchedule by the user, instead of merging all strategies into a single Job. Each Job now creates a vtbackup pod, instead of a pod in which a vtctldclient command was executed. A rough sketch of this fan-out follows.
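
As an illustration only - the helper and field names (createJobsPerStrategy, buildVtbackupJob, Spec.Strategies) are assumptions, not the operator's actual identifiers:

```go
import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createJobsPerStrategy sketches the new behavior: one Kubernetes Job per
// strategy instead of a single merged Job. VitessBackupSchedule is the
// operator's CRD type; buildVtbackupJob and Spec.Strategies are
// illustrative names.
func createJobsPerStrategy(ctx context.Context, c client.Client, schedule *VitessBackupSchedule) error {
	for _, strategy := range schedule.Spec.Strategies {
		// Each Job's pod runs vtbackup against the strategy's keyspace/shard.
		job := buildVtbackupJob(schedule, strategy)
		if err := c.Create(ctx, job); err != nil && !apierrors.IsAlreadyExists(err) {
			return err
		}
	}
	return nil
}
```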

Changes

  • Since vtbackup can only back up a single keyspace/shard, we need one Job per strategy: each VitessBackupSchedule now manages multiple Job resources (one per strategy), which was not the case before.

  • The previous logic relied heavily on the CRD's Status to schedule the next runs. That logic has been removed so that the Status is purely informative, for end users only, not for the vitess-operator. Moreover, the previous logic updated the resource's Status on almost every request, without checking whether an update was actually needed. The Status is now updated only when needed, and update conflicts are ignored (see the first sketch after this list).

  • The reconcile loop of VitessBackupSchedule was, at times, not returning the result and error expected by the controller-runtime TypedReconciler interface, leading to unnecessary re-queues or missed re-queues. The logic has been modified to return correct values (also shown in the first sketch after this list). A periodic resync of the controller has also been added to safeguard against potential issues and to ensure that resources (Jobs, PVCs, etc.) are cleaned up correctly no matter what.

  • The upgrade test has been modified to also run the 401_scheduled_backups.yaml manifest.

  • The Jobs' default concurrency policy has been changed from Allow to Forbid. Since the new strategies require far more resources and time to complete, this stricter default makes sense. The value is still configurable by end users.

  • Previously, the vtbackup pod created with every VitessShard always attempted an initial backup (--initial_backup), which would fail if the VitessBackup definition was added after the VitessShard started running. That logic has been adjusted so that the vitess-operator checks the current state of the VitessShard before creating the vtbackup pod. If the shard already has a primary, it is not an initial backup but an --allow_first_backup backup - we only create this pod when there is no backup for the shard yet. The VitessBackupSchedule relies on the same logic to determine which type of backup to run: initial, allow-first, or normal (see the second sketch after this list). Depending on the schedule, a race between the initial vtbackup pod and the first scheduled run of a VitessBackupSchedule is likely; in that case, both vtbackup pods come up with --initial_backup set to true, and one of the two simply fails early.
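
First sketch (for the Status and reconcile-loop items above): the "update Status only on change, tolerate conflicts, requeue on a fixed resync period" pattern, written against controller-runtime's public API. The helper names and the resync value are assumptions, not the operator's actual code:

```go
import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/api/equality"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// resyncPeriod is an assumed value; the operator may use a different one.
const resyncPeriod = 1 * time.Minute

// updateStatusIfNeeded writes Status only when it actually changed, and
// treats update conflicts as benign: the periodic resync will run again.
func updateStatusIfNeeded(ctx context.Context, c client.Client, observed, desired *VitessBackupSchedule) error {
	if equality.Semantic.DeepEqual(observed.Status, desired.Status) {
		return nil // skip the write entirely when nothing changed
	}
	if err := c.Status().Update(ctx, desired); err != nil && !apierrors.IsConflict(err) {
		return err
	}
	return nil
}

// successResult shows the shape of a correct return value from Reconcile:
// no error, and a requeue after the resync period.
func successResult() (ctrl.Result, error) {
	return ctrl.Result{RequeueAfter: resyncPeriod}, nil
}
```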
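
Second sketch (for the last item above): the backup-type decision as a pure function. Only the "empty" value (TypeFirstBackup) is confirmed by the review conversation below; the other constant values and all names here are assumptions:

```go
// Illustrative backup types. Only "empty" (TypeFirstBackup) is confirmed
// by the review conversation; the other values are assumptions.
const (
	typeInitialBackup = "initial" // shard is empty: vtbackup runs with --initial_backup
	typeFirstBackup   = "empty"   // a primary exists but no backup: --allow_first_backup
	typeNormalBackup  = "update"  // backups already exist: a regular backup
)

// chooseBackupType sketches the decision shared by the VitessShard init pod
// and the VitessBackupSchedule Jobs.
func chooseBackupType(hasBackups, hasPrimary bool) string {
	switch {
	case hasBackups:
		return typeNormalBackup
	case hasPrimary:
		return typeFirstBackup
	default:
		return typeInitialBackup
	}
}
```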

Related PRs

@mattlord (Contributor) left a comment:

LGTM! ❤️ I only had a few minor comments/questions.

@deepthi (Collaborator) left a comment:

I only had a few comments on the diff in this PR, but it's unclear to me how AllowFirstBackup works. Is that functionality that was already present, and hence not showing up as a diff in this PR? If I'm just missing it, please point me to the file where I can see it.

```diff
-"Allow" (default): allows CronJobs to run concurrently;
-"Forbid": forbids concurrent runs, skipping next run if previous run hasn't finished yet;
+"Allow": allows CronJobs to run concurrently;
+"Forbid" (default): forbids concurrent runs, skipping next run if previous run hasn't finished yet;
```
A collaborator replied:
👍

```diff
@@ -2602,7 +2592,8 @@ <h3 id="planetscale.com/v2.VitessBackupScheduleStatus">VitessBackupScheduleStatu
 </td>
 <td>
 <em>(Optional)</em>
-<p>A list of pointers to currently running jobs.</p>
+<p>A list of pointers to currently running jobs.
+This field is deprecated and no longer used in &gt;= v2.15. It will be removed in a future release.</p>
```
Collaborator:
Is this file generated? I'm not repeating my comments from api.md here.

@frouioui (Member, Author) replied:
docs/api.md and docs/api/index.html are both generated using the same source.

@frouioui (Member, Author) commented Mar 5, 2025

@deepthi AllowFirstBackup is new code; the flag is set here:

```go
"allow_first_backup": backupSpec.AllowFirstBackup,
```

on the vtbackup pod. The decision whether or not to set the flag is made in vtbackupSpec.

@frouioui (Member, Author) commented Mar 5, 2025

We have this constant:

```go
TypeFirstBackup = "empty"
```

It is used in both places, once for the vtbackup init pod and once for the scheduled backup:

```go
backupType := vitessbackup.TypeFirstBackup
```

This type is used when we have no backup and a primary is already running in the shard.

@deepthi (Collaborator) commented Mar 5, 2025, quoting the comment above:

Based on an out-of-band discussion, we should not allow people to take "first" backups on a long-running cluster. If they have not been using VitessBackupSchedules and they want to start using them, there must be a usable backup that can serve as the basis for the first scheduled backup.

frouioui added 2 commits March 5, 2025 15:52
@frouioui merged commit 2a9fe95 into main on Mar 6, 2025; 12 checks passed. The use-vtbackup-in-scheduled-backups branch was deleted on March 6, 2025 at 17:30.