-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use vtbackup in scheduled backups #658
Conversation
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! ❤️ I only had a few minor comments/questions.
pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go
Outdated
Show resolved
Hide resolved
Signed-off-by: Florent Poinsard <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only had a few comments on the diff in this PR, but it's unclear to me how AllowFirstBackup works. Is that functionality that was already present, and hence not showing up as a diff in this PR? If I'm just missing it, please point me to the file where I can see it.
- “Allow” (default): allows CronJobs to run concurrently; | ||
- “Forbid”: forbids concurrent runs, skipping next run if previous run hasn’t finished yet; | ||
- “Allow”: allows CronJobs to run concurrently; | ||
- “Forbid” (default): forbids concurrent runs, skipping next run if previous run hasn’t finished yet; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
docs/api/index.html
Outdated
@@ -2602,7 +2592,8 @@ <h3 id="planetscale.com/v2.VitessBackupScheduleStatus">VitessBackupScheduleStatu | |||
</td> | |||
<td> | |||
<em>(Optional)</em> | |||
<p>A list of pointers to currently running jobs.</p> | |||
<p>A list of pointers to currently running jobs. | |||
This field is deprecated and no longer used in >= v2.15. It will be removed in a future release.</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this file generated? I'm not repeating my comments from api.md here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs/api.md
and docs/api/index.html
are both generated using the same source.
@deepthi
vtbackup pod. The decision whether to set the flag or not is done in vtbackupSpec .
|
We have this constant:
That is used in both places, once for the vtbackup init pod, and another for the scheduled backup: vitess-operator/pkg/controller/vitessbackupschedule/vitessbackupschedule_controller.go Line 611 in 072d109
This flag will be used if: we have no backup and we already have a primary running in the shard. |
Based on an out-of-band discussion, we should not allow people to take "first" backups on a long-running cluster. If they have not been using VItessBackupSchedules and they want to start using them, there must be a usable backup that can be used as the basis for the first scheduled backup. |
Signed-off-by: Florent Poinsard <[email protected]>
Signed-off-by: Florent Poinsard <[email protected]>
Context
This Pull Request modifies the scheduled backup feature that was introduced in #553. The implementation in the original PR is rather simple and was a "v1". The CRD
VitessBackupSchedule
was introduced, giving the option to end-users to configure differentVitessBackupScheduleStrategy
, which instruct vitess-operator which keyspace/shard the user wants to backup. All the strategies in aVitessBackupSchedule
were merged together to form a singlevtctldclient
command that would be executed in a K8S'sJob
following the schedule set in theVitessBackupSchedule
CRD.While this approach works, it has limitation: it uses
vtctldclient Backup
which will temporarily take out a serving tablet from the serving tablet pool and use it to take the backup. While this limitation is documented in the release notes, we want to encourage people to usevtbackup
as an alternative.Proposed Change
This Pull Request modifies the
VitessBackupSchedule
controller to create one KubernetesJob
perVitessBackupScheduleStrategy
defined theVitessBackupSchedule
by the user - instead of merging all strategies into a singleJob
. EachJob
will now create avtbackup
pod - instead of creating a pod where avtctldclient
command was executed.Changes
Since
vtbackup
can only backup a single keyspace/shard, we need oneJob
per strategy, meaning that now eachVitessBackupSchedule
manages multipleJob
resources (one per strategy), which was not the case before.The previous code logic relied heavily on the CRD's Status to schedule next runs, this logic was removed to make Status purely informative to only the end users, not the vitess-operator. Moreover, the previous logic was updating the Status of the resource on, almost, every request without knowing if we actually need to update the Status. This was modified to only update the Status when needed and make it ignore conflicts.
The reconciling loop of
VitessBackupSchedule
was, at times, not returning the correct result+error expected by the controller-runtimeTypedReconciler
interface, leading to unnecessary re-queue of requests or missed re-queues. The logic has been modified to return correct values. A periodic resync of the controller has also been added to safeguard potential issues and to ensure resources (Jobs
, PVCs, etc) are cleaned up correctly no matter what.The upgrade test has been modified to also run the
401_scheduled_backups.yaml
manifest.The default
ConcurrentPolicy
of theJobs
has been modified fromAllow
toForbid
. With the new strategies requiring far more resources and time to complete, this new default makes sense. This value is still configurable by end-users.Previously, the
vtbackup
created with everyVitessShard
was always attempting to do an initial backup (--initial_backup
), which would fail if theVitessBackup
definition was added after theVitessShard
started running. That logic has been slightly modified so the vitess-operator is more aware of the current state of theVitessShard
before creating thevtbackup
pod. If there is already a primary in the shard, it will not be an initial backup, but an--allow_first_backup
backup - as we only create this pod if there is no backup for this shard. TheVitessBackupSchedule
relies on the same logic to determine the type of backup it has to run, either: initial, allow first, or normal. Concurrency issues between the initialvtbackup
pod and the first scheduled run of aVitessBackupSchedule
is, depending on the schedule, likely. In such situation bothvtbackup
pods come up with the--initial_backup
flag set to true. One of the two pods will simply fail early.Related PRs
vtbackup
in scheduled backups vitessio/website#1946vtbackup
in scheduled backups vitessio/vitess#17869