close
Skip to content

chore: support dataprotection restore API decoupling#10192

Open
leon-ape wants to merge 30 commits into
mainfrom
support/dp-restore-api-decoupling
Open

chore: support dataprotection restore API decoupling#10192
leon-ape wants to merge 30 commits into
mainfrom
support/dp-restore-api-decoupling

Conversation

@leon-ape
Copy link
Copy Markdown
Contributor

@leon-ape leon-ape commented May 4, 2026

Summary

This PR switches cluster restore to the Cluster.spec.restore intent flow and removes the previous apps-owned DP restore execution path.

The restore flow now uses explicit Kubernetes APIs:

  • users or Restore ops create a target Cluster with spec.restore
  • apps projects the restore intent onto initial PVC templates through spec.dataSourceRef.kind=Backup plus apps-owned restore metadata annotations
  • DP watches Backup-kind PVCs and performs the actual data/account restore
  • DP writes PVC restore conditions
  • apps aggregates PVC restore conditions back to Component/Cluster restore conditions
  • Restore ops waits for the apps restore condition and the final target Cluster phase

This keeps apps and ops interacting with data protection through API contracts rather than through controller-internal DP restore calls.

Design

  • Add apps.kubeblocks.io/v1.Cluster.spec.restore as the user-facing restore intent.
  • Restore ops keeps its existing user-facing operation, but now reconstructs the target Cluster from the Backup snapshot and sets spec.restore; it no longer writes kubeblocks.io/restore-from-backup.
  • apps does not import or invoke DP restore execution logic. It only translates Cluster.spec.restore into generic PVC restore metadata:
    • PVC.spec.dataSourceRef.apiGroup
    • PVC.spec.dataSourceRef.kind
    • PVC.spec.dataSourceRef.name
    • restore source namespace, PITR, parameters, component, and volume template annotations
  • DP owns the Backup runtime implementation by watching PVCs whose dataSourceRef points at dataprotection.kubeblocks.io/Backup.
  • DP creates internal Restore CRs as implementation details to reuse the existing restore/job path; internal Restore CRs are not apps or ops input.
  • DP restores system account data from Backup before or during volume restore so apps can continue to provision final account Secrets through its normal flow.
  • apps records restore progress through Component and Cluster restore conditions based on PVC restore conditions.
  • Once restore is completed, apps cleans restore dataSourceRef/metadata from PVC templates so later scale-out does not re-enter restore.

Changes

  • Add Cluster.spec.restore, restore source fields, PITR string, and parameters map[string]string to the apps v1 API.
  • Update apps cluster normalization to inject restore intent into PVC templates and clean it after restore completion.
  • Update apps cluster and component status reconciliation to aggregate restore conditions from PVC conditions.
  • Update Restore ops to emit Cluster.spec.restore instead of the legacy restore annotation and to wait on restore condition plus target Cluster phase.
  • Update DP VolumePopulator to handle Backup-kind PVCs as the restore runtime entry:
    • build an internal Restore CR from PVC/Backup metadata
    • wait for backup repo preparation before creating restore jobs
    • restore system account source data
    • populate into temporary PVCs and rebind to workload PVCs
    • write PVC restore conditions
    • clean up populate jobs/PVCs after completion
  • Update restore examples to use Cluster.spec.restore.
  • Clean up test-only DP imports from apps restore intent/workload propagation tests.

Out of Scope

  • Existing apps backup-management coupling is unchanged and should be handled separately.
  • Scale-out during an active initial restore is assumed out of scope for this PR.
  • DP internal Restore CR support remains as an implementation detail.

Validation

  • make doc
  • go test ./controllers/dataprotection -run TestAPIs
  • go test ./pkg/operations -run TestAPIs
  • go test ./controllers/apps/cluster/...
  • go test ./controllers/apps/component/...
  • go test ./pkg/controller/component ./controllers/apps/cluster -run 'Test(...)'
  • git diff --check

@apecloud-bot
Copy link
Copy Markdown
Collaborator

Auto Cherry-pick Instructions

Usage:
  - /nopick: Not auto cherry-pick when PR merged.
  - /pick: release-x.x [release-x.x]: Auto cherry-pick to the specified branch when PR merged.

Example:
  - /nopick
  - /pick release-1.1

@github-actions github-actions Bot added the size/XXL Denotes a PR that changes 1000+ lines. label May 4, 2026
@leon-ape leon-ape changed the title Support DP restore API decoupling chore: support dataprotection restore API decoupling May 4, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 4, 2026

Codecov Report

❌ Patch coverage is 56.41158% with 843 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.60%. Comparing base (1b25be2) to head (d39142b).

Files with missing lines Patch % Lines
...llers/dataprotection/volumepopulator_controller.go 53.75% 452 Missing and 120 partials ⚠️
apis/apps/v1/retention_period.go 0.00% 87 Missing ⚠️
...rollers/apps/cluster/transformer_cluster_status.go 58.22% 53 Missing and 13 partials ⚠️
...ers/apps/component/transformer_component_status.go 73.80% 24 Missing and 9 partials ⚠️
controllers/dataprotection/backup_controller.go 67.27% 11 Missing and 7 partials ⚠️
...ps/cluster/transformer_cluster_sharding_account.go 62.22% 13 Missing and 4 partials ⚠️
pkg/operations/restore.go 81.17% 10 Missing and 6 partials ⚠️
controllers/apps/cluster/restore_intent.go 84.61% 11 Missing and 3 partials ⚠️
...mponent/transformer_component_account_provision.go 42.85% 8 Missing ⚠️
...rs/apps/component/transformer_component_account.go 66.66% 4 Missing and 2 partials ⚠️
... and 2 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10192      +/-   ##
==========================================
+ Coverage   52.47%   52.60%   +0.12%     
==========================================
  Files         532      533       +1     
  Lines       61230    62459    +1229     
==========================================
+ Hits        32133    32855     +722     
- Misses      25828    26231     +403     
- Partials     3269     3373     +104     
Flag Coverage Δ
unittests 52.60% <56.41%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@leon-ape leon-ape added the nopick Not auto cherry-pick when PR merged label May 9, 2026
@leon-ape leon-ape added this to the Release 1.2.0 milestone May 9, 2026
@leon-ape leon-ape marked this pull request as ready for review May 9, 2026 04:21
if !apierrors.IsNotFound(err) {
return intctrlutil.RequeueWithError(err, reqCtx.Log, "")
}
target, err = r.buildTargetCluster(reqCtx, clusterRestore, backup, restoreTime)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clusterRestore need refer cluster.spec to create cluster, not only from backup snapshot

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

//
// +kubebuilder:pruning:PreserveUnknownFields
// +kubebuilder:validation:Required
Spec runtime.RawExtension `json:"spec"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not use cluster spec ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Copy Markdown
Contributor

@weicao weicao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DP VolumePopulator review: one blocker in the new Backup-kind PVC path.

In Populate, populatePVC is only assigned inside the for _, v := range restoreMgr.PrepareDataBackupSets loop, but after the loop we always call rebindPVCAndPV(reqCtx, populatePVC, pvc). If the restore decision chooses RestoreData while ValidateAndInitRestoreMGR produces zero PrepareDataBackupSets, this will panic on populatePVC.Spec.VolumeName.

This can happen in the new path because decidePVCRestore chooses RestoreData from backup.Status.BackupMethod.TargetVolumes containing the PVC volume, while RestoreManager.SetBackupSets only appends a prepareData backup set when the backup uses volume snapshot or the ActionSet has a prepareData stage. A target volume without a usable prepareData path should fail closed before Populate, not reach a nil dereference.

Please add an explicit guard after restore manager init, for example when decision.mode == pvcRestoreModeRestoreData && len(restoreMgr.PrepareDataBackupSets) == 0, return a fatal error with a clear message, plus a unit test for the Backup-kind PVC path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

nopick Not auto cherry-pick when PR merged size/XXL Denotes a PR that changes 1000+ lines.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants