
feat(backup): extends backup manifest with info needed for 1-to-1 restore. #4177

Merged
merged 5 commits into master from va/extend-backup-manifest-part-3 on Jan 10, 2025

Conversation

VAveryanov8
Collaborator

@VAveryanov8 VAveryanov8 commented Dec 17, 2024

This adds the following data to the backup manifest:

General:

cluster_id: UUID of the cluster
dc: data center name
rack: rack from the Scylla configuration
node_id: ID of the Scylla node (equal to the host ID)
task_id: UUID of the backup task
snapshot_tag: snapshot tag
shard_count: number of shards on the Scylla node
cpu_count: number of CPUs available on the Scylla node
storage_size: total size of the disk in bytes

Instance Details:

cloud_provider: aws|gcp|azure, or empty in case of on-premise
instance_type: instance type, e.g. t2.nano, or empty when on-premise
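
For illustration, the new fields might map onto Go structs roughly as below. This is a sketch only: field and type names are assumptions based on the list above, and the real definitions live in pkg/service/backup/backupspec (which likely uses the project's own UUID type rather than plain strings).

```go
// Illustrative sketch only; not the actual backupspec definitions.
package backupspec

// nodeBackupInfo groups the general fields listed above.
type nodeBackupInfo struct {
	ClusterID   string `json:"cluster_id"`   // UUID of the cluster
	DC          string `json:"dc"`           // data center name
	Rack        string `json:"rack"`         // rack from the Scylla configuration
	NodeID      string `json:"node_id"`      // equal to the host ID
	TaskID      string `json:"task_id"`      // UUID of the backup task
	SnapshotTag string `json:"snapshot_tag"` // snapshot tag
	ShardCount  int    `json:"shard_count"`  // number of shards on the node
	CPUCount    int    `json:"cpu_count"`    // number of CPUs on the node
	StorageSize uint64 `json:"storage_size"` // total disk size in bytes

	InstanceDetails instanceDetails `json:"instance_details"`
}

// instanceDetails describes the cloud instance the node runs on.
// Both fields are empty for on-premise deployments.
type instanceDetails struct {
	CloudProvider string `json:"cloud_provider"` // aws|gcp|azure or ""
	InstanceType  string `json:"instance_type"`  // e.g. t2.nano or ""
}
```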

This also includes a bug fix in cloudmeta.GetInstanceMetadata(ctx): it adds a check for ctx cancellation.
This also includes fixes in unit tests related to NodeInfo.
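
A minimal sketch of the kind of cancellation check mentioned above, assuming a simple per-provider probe loop; the real cloudmeta implementation may structure provider probing differently.

```go
// Hypothetical sketch of the ctx cancellation check; names are assumptions.
package cloudmeta

import "context"

// InstanceMetadata is a stand-in for the data returned by a provider probe.
type InstanceMetadata struct {
	CloudProvider string
	InstanceType  string
}

// provider is an assumed abstraction over per-cloud metadata services.
type provider interface {
	Metadata(ctx context.Context) (InstanceMetadata, error)
}

// GetInstanceMetadata asks each provider in turn, but stops as soon as the
// caller's context is canceled instead of probing the remaining providers.
func GetInstanceMetadata(ctx context.Context, providers []provider) (InstanceMetadata, error) {
	var lastErr error
	for _, p := range providers {
		// The fix being described: honor ctx cancellation between probes.
		if err := ctx.Err(); err != nil {
			return InstanceMetadata{}, err
		}
		meta, err := p.Metadata(ctx)
		if err == nil {
			return meta, nil
		}
		lastErr = err
	}
	return InstanceMetadata{}, lastErr
}
```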

Fixes: #4130


Please make sure that:

  • Code is split to commits that address a single change
  • Commit messages are informative
  • Commit titles have module prefix
  • Commit titles have issue nr. suffix

@VAveryanov8 VAveryanov8 force-pushed the va/extend-backup-manifest-part-3 branch 2 times, most recently from cedece2 to acf312c Compare December 18, 2024 15:43
@VAveryanov8
Collaborator Author

If we intend to populate ManifestInfo from the file content instead of the file path and file name, then it is worth including snapshot_id and task_id in the manifest.

However, moving to populating ManifestInfo from the file content looks like a relatively significant change, considering that we need to preserve backward compatibility with "older" manifests.

@VAveryanov8 VAveryanov8 marked this pull request as ready for review December 18, 2024 16:34
@karol-kokoszka
Collaborator

If we intend to populate ManifestInfo from the file content instead of the file path and file name, then it is worth including snapshot_id and task_id in the manifest.

However, moving to populating ManifestInfo from the file content looks like a relatively significant change, considering that we need to preserve backward compatibility with "older" manifests.

I'm not sure I understand what you mean. We just want to add additional information to the manifest file without removing or changing anything.

You mean that if we want to include snapshot_id and task_id, then it's a significant change?

@VAveryanov8
Collaborator Author

VAveryanov8 commented Dec 18, 2024

I'm not sure I understand what you mean. We just want to add additional information to the manifest file without removing or changing anything.

You mean that if we want to include snapshot_id and task_id, then it's a significant change?

We briefly mentioned on a call that we may want to simplify how ManifestInfo is populated if all the needed info is contained in the manifest file.
So I'm just pointing out that if we want to do that, it will be a relatively significant change, plus snapshot_id and task_id are currently missing.

@Michal-Leszczynski
Collaborator

We briefly mentioned on a call that we may want to simplify how ManifestInfo is populated if all the needed info is contained in the manifest file.
So I'm just pointing out that if we want to do that, it will be a relatively significant change, plus snapshot_id and task_id are currently missing.

I guess that we can add them to the manifest file when uploading the manifest, but we can still set them in ManifestInfo when reading the manifest the same way as we do today - by parsing the path. This way these changes wouldn't require any adjustments in the code base, but they would make manifests more self-contained, which might be helpful in the future.
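
To make that concrete, a purely hypothetical sketch of the compromise (all names invented; the real code keeps using the existing path-parsing logic in backupspec): embed the IDs into the uploaded content, keep deriving them from the path on read.

```go
// Invented names, for illustration of the discussed approach only.
package backup

type manifestIDs struct {
	TaskID      string
	SnapshotTag string
}

// onUpload embeds the IDs into the manifest content written to object storage,
// making the manifest self-contained for future use.
func onUpload(content map[string]any, ids manifestIDs) {
	content["task_id"] = ids.TaskID
	content["snapshot_tag"] = ids.SnapshotTag
}

// onRead keeps today's behaviour: trust the values parsed from the remote path
// and ignore the embedded copies, so older manifests keep working unchanged.
func onRead(fromPath manifestIDs, _ map[string]any) manifestIDs {
	return fromPath
}
```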

@VAveryanov8
Collaborator Author

I've updated the PR - added snapshot_id and task_id; it's ready for review 👁️

@VAveryanov8
Collaborator Author

@Michal-Leszczynski @karol-kokoszka this PR is ready for review 👁️

go.mod (resolved)
pkg/cmd/agent/nodeinfo_linux.go (resolved)
pkg/service/backup/worker_manifest.go (outdated, resolved)
@VAveryanov8 VAveryanov8 force-pushed the va/extend-backup-manifest-part-3 branch from 164ff2c to 0599418 Compare December 23, 2024 10:54
@VAveryanov8 VAveryanov8 changed the base branch from master to va/extend-backup-manifest-agent-metadata December 23, 2024 10:54
@VAveryanov8
Collaborator Author

VAveryanov8 commented Dec 23, 2024

TODO before merge

Base automatically changed from va/extend-backup-manifest-agent-metadata to master January 3, 2025 08:26
…gger`.

This updates the scylla-manager module to the latest version of the `v3/swagger` package.
This extends the agent `/node_info` response with `storage_size` and
`data_directory` fields.
@VAveryanov8 VAveryanov8 force-pushed the va/extend-backup-manifest-part-3 branch from 0599418 to fe831e4 Compare January 3, 2025 08:42
Collaborator

@Michal-Leszczynski Michal-Leszczynski left a comment

I just started thinking about retries and timeouts when fetching cloud metadata.
The problem is that we have them configured on both the agent client (used by SM) and the object storage client (used by the agent), and it seems problematic.

My first feeling is that we should drop any retry and timeout handling on the agent side (as it acts as an advanced proxy) and rely on the configuration used by SM. I know that you were specifically asked to add the retries on the agent side, but I guess that we missed this problem before.

What are your opinions on this matter?
Maybe there is some other approach that solves this issue?

The current implementation of fetching cloud metadata just once at start-up solves this problem, but it has other drawbacks mentioned in the comment. Maybe this is the best solution, but could you elaborate on other possible approaches to this problem?
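
For reference, a minimal sketch of the fetch-once-at-start-up variant mentioned above; the type names and the timeout value are assumptions, not the agent's actual wiring.

```go
// Assumed names and timeout; shown only to anchor the discussion.
package agent

import (
	"context"
	"sync"
	"time"
)

// InstanceMetadata mirrors what the metadata endpoint would expose.
type InstanceMetadata struct {
	CloudProvider string
	InstanceType  string
}

// metadataCache fetches instance metadata once at start-up, with its own
// bounded timeout, and serves the cached result for the agent's lifetime.
type metadataCache struct {
	once sync.Once
	meta InstanceMetadata
}

func (c *metadataCache) Get(ctx context.Context, fetch func(context.Context) (InstanceMetadata, error)) InstanceMetadata {
	c.once.Do(func() {
		// Bound the start-up fetch so a slow metadata service cannot block the agent.
		fctx, cancel := context.WithTimeout(ctx, 5*time.Second)
		defer cancel()
		if m, err := fetch(fctx); err == nil {
			c.meta = m
		}
		// On error the zero value stays cached and is never refreshed,
		// which is the drawback mentioned in the comment above.
	})
	return c.meta
}
```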

pkg/scyllaclient/model.go (resolved)
pkg/cmd/agent/metadata.go (outdated, resolved)
pkg/cmd/agent/metadata.go (outdated, resolved)
pkg/scyllaclient/client_agent.go (outdated, resolved)
pkg/service/backup/backupspec/manifest.go (outdated, resolved)
pkg/service/backup/backupspec/manifest.go (outdated, resolved)
pkg/service/backup/worker_manifest.go (outdated, resolved)
@VAveryanov8
Collaborator Author

What are your opinions on this matter?
Maybe there is some other approach that solves this issue?

The current implementation of fetching cloud metadata just once at start-up solves this problem, but it has other drawbacks mentioned in the comment. Maybe this is the best solution, but could you elaborate on other possible approaches to this problem?

I think the most important part here is that we can't distinguish an error when metadata is not available because we are not in a cloud (or the user disabled metadata) from an error fetching metadata caused by an issue with the metadata service. For example, a timeout or a failure to establish a connection will look the same to us in both cases. That's why I think we should never return an error from the agent and should handle retries on its side.

From the API standpoint, if we delegate the responsibility for handling timeouts and retries to the SM side, then we lose the ability to have a separate retry configuration per cloud provider, and it becomes much harder to determine the root cause of a timeout - a network issue between SM and the agent, or the cloud metadata service experiencing problems.
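
A rough sketch of what "never return an error from the agent" could look like for the metadata endpoint; the handler shape, type names, and timeout are assumptions, not the actual implementation.

```go
// Assumed shape of the agent-side handler; illustrative only.
package agent

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

type instanceMetadata struct {
	CloudProvider string `json:"cloud_provider"`
	InstanceType  string `json:"instance_type"`
}

// cloudMetadataHandler replies with whatever metadata could be fetched.
// On any failure it returns an empty payload instead of an error, so SM does
// not have to guess whether "error" meant "not in a cloud" or "metadata
// service is down". Per-provider retries live inside fetch, on the agent side.
func cloudMetadataHandler(fetch func(context.Context) (instanceMetadata, error)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
		defer cancel()

		meta, err := fetch(ctx)
		if err != nil {
			meta = instanceMetadata{} // treat as on-premise / unknown
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(meta)
	}
}
```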

Collaborator

@karol-kokoszka karol-kokoszka left a comment

Thanks!
Please fix the panic.

Additionally, please harden some of the backup integration tests so that they also check the generated manifest content.

Backup integration tests are in pkg/service/backup/service_backup_integration_test.go.
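
The kind of check being asked for might look roughly like this; the type and helper names are made up, and the real assertions belong in the existing integration test file.

```go
// Invented helper for illustration; field names follow the PR description.
package backup

import "testing"

// manifestContent mirrors a subset of the new manifest fields.
type manifestContent struct {
	ClusterID, DC, Rack, NodeID string
	TaskID, SnapshotTag         string
	ShardCount, CPUCount        int
	StorageSize                 uint64
	CloudProvider, InstanceType string
}

// assertManifestNodeInfo verifies that the new fields are populated in a
// manifest read back from object storage after a backup run.
func assertManifestNodeInfo(t *testing.T, m manifestContent) {
	t.Helper()
	if m.ClusterID == "" || m.DC == "" || m.NodeID == "" || m.Rack == "" {
		t.Errorf("manifest missing node identity: %+v", m)
	}
	if m.TaskID == "" || m.SnapshotTag == "" {
		t.Errorf("manifest missing task identity: %+v", m)
	}
	if m.ShardCount <= 0 || m.CPUCount <= 0 || m.StorageSize == 0 {
		t.Errorf("manifest missing hardware info: %+v", m)
	}
	// CloudProvider/InstanceType may legitimately be empty on-premise,
	// so they are not asserted here.
}
```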

Collaborator

@karol-kokoszka karol-kokoszka left a comment

👍 from my side, but please address the comment about manifest validation in integration tests, and get the +1 from @Michal-Leszczynski

pkg/cmd/agent/metadata.go (outdated, resolved)
pkg/cmd/agent/server.go (outdated, resolved)
pkg/cloudmeta/metadata.go (outdated, resolved)
pkg/cloudmeta/metadata.go (outdated, resolved)
Collaborator

@Michal-Leszczynski Michal-Leszczynski left a comment

LGTM!

@VAveryanov8
Collaborator Author

@karol-kokoszka @Michal-Leszczynski

I've updated the integration test with additional checks for the new manifest fields. If you don't have any comments regarding it, I can proceed with squashing and merging 😄

@karol-kokoszka
Collaborator

Yes, proceed.

This fixes the issue where the context passed to GetInstanceMetadata is
canceled before any of the provider's functions return.
This extends the agent server with a `/cloud/metadata` endpoint which returns
instance details such as `cloud_provider` and `instance_type`.
This adds the following data to the backup manifest:
General:
  cluster_id: UUID of the cluster
  dc: data center name
  rack: rack from the Scylla configuration
  node_id: ID of the Scylla node (equal to the host ID)
  task_id: UUID of the backup task
  snapshot_tag: snapshot tag
  shard_count: number of shards on the Scylla node
  cpu_count: number of CPUs on the Scylla node
  storage_size: total size of the disk in bytes
Instance Details:
  cloud_provider: aws|gcp|azure, or empty in case of on-premise
  instance_type: instance type, e.g. t2.nano, or empty when on-premise

Fixes: #4130
@VAveryanov8 VAveryanov8 force-pushed the va/extend-backup-manifest-part-3 branch from fcbd42e to a592901 Compare January 10, 2025 09:03
@VAveryanov8 VAveryanov8 merged commit f7f9ac2 into master Jan 10, 2025
51 checks passed
@VAveryanov8 VAveryanov8 deleted the va/extend-backup-manifest-part-3 branch January 10, 2025 14:10