
feat(backup): extends backup manifest with info needed for 1-to-1 restore. #4177

Merged
merged 5 commits into master from va/extend-backup-manifest-part-3 on Jan 10, 2025

Conversation

VAveryanov8
Collaborator

@VAveryanov8 VAveryanov8 commented Dec 17, 2024

This adds the following data to the backup manifest:

General:

cluster_id: UUID of the cluster
dc: data center name
rack: rack from the Scylla configuration
node_id: ID of the Scylla node (equal to the host ID)
task_id: UUID of the backup task
snapshot_tag: snapshot tag
shard_count: number of shards on the Scylla node
cpu_count: number of CPUs available on the Scylla node
storage_size: total size of the disk in bytes

Instance Details:

cloud_provider: aws|gcp|azure, or empty in case of on-premise
instance_type: instance type, e.g. t2.nano, or empty when on-premise
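
For illustration, the new fields might map onto Go structs roughly as below. This is a sketch only: field and type names are assumptions based on the list above, and the real definitions live in pkg/service/backup/backupspec (which likely uses the project's own UUID type rather than plain strings).

```go
// Illustrative sketch only; not the actual backupspec definitions.
package backupspec

// nodeBackupInfo groups the general fields listed above.
type nodeBackupInfo struct {
	ClusterID   string `json:"cluster_id"`   // UUID of the cluster
	DC          string `json:"dc"`           // data center name
	Rack        string `json:"rack"`         // rack from the Scylla configuration
	NodeID      string `json:"node_id"`      // equal to the host ID
	TaskID      string `json:"task_id"`      // UUID of the backup task
	SnapshotTag string `json:"snapshot_tag"` // snapshot tag
	ShardCount  int    `json:"shard_count"`  // number of shards on the node
	CPUCount    int    `json:"cpu_count"`    // number of CPUs on the node
	StorageSize uint64 `json:"storage_size"` // total disk size in bytes

	InstanceDetails instanceDetails `json:"instance_details"`
}

// instanceDetails describes the cloud instance the node runs on.
// Both fields are empty for on-premise deployments.
type instanceDetails struct {
	CloudProvider string `json:"cloud_provider"` // aws|gcp|azure or ""
	InstanceType  string `json:"instance_type"`  // e.g. t2.nano or ""
}
```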

This also includes a bug fix in cloudmeta.GetInstanceMetadata(ctx): it adds a check for ctx cancellation.
This also includes fixes in unit tests related to NodeInfo.
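
A minimal sketch of the kind of cancellation check mentioned above, assuming a simple per-provider probe loop; the real cloudmeta implementation may structure provider probing differently.

```go
// Hypothetical sketch of the ctx cancellation check; names are assumptions.
package cloudmeta

import "context"

// InstanceMetadata is a stand-in for the data returned by a provider probe.
type InstanceMetadata struct {
	CloudProvider string
	InstanceType  string
}

// provider is an assumed abstraction over per-cloud metadata services.
type provider interface {
	Metadata(ctx context.Context) (InstanceMetadata, error)
}

// GetInstanceMetadata asks each provider in turn, but stops as soon as the
// caller's context is canceled instead of probing the remaining providers.
func GetInstanceMetadata(ctx context.Context, providers []provider) (InstanceMetadata, error) {
	var lastErr error
	for _, p := range providers {
		// The fix being described: honor ctx cancellation between probes.
		if err := ctx.Err(); err != nil {
			return InstanceMetadata{}, err
		}
		meta, err := p.Metadata(ctx)
		if err == nil {
			return meta, nil
		}
		lastErr = err
	}
	return InstanceMetadata{}, lastErr
}
```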

Fixes: #4130


Please make sure that:

  • Code is split to commits that address a single change
  • Commit messages are informative
  • Commit titles have module prefix
  • Commit titles have issue nr. suffix

@VAveryanov8 VAveryanov8 force-pushed the va/extend-backup-manifest-part-3 branch 2 times, most recently from cedece2 to acf312c Compare December 18, 2024 15:43
@VAveryanov8
Collaborator Author

If we intend to populate ManifestInfo from the file content instead of the file path and file name, then it is worth including snapshot_id and task_id in the manifest.

However, moving to populating ManifestInfo from the file content looks like a relatively significant change, considering that we need to preserve backward compatibility with "older" manifests.

@VAveryanov8 VAveryanov8 marked this pull request as ready for review December 18, 2024 16:34
@karol-kokoszka
Collaborator

If we intend to populate ManifestInfo from the file content instead of the file path and file name, then it is worth including snapshot_id and task_id in the manifest.

However, moving to populating ManifestInfo from the file content looks like a relatively significant change, considering that we need to preserve backward compatibility with "older" manifests.

I'm not sure I understand what you mean. We just want to add additional information to the manifest file without removing or changing anything.

You mean that if we want to include snapshot_id and task_id, then it's a significant change?

@VAveryanov8
Collaborator Author

VAveryanov8 commented Dec 18, 2024

I'm not sure I understand what you mean. We just want to add additional information to the manifest file without removing or changing anything.

You mean that if we want to include snapshot_id and task_id, then it's a significant change?

We briefly mentioned on a call that we may want to simplify how ManifestInfo is populated if all the needed info is contained in the manifest file.
So I'm just pointing out that if we want to do that, it will be a relatively significant change, plus snapshot_id and task_id are currently missing.

@Michal-Leszczynski
Collaborator

We briefly mentioned on a call that we may want to simplify how ManifestInfo is populated if all the needed info is contained in the manifest file.
So I'm just pointing out that if we want to do that, it will be a relatively significant change, plus snapshot_id and task_id are currently missing.

I guess that we can add them to the manifest file when uploading the manifest, but we can still set them in ManifestInfo when reading the manifest the same way as we do today - by parsing the path. This way these changes wouldn't require any adjustments in the code base, but they would make manifests more self-contained, which might be helpful in the future.
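
To make that concrete, a purely hypothetical sketch of the compromise (all names invented; the real code keeps using the existing path-parsing logic in backupspec): embed the IDs into the uploaded content, keep deriving them from the path on read.

```go
// Invented names, for illustration of the discussed approach only.
package backup

type manifestIDs struct {
	TaskID      string
	SnapshotTag string
}

// onUpload embeds the IDs into the manifest content written to object storage,
// making the manifest self-contained for future use.
func onUpload(content map[string]any, ids manifestIDs) {
	content["task_id"] = ids.TaskID
	content["snapshot_tag"] = ids.SnapshotTag
}

// onRead keeps today's behaviour: trust the values parsed from the remote path
// and ignore the embedded copies, so older manifests keep working unchanged.
func onRead(fromPath manifestIDs, _ map[string]any) manifestIDs {
	return fromPath
}
```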

@VAveryanov8
Collaborator Author

I've updated the PR - added snapshot_id and task_id; it's ready for review 👁️

@VAveryanov8
Collaborator Author

@Michal-Leszczynski @karol-kokoszka this PR is ready for review 👁️

go.mod (resolved)
pkg/cmd/agent/nodeinfo_linux.go (resolved)
pkg/service/backup/worker_manifest.go (outdated, resolved)
@VAveryanov8 VAveryanov8 force-pushed the va/extend-backup-manifest-part-3 branch from 164ff2c to 0599418 Compare December 23, 2024 10:54
@VAveryanov8 VAveryanov8 changed the base branch from master to va/extend-backup-manifest-agent-metadata December 23, 2024 10:54
@VAveryanov8
Collaborator Author

VAveryanov8 commented Dec 23, 2024

TODO before merge

Base automatically changed from va/extend-backup-manifest-agent-metadata to master January 3, 2025 08:26
…gger`.

This updates the scylla-manager module to the latest version of the `v3/swagger` package.
This extends the agent `/node_info` response with `storage_size` and
`data_directory` fields.
@VAveryanov8 VAveryanov8 force-pushed the va/extend-backup-manifest-part-3 branch from 0599418 to fe831e4 Compare January 3, 2025 08:42
Collaborator

@Michal-Leszczynski Michal-Leszczynski left a comment

I just started thinking about retries and timeouts when fetching cloud metadata.
The problem is that we have them configured on both the agent client (used by SM) and the object storage client (used by the agent), and it seems problematic.

My first feeling is that we should drop any retry and timeout handling on the agent side (as it acts as an advanced proxy) and rely on the configuration used by SM. I know that you were specifically asked to add the retries on the agent side, but I guess that we missed this problem before.

What are your opinions on this matter?
Maybe there is some other approach that solves this issue?

The current implementation of fetching cloud metadata just once at start-up solves this problem, but it has other drawbacks mentioned in the comment. Maybe this is the best solution, but could you elaborate on other possible approaches to this problem?
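
For reference, a minimal sketch of the fetch-once-at-start-up variant mentioned above; the type names and the timeout value are assumptions, not the agent's actual wiring.

```go
// Assumed names and timeout; shown only to anchor the discussion.
package agent

import (
	"context"
	"sync"
	"time"
)

// InstanceMetadata mirrors what the metadata endpoint would expose.
type InstanceMetadata struct {
	CloudProvider string
	InstanceType  string
}

// metadataCache fetches instance metadata once at start-up, with its own
// bounded timeout, and serves the cached result for the agent's lifetime.
type metadataCache struct {
	once sync.Once
	meta InstanceMetadata
}

func (c *metadataCache) Get(ctx context.Context, fetch func(context.Context) (InstanceMetadata, error)) InstanceMetadata {
	c.once.Do(func() {
		// Bound the start-up fetch so a slow metadata service cannot block the agent.
		fctx, cancel := context.WithTimeout(ctx, 5*time.Second)
		defer cancel()
		if m, err := fetch(fctx); err == nil {
			c.meta = m
		}
		// On error the zero value stays cached and is never refreshed,
		// which is the drawback mentioned in the comment above.
	})
	return c.meta
}
```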

pkg/scyllaclient/model.go (resolved)
pkg/cmd/agent/metadata.go (outdated, resolved)
pkg/cmd/agent/metadata.go (outdated, resolved)
pkg/scyllaclient/client_agent.go (outdated, resolved)
pkg/service/backup/backupspec/manifest.go (outdated, resolved)
pkg/service/backup/backupspec/manifest.go (outdated, resolved)
pkg/service/backup/worker_manifest.go (outdated, resolved)
@VAveryanov8
Collaborator Author

What are your opinions on this matter?
Maybe there is some other approach that solves this issue?

The current implementation of fetching cloud metadata just once at start-up solves this problem, but it has other drawbacks mentioned in the comment. Maybe this is the best solution, but could you elaborate on other possible approaches to this problem?

I think the most important part here is that we can't distinguish an error when metadata is not available because we are not in a cloud (or the user disabled metadata) from an error fetching metadata caused by an issue with the metadata service. For example, a timeout or a failure to establish a connection will look the same to us in both cases. That's why I think we should never return an error from the agent and should handle retries on its side.

From the API standpoint, if we delegate the responsibility for handling timeouts and retries to the SM side, then we lose the ability to have a separate retry configuration per cloud provider, and it becomes much harder to determine the root cause of a timeout - a network issue between SM and the agent, or the cloud metadata service experiencing problems.
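
A rough sketch of what "never return an error from the agent" could look like for the metadata endpoint; the handler shape, type names, and timeout are assumptions, not the actual implementation.

```go
// Assumed shape of the agent-side handler; illustrative only.
package agent

import (
	"context"
	"encoding/json"
	"net/http"
	"time"
)

type instanceMetadata struct {
	CloudProvider string `json:"cloud_provider"`
	InstanceType  string `json:"instance_type"`
}

// cloudMetadataHandler replies with whatever metadata could be fetched.
// On any failure it returns an empty payload instead of an error, so SM does
// not have to guess whether "error" meant "not in a cloud" or "metadata
// service is down". Per-provider retries live inside fetch, on the agent side.
func cloudMetadataHandler(fetch func(context.Context) (instanceMetadata, error)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
		defer cancel()

		meta, err := fetch(ctx)
		if err != nil {
			meta = instanceMetadata{} // treat as on-premise / unknown
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(meta)
	}
}
```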

Collaborator

@karol-kokoszka karol-kokoszka left a comment

Thanks!
Please fix the panic.

Additionally, please harden some of the backup integration tests so that they also check the generated manifest content.

Backup integration tests are in pkg/service/backup/service_backup_integration_test.go.
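
The kind of check being asked for might look roughly like this; the type and helper names are made up, and the real assertions belong in the existing integration test file.

```go
// Invented helper for illustration; field names follow the PR description.
package backup

import "testing"

// manifestContent mirrors a subset of the new manifest fields.
type manifestContent struct {
	ClusterID, DC, Rack, NodeID string
	TaskID, SnapshotTag         string
	ShardCount, CPUCount        int
	StorageSize                 uint64
	CloudProvider, InstanceType string
}

// assertManifestNodeInfo verifies that the new fields are populated in a
// manifest read back from object storage after a backup run.
func assertManifestNodeInfo(t *testing.T, m manifestContent) {
	t.Helper()
	if m.ClusterID == "" || m.DC == "" || m.NodeID == "" || m.Rack == "" {
		t.Errorf("manifest missing node identity: %+v", m)
	}
	if m.TaskID == "" || m.SnapshotTag == "" {
		t.Errorf("manifest missing task identity: %+v", m)
	}
	if m.ShardCount <= 0 || m.CPUCount <= 0 || m.StorageSize == 0 {
		t.Errorf("manifest missing hardware info: %+v", m)
	}
	// CloudProvider/InstanceType may legitimately be empty on-premise,
	// so they are not asserted here.
}
```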

Collaborator

@karol-kokoszka karol-kokoszka left a comment

👍 from my side, but please address the comment about manifest validation in integration tests, and get the +1 from @Michal-Leszczynski

pkg/cmd/agent/metadata.go (outdated, resolved)
pkg/cmd/agent/server.go (outdated, resolved)
pkg/cloudmeta/metadata.go (outdated, resolved)
pkg/cloudmeta/metadata.go (outdated, resolved)
Collaborator

@Michal-Leszczynski Michal-Leszczynski left a comment

LGTM!

@VAveryanov8
Collaborator Author

@karol-kokoszka @Michal-Leszczynski

I've updated the integration test with additional checks for the new manifest fields. If you don't have any comments regarding it, I can proceed with squashing and merging 😄

@karol-kokoszka
Collaborator

Yes, proceed.

This fixes the issue where the context passed to GetInstanceMetadata is
canceled before any of the provider's functions return.
This extends the agent server with a `/cloud/metadata` endpoint which returns
instance details such as `cloud_provider` and `instance_type`.
This adds the following data to the backup manifest:
General:
  cluster_id: UUID of the cluster
  dc: data center name
  rack: rack from the Scylla configuration
  node_id: ID of the Scylla node (equal to the host ID)
  task_id: UUID of the backup task
  snapshot_tag: snapshot tag
  shard_count: number of shards on the Scylla node
  cpu_count: number of CPUs on the Scylla node
  storage_size: total size of the disk in bytes
Instance Details:
  cloud_provider: aws|gcp|azure, or empty in case of on-premise
  instance_type: instance type, e.g. t2.nano, or empty when on-premise

Fixes: #4130
@VAveryanov8 VAveryanov8 force-pushed the va/extend-backup-manifest-part-3 branch from fcbd42e to a592901 Compare January 10, 2025 09:03
@VAveryanov8 VAveryanov8 merged commit f7f9ac2 into master Jan 10, 2025
51 checks passed
@VAveryanov8 VAveryanov8 deleted the va/extend-backup-manifest-part-3 branch January 10, 2025 14:10