feat(backup): extends backup manifest with info needed for 1-to-1 restore. #4177
Conversation
Force-pushed from cedece2 to acf312c
If we intend to populate ManifestInfo from file content instead of the file path and file name, then it's worth including snapshot_id and task_id as well. However, moving to populating ManifestInfo from file content looks like a relatively significant change, considering that we need to preserve backward compatibility with "older" manifests.
I'm not sure I understand what you mean. We just want to add additional information to the manifest file without removing or changing anything. Do you mean that if we want to include snapshot_id and task_id, then it's a significant change?
We briefly mentioned on a call that we may want to simplify how ManifestInfo is populated if all the needed info is contained in the manifest file.
I guess that we can add them to the manifest file when uploading the manifest, but we can set them in
I've updated the PR and added snapshot_id and task_id; it's ready for review 👁️
@Michal-Leszczynski @karol-kokoszka this PR is ready for review 👁️
Force-pushed from 164ff2c to 0599418
TODO before merge
This updates the scylla-manager module to the latest version of the `v3/swagger` package.
This extends the agent `/node_info` response with `storage_size` and `data_directory` fields.
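For illustration, a minimal sketch of how the extended `/node_info` payload could be modeled as a Go struct with JSON tags; only `storage_size` and `data_directory` come from this commit, while the struct name, the remaining fields, and the field types are assumptions and do not mirror the real `v3/swagger` models:

```go
// nodeInfo is a hypothetical, trimmed-down model of the /node_info response.
type nodeInfo struct {
	// Illustrative subset of pre-existing fields.
	CPUCount   int64 `json:"cpu_count,omitempty"`
	ShardCount int64 `json:"shard_count,omitempty"`

	// Fields added by this change.
	StorageSize   uint64 `json:"storage_size,omitempty"`   // total disk size in bytes
	DataDirectory string `json:"data_directory,omitempty"` // scylla data directory (assumed to be a single path)
}
```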
Force-pushed from 0599418 to fe831e4
I just started thinking about re-tries and timeouts when fetching cloud metadata.
The problem is that we have them configured on both the agent client (used by SM) and the object storage client (used by the agent), which seems problematic.
My first feeling is that we should drop any retry and timeout handling on the agent side (as it acts as an advanced proxy) and rely on the configuration used by SM. I know that you were specifically asked to add the retries on the agent side, but I guess that we missed this problem before.
What are your opinions on this matter?
Maybe there is some other approach that solves this issue?
The current implementation of fetching cloud metadata just once at the start-up solves this problem, but it has other drawbacks mentioned in the comment. Maybe this is the best solution, but could you elaborate on other, possible approaches to this problem?
I think the most important part here is that we can't distinguish an error caused by metadata not being available because we are not in a cloud (or the user disabled metadata) from an error caused by an issue with the metadata service. For example, a timeout or a failure to establish a connection looks the same to us in both cases. That's why I think we should never return an error from the agent and should handle retries on its side. From the API standpoint, if we delegate the responsibility for handling timeouts and retries to the SM side, then we lose the ability to have a separate retry configuration per cloud provider, and it becomes much harder to determine the root cause of a timeout: a network issue between SM and the agent, or the cloud metadata service experiencing problems.
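A minimal sketch of the approach argued for here, assuming the agent keeps its own retries and never surfaces an error to SM; the package layout, the `InstanceMetadata` type, and the `fetchFromProvider` helper are hypothetical and are not the actual cloudmeta API:

```go
package cloudmeta

import (
	"context"
	"errors"
	"time"
)

// InstanceMetadata is a hypothetical result type; empty values mean
// "not on a known cloud" or "metadata currently unavailable".
type InstanceMetadata struct {
	CloudProvider string `json:"cloud_provider"`
	InstanceType  string `json:"instance_type"`
}

// fetchFromProvider stands in for the real provider-specific call
// (AWS/GCP/Azure metadata endpoint).
func fetchFromProvider(ctx context.Context) (InstanceMetadata, error) {
	return InstanceMetadata{}, errors.New("not implemented")
}

// getMetadataWithRetries keeps retries on the agent side and never returns
// an error: both "not in a cloud" and "metadata service is flaky" surface
// to SM as empty metadata, which is exactly the ambiguity described above.
func getMetadataWithRetries(ctx context.Context, attempts int, backoff time.Duration) InstanceMetadata {
	for i := 0; i < attempts; i++ {
		meta, err := fetchFromProvider(ctx)
		if err == nil {
			return meta
		}
		select {
		case <-ctx.Done():
			return InstanceMetadata{}
		case <-time.After(backoff):
		}
	}
	return InstanceMetadata{}
}
```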
Thanks!
Please fix the panic.
Additionally, please harden some of the backup integration tests so that they also check the generated manifest content.
Backup integration tests are in pkg/service/backup/service_backup_integration_test.go.
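A hedged sketch of what such a check could look like; the `manifestNodeInfo` struct and `assertManifestFields` helper are invented for illustration and are not the types or helpers actually used in the integration tests, only the JSON field names come from this PR:

```go
package backup_test

import "testing"

// manifestNodeInfo mirrors only the fields this PR adds to the manifest.
type manifestNodeInfo struct {
	ClusterID     string `json:"cluster_id"`
	DC            string `json:"dc"`
	Rack          string `json:"rack"`
	NodeID        string `json:"node_id"`
	TaskID        string `json:"task_id"`
	SnapshotTag   string `json:"snapshot_tag"`
	ShardCount    int    `json:"shard_count"`
	CPUCount      int    `json:"cpu_count"`
	StorageSize   uint64 `json:"storage_size"`
	CloudProvider string `json:"cloud_provider"`
	InstanceType  string `json:"instance_type"`
}

// assertManifestFields checks that the new identification and capacity fields
// were actually written into the downloaded manifest.
func assertManifestFields(t *testing.T, m manifestNodeInfo) {
	t.Helper()
	if m.ClusterID == "" || m.NodeID == "" || m.TaskID == "" || m.SnapshotTag == "" {
		t.Errorf("manifest is missing identification fields: %+v", m)
	}
	if m.ShardCount <= 0 || m.CPUCount <= 0 || m.StorageSize == 0 {
		t.Errorf("manifest is missing node capacity fields: %+v", m)
	}
	// cloud_provider and instance_type may legitimately be empty on-premise,
	// so they are not asserted here.
}
```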
👍 from my side, but please address the comment about manifest validation in the integration tests and get the +1 from @Michal-Leszczynski.
LGTM!
@karol-kokoszka @Michal-Leszczynski I've updated the integration test with additional checks for the new manifest fields. If you don't have any comments on it, I can proceed with squashing and merging 😄
Yes, proceed.
This fixes the issue where the context passed to GetInstanceMetadata is canceled before any of the provider functions has returned.
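A hedged illustration of the kind of context check this refers to, reusing the hypothetical `InstanceMetadata` type from the earlier sketch; the provider-racing structure is an assumption and may not match the actual cloudmeta implementation:

```go
package cloudmeta

import "context"

// getInstanceMetadata queries all providers concurrently and returns the
// first successful answer, but also honors ctx cancellation even if no
// provider has returned yet.
func getInstanceMetadata(ctx context.Context, providers []func(context.Context) (InstanceMetadata, error)) (InstanceMetadata, error) {
	results := make(chan InstanceMetadata, len(providers))
	errs := make(chan error, len(providers))
	for _, p := range providers {
		go func(fetch func(context.Context) (InstanceMetadata, error)) {
			meta, err := fetch(ctx)
			if err != nil {
				errs <- err
				return
			}
			results <- meta
		}(p)
	}

	var failed int
	for {
		select {
		case <-ctx.Done():
			// The fix described above: stop waiting as soon as the caller's
			// context is canceled instead of blocking until a provider returns.
			return InstanceMetadata{}, ctx.Err()
		case meta := <-results:
			return meta, nil
		case err := <-errs:
			failed++
			if failed == len(providers) {
				return InstanceMetadata{}, err
			}
		}
	}
}
```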
This extends the agent server with a `/cloud/metadata` endpoint, which returns instance details such as `cloud_provider` and `instance_type`.
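A self-contained sketch of how such an endpoint could be wired and what its JSON body could look like; the handler name, the stubbed values, the port, and the exact response schema are assumptions, with only the two field names taken from the commit above:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// instanceMetadata is a hypothetical response shape for /cloud/metadata.
type instanceMetadata struct {
	CloudProvider string `json:"cloud_provider"` // aws|gcp|azure, or "" on-premise
	InstanceType  string `json:"instance_type"`  // e.g. t2.nano, or "" on-premise
}

// cloudMetadataHandler returns stubbed metadata; in the real agent this would
// come from the cloud metadata service, with empty fields for on-premise nodes.
func cloudMetadataHandler(w http.ResponseWriter, r *http.Request) {
	meta := instanceMetadata{CloudProvider: "aws", InstanceType: "t2.nano"}
	w.Header().Set("Content-Type", "application/json")
	// Example body: {"cloud_provider":"aws","instance_type":"t2.nano"}
	_ = json.NewEncoder(w).Encode(meta)
}

func main() {
	http.HandleFunc("/cloud/metadata", cloudMetadataHandler)
	_ = http.ListenAndServe(":8080", nil) // port is illustrative
}
```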
This adds the following data to the backup manifest:
General:
cluster_id: uuid of the cluster
dc: data center name
rack: rack from the scylla configuration
node_id: id of the scylla node (equals the host id)
task_id: uuid of the backup task
snapshot_tag: snapshot tag
shard_count: number of shards on the scylla node
cpu_count: number of cpus on the scylla node
storage_size: total size of the disk in bytes
Instance Details:
cloud_provider: aws|gcp|azure or empty in case of on-premise
instance_type: instance type, e.g. t2.nano or empty when on-premise
Fixes: #4130
Force-pushed from fcbd42e to a592901
This adds the following data to the backup manifest:
General:
cluster_id: uuid of the cluster
dc: data center name
rack: rack from the scylla configuration
node_id: id of the scylla node (equals the host id)
task_id: uuid of the backup task
snapshot_tag: snapshot tag
shard_count: number of shards on the scylla node
cpu_count: number of cpus available on the scylla node
storage_size: total size of the disk in bytes
Instance Details:
cloud_provider: aws|gcp|azure or empty in case of on-premise
instance_type: instance type, e.g. t2.nano or empty when on-premise
This also includes a bug fix in cloudmeta.GetInstanceMetadata(ctx): it adds a check for ctx cancellation.
This also includes fixes in unit tests related to NodeInfo.
Fixes: #4130
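A sketch of how the new manifest data could be modeled; the field names and comments come from the list above, while the struct names, field types, and the nesting of the instance details are assumptions and do not mirror the actual manifest types in the codebase:

```go
// manifestInfo is a hypothetical model of the new general manifest fields.
type manifestInfo struct {
	ClusterID   string `json:"cluster_id"`   // uuid of the cluster
	DC          string `json:"dc"`           // data center name
	Rack        string `json:"rack"`         // rack from the scylla configuration
	NodeID      string `json:"node_id"`      // id of the scylla node (equals the host id)
	TaskID      string `json:"task_id"`      // uuid of the backup task
	SnapshotTag string `json:"snapshot_tag"` // snapshot tag
	ShardCount  int    `json:"shard_count"`  // number of shards on the scylla node
	CPUCount    int    `json:"cpu_count"`    // number of cpus available on the scylla node
	StorageSize uint64 `json:"storage_size"` // total size of the disk in bytes

	InstanceDetails instanceDetails `json:"instance_details"`
}

// instanceDetails holds cloud-specific data; both fields are empty on-premise.
type instanceDetails struct {
	CloudProvider string `json:"cloud_provider"` // aws|gcp|azure or ""
	InstanceType  string `json:"instance_type"`  // e.g. t2.nano or ""
}
```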
Please make sure that: