-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use Scylla API for backup #4169
Conversation
7022961
to
6a37901
Compare
@karol-kokoszka @VAveryanov8 so the idea is that the |
* fix(backup_test): add missing 'Integration' suffix to tests Some tests were missing the Integration suffix in their names. This resulted in not including them in the 'make pkg-integration-test' command used when running tests on gh actions. * refactor(testutils): export CheckAnyConstraint It is also useful for backup svc tests. * fix(backup_test): skip TestBackupSkipSchemaIntegration for older Scylla versions
This adds /cloud/metadata api call to agent which should return cloud instance metadata, such as instance_type and cloud_provider. Refs: #4130
This log does not contain any useful information, but it clogs the log files since checking for closest DC is done during every fresh scyllaclient creation, which is done by the config cache service every minute.
For Scylla to access object storage, it needs to be configured in the 'object_storage.yaml' config file.
A separate column for Scylla task ID is needed because: - it has a different type from agent job ID - it make it clear which API was used
Those methods consist of both: - direct Scylla backup API call - helper Scylla Task Manager API calls
f8a0823
to
c239007
Compare
Nice! Looks good to me! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will continue with the review, but leaving one comment now. Need to understand why SM must pass s3 endpoint.
I feel this is the security threat (but maybe I'm wrong, I'm not security expert - convince me that it is not the threat).
default: | ||
// Endpoint has already been resolved on SM side | ||
resolvedEndpoint = provider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it mean that if I call agent with endpoint query param = "http://169.254.169.254/storage_service/backup" then "169.254.169.254" is going to be passed to scylla server as the AWS endpoint to scylla server ?
Then this IP is consumed by scylla server to query S3 API, right ?
I think it's a threat called SSRF.
This default must be either change to return explicit error, or Scylla Manager Agent must be aware of whitelisted IP addresses (passed with "endpoint" query param).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see it's actually the main route. The input must be validated then. Is it possible to provide whitelisted IPs/Hosts ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, does it mean that the S3 enpoint url is not known by Scylla ?
Why ?
Isn't it stored in some scylla configuration ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@regevran why ScyllaAPI must be informed about the endpoint ? Cannot it be read by scylla directly ? Using this info https://github.com/scylladb/scylladb/blob/92db2eca0b8ab0a4fa2571666a7fe2d2b07c697b/docs/dev/object_storage.md?plain=1#L29-L39 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we covered those points during the meeting, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. These are comments from before the sync. Waiting for the output from the meeting then.
@VAveryanov8 did you leave the comments/questions ? I don't see them. |
When working with Rclone, SM specifies just the provider name, and Rclone (with agent config) resolves it internally to the correct endpoint. This made it so user didn't need to specify the exact endpoint when running SM backup/restore tasks. When working with Scylla, SM needs to specify resolved host name on its own. This should be the same name as specified in 'object_storage.yaml' (See https://github.com/scylladb/scylladb/blob/92db2eca0b8ab0a4fa2571666a7fe2d2b07c697b/docs/dev/object_storage.md?plain=1#L29-L39). In order to maximize compatibility and UX, we still want it to be possible to specify just the provider name when running backup/restore. In such case, SM sends provider name as the "endpoint" query param, which is resolved by agent to proper host name when forwarding request to Scylla. Different "endpoint" query params are not resolved. Note that resolving "endpoint" query param in the proxy is just for the UX, so it might not work correctly in all the cases. In order to ensure correctness, "endpoint" should be specified directly by SM user so that no resolving is needed.
Scylla backup API can be used when: - node exposes Scylla backup API - s3 is the used provider - backup won't create versioned files
Some tests used interceptor for given paths in order to wait/block/check some API calls. Those interceptors were updated to also look for Scylla backup API paths.
Using Scylla backup API does not result in changes to Rclone transfers, rate limiting or cpu pinning, so it shouldn't be checked as a part of the restore test.
This is a simple test for checking whether the correct API is used during the backup.
c239007
to
ecfeb80
Compare
if !q.Has(endpointQueryKey) { | ||
logger.Error(r.Context(), "Expected endpoint query param, but didn't receive it", | ||
"query", r.URL.RawQuery) | ||
return | ||
} | ||
|
||
// Resolve provider to the proper endpoint | ||
provider := q.Get(endpointQueryKey) | ||
resolvedEndpoint, err := resolver(provider) | ||
if err != nil { | ||
logger.Error(r.Context(), "Failed to resolve provider to endpoint", "provider", provider, "err", err) | ||
return | ||
} | ||
|
||
resolvedHost, err := endpointToHostName(resolvedEndpoint) | ||
if err != nil { | ||
logger.Error(r.Context(), "Failed to convert endpoint to host name", | ||
"endpoint", resolvedEndpoint, | ||
"err", err) | ||
return | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when there is an error raised in the Directory that is expected to validate&modify the incoming request ?
I understand that the error is just logged, but proceeds with the call to the scylla server with the current implementation.
This behavior means that the inproper request is proxied still to the scylla server.
It's much more efficient to stop processing the invalid request straight in the proxy instead of forwarding it.
To do that, you can introduce the middleware that is reponsible for validation +
http.Error(w, "Missing required query parameter: endpoint", http.StatusBadRequest)
return
The middleware should evaluate the correct endpoint and save it to conext.
Then, the director just checks the context to see if there is something to change.
Director, can skip the validation
This comment is only valid if you proceed with resolving endpoint in agent. The topic to discuss is that is can be safer/better to have the identifiers of endpoints in object storage and use it in sm configuration.
pkg/scyllaclient/client_agent.go
Outdated
// SupportsScyllaBackupRestoreAPI returns whether node exposes backup/restore API | ||
// that can be used instead of the Rclone API for backup/restore tasks. | ||
func (ni *NodeInfo) SupportsScyllaBackupRestoreAPI() (bool, error) { | ||
// Check master builds | ||
if scyllaversion.MasterVersion(ni.ScyllaVersion) { | ||
return true, nil | ||
} | ||
// Check OSS | ||
supports, err := scyllaversion.CheckConstraint(ni.ScyllaVersion, ">= 6.3, < 2000") | ||
if err != nil { | ||
return false, errors.Errorf("Unsupported Scylla version: %s", ni.ScyllaVersion) | ||
} | ||
if supports { | ||
return true, nil | ||
} | ||
// Check ENT | ||
supports, err = scyllaversion.CheckConstraint(ni.ScyllaVersion, ">= 2024.3") | ||
if err != nil { | ||
return false, errors.Errorf("Unsupported Scylla version: %s", ni.ScyllaVersion) | ||
} | ||
return supports, nil | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, it's basing on the release version.
It's confirmed that Scylla exposes wtih appi beginning with 6.3 and 2024.3 ?
Is there any other possibility of checking the availablity of this endpoint ? Maybe simple HEAD request to the endpoint ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not confirmed, we would need to come back to this part after an actual Scylla release supports native backup/restore API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we assume that if endpoint is available then it means scylla supports it ? If so, then I suggest to probe api with HEAD request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I like the idea of not relying on version checks, but proposed approach requires Scylla to handle HEAD requests and I don't think that's the case. At least I can't see them in the swagger definitions or swagger UI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, they don't support HEAD requests, at least this is what I see trying to curl
for it from localhost.
GET responds OK, HEAD gives 404 (not found).
Then maybe, there is some configuration parameter available to GET which can answer whether the feature is available ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure about this suggestion. Wouldn't it be difficult to distinguish potential errors coming from API not being exposed and connectivity issues?
Also, the initial idea was to check Scylla swagger definitions, but it's currently not possible due to scylladb/scylladb#16424.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you have connectivity issue, then you got network level error. You won't get response with HTTP code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then maybe, there is some configuration parameter available to GET which can answer whether the feature is available ?
Could you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If core uses some configuration parameter related to the backup API in scylla that is exposed together with this API, there could be simple GET performed on this config param.
404 code means that the API is not available.
But, if the backup API means that there are only POST methods exposed that are changing the state then I think you need to stay with the current approach where you basically checks the versions number.
nodeConfig, err := s.configCache.ReadAll(clusterID) | ||
if err != nil { | ||
return errors.Wrap(err, "read all nodes config") | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's safer and more accurate to force config cache to update cluster configuration first ? And then call to Read All.
func (w *worker) useScyllaBackupAPI(ctx context.Context, d snapshotDir, hi hostInfo) (bool, error) { | ||
// Scylla backup API does not handle creation of versioned files. | ||
if d.willCreateVersioned { | ||
return false, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I miss more debug information like why ScyllaBackupAPI is not going to be used.
My proposal is to either replace (bool, error) returns with (error) return, where error!=nil means not supported. Error message can be logged (info) as a reasoning then.
I see you log which API is going to be used, so RClone api log info can be extended with reasoning,.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like mixing actual errors with checks, so I just added logging on failed checks.
Print("Validate state after backup") | ||
validateState(h.srcCluster, "repair", true, 3, 88, pinnedCPU) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why it's removed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When using Scylla backup API we don't need to alter transfers, rate limit, cpu pinning etc, so I deleted this check out of laziness. But I can bring it back for testing scenarios using Rclone API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, please
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Michal-Leszczynski Thanks for the PR! Looks nice. There are few comments to discuss.
The most important is to figure out what to do with these endpoints configured for object storage.
I'm setting "Request changes" to block it until it's confirmed if core can use IDs for example.
This PR starts using Scylla backup API in SM backup task!
It is mostly complete and can be tested, but there are 3 issues that were discovered during development:
scylla-manager-agent.yaml
andobject-storage.yaml
. It should be enough for default setups and tests, but we will be on the safe side when we fix this issue. Take a look at bd0cf1a for more info.In terms of the general overview of this PR - the main objective was to fix replace the
/agent/rclone/sync/movedir
Rclone API with the/storage_service/backup
Scylla API - nothing more.Scylla API can be used when:
Checking whether Scylla API can be used is done separately per node/snapshot_dir.
Luckily, things like pause/resume/progress does not seem like they need additional work in the scope of this issue.
Also, for now Scylla versions which are supposed to support Scylla backup/restore API are:
master
6.3
2024.3
Fixes #4143
Fixes #4138
Fixes #4141