Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Scylla API for backup #4169

Merged
merged 16 commits into from
Jan 15, 2025
Merged

Conversation

Michal-Leszczynski
Copy link
Collaborator

@Michal-Leszczynski Michal-Leszczynski commented Dec 16, 2024

This PR starts using Scylla backup API in SM backup task!
It is mostly complete and can be tested, but there are 3 issues that were discovered during development:

In terms of the general overview of this PR - the main objective was to fix replace the /agent/rclone/sync/movedir Rclone API with the /storage_service/backup Scylla API - nothing more.
Scylla API can be used when:

  • node exposes Scylla backup API
  • s3 is the used provider
  • backup won't create versioned files

Checking whether Scylla API can be used is done separately per node/snapshot_dir.

Luckily, things like pause/resume/progress does not seem like they need additional work in the scope of this issue.

Also, for now Scylla versions which are supposed to support Scylla backup/restore API are:

  • master
  • 6.3
  • 2024.3

Fixes #4143
Fixes #4138
Fixes #4141

@Michal-Leszczynski Michal-Leszczynski force-pushed the ml/backup-scylla-api branch 5 times, most recently from 7022961 to 6a37901 Compare December 18, 2024 14:43
@Michal-Leszczynski Michal-Leszczynski marked this pull request as ready for review December 18, 2024 15:31
@Michal-Leszczynski Michal-Leszczynski changed the base branch from master to ml/scylla-api December 18, 2024 15:32
@Michal-Leszczynski
Copy link
Collaborator Author

@karol-kokoszka @VAveryanov8 so the idea is that the ml/scylla-api will be the branch for the scylla api milestone.
Please take a look!

Michal-Leszczynski and others added 7 commits January 2, 2025 13:03
* fix(backup_test): add missing 'Integration' suffix to tests

Some tests were missing the Integration suffix in their names.
This resulted in not including them in the 'make pkg-integration-test'
command used when running tests on gh actions.

* refactor(testutils): export CheckAnyConstraint

It is also useful for backup svc tests.

* fix(backup_test): skip TestBackupSkipSchemaIntegration for older Scylla versions
This adds /cloud/metadata api call to agent which should return cloud
instance metadata, such as instance_type and cloud_provider.

Refs: #4130
This log does not contain any useful information, but it clogs
the log files since checking for closest DC is done during
every fresh scyllaclient creation, which is done by the
config cache service every minute.
…#4185)

It turns out that Scylla 2024.2 does not expose this API.
For now, it's not know which enterprise release will contain it,
so we need to fall back to the CQL workaround.

Fixes #4183
For Scylla to access object storage, it needs to be configured
in the 'object_storage.yaml' config file.
A separate column for Scylla task ID is needed because:
- it has a different type from agent job ID
- it make it clear which API was used
Those methods consist of both:
- direct Scylla backup API call
- helper Scylla Task Manager API calls
@VAveryanov8
Copy link
Collaborator

Nice! Looks good to me!
Just a few questions/nits

Copy link
Collaborator

@karol-kokoszka karol-kokoszka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will continue with the review, but leaving one comment now. Need to understand why SM must pass s3 endpoint.
I feel this is the security threat (but maybe I'm wrong, I'm not security expert - convince me that it is not the threat).

Comment on lines +185 to +187
default:
// Endpoint has already been resolved on SM side
resolvedEndpoint = provider
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it mean that if I call agent with endpoint query param = "http://169.254.169.254/storage_service/backup" then "169.254.169.254" is going to be passed to scylla server as the AWS endpoint to scylla server ?

Then this IP is consumed by scylla server to query S3 API, right ?

I think it's a threat called SSRF.

This default must be either change to return explicit error, or Scylla Manager Agent must be aware of whitelisted IP addresses (passed with "endpoint" query param).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see it's actually the main route. The input must be validated then. Is it possible to provide whitelisted IPs/Hosts ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, does it mean that the S3 enpoint url is not known by Scylla ?
Why ?
Isn't it stored in some scylla configuration ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@regevran why ScyllaAPI must be informed about the endpoint ? Cannot it be read by scylla directly ? Using this info https://github.com/scylladb/scylladb/blob/92db2eca0b8ab0a4fa2571666a7fe2d2b07c697b/docs/dev/object_storage.md?plain=1#L29-L39 ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we covered those points during the meeting, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. These are comments from before the sync. Waiting for the output from the meeting then.

@karol-kokoszka
Copy link
Collaborator

@VAveryanov8 did you leave the comments/questions ? I don't see them.

pkg/cmd/agent/router.go Show resolved Hide resolved
pkg/cmd/agent/router.go Show resolved Hide resolved
When working with Rclone, SM specifies just the provider name,
and Rclone (with agent config) resolves it internally to the correct endpoint.
This made it so user didn't need to specify the exact endpoint when running SM backup/restore tasks.

When working with Scylla, SM needs to specify resolved host name on its own.
This should be the same name as specified in 'object_storage.yaml'
(See https://github.com/scylladb/scylladb/blob/92db2eca0b8ab0a4fa2571666a7fe2d2b07c697b/docs/dev/object_storage.md?plain=1#L29-L39).

In order to maximize compatibility and UX, we still want it to be possible
to specify just the provider name when running backup/restore.
In such case, SM sends provider name as the "endpoint" query param,
which is resolved by agent to proper host name when forwarding request to Scylla.
Different "endpoint" query params are not resolved.

Note that resolving "endpoint" query param in the proxy is just for the UX,
so it might not work correctly in all the cases.
In order to ensure correctness, "endpoint" should be specified directly by SM user
so that no resolving is needed.
Scylla backup API can be used when:
- node exposes Scylla backup API
- s3 is the used provider
- backup won't create versioned files
This commit adds code for using Scylla backup API.
Luckily for us, handling pause/resume and progress
is analogous to the Rclone API handling.

Fixes #4143
Fixes #4138
Fixes #4141
Some tests used interceptor for given paths
in order to wait/block/check some API calls.
Those interceptors were updated to also look
for Scylla backup API paths.
Using Scylla backup API does not result in changes
to Rclone transfers, rate limiting or cpu pinning,
so it shouldn't be checked as a part of the restore test.
This is a simple test for checking whether the correct API
is used during the backup.
Comment on lines +128 to +149
if !q.Has(endpointQueryKey) {
logger.Error(r.Context(), "Expected endpoint query param, but didn't receive it",
"query", r.URL.RawQuery)
return
}

// Resolve provider to the proper endpoint
provider := q.Get(endpointQueryKey)
resolvedEndpoint, err := resolver(provider)
if err != nil {
logger.Error(r.Context(), "Failed to resolve provider to endpoint", "provider", provider, "err", err)
return
}

resolvedHost, err := endpointToHostName(resolvedEndpoint)
if err != nil {
logger.Error(r.Context(), "Failed to convert endpoint to host name",
"endpoint", resolvedEndpoint,
"err", err)
return
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when there is an error raised in the Directory that is expected to validate&modify the incoming request ?
I understand that the error is just logged, but proceeds with the call to the scylla server with the current implementation.
This behavior means that the inproper request is proxied still to the scylla server.

It's much more efficient to stop processing the invalid request straight in the proxy instead of forwarding it.
To do that, you can introduce the middleware that is reponsible for validation +

http.Error(w, "Missing required query parameter: endpoint", http.StatusBadRequest)
return

The middleware should evaluate the correct endpoint and save it to conext.
Then, the director just checks the context to see if there is something to change.

Director, can skip the validation
This comment is only valid if you proceed with resolving endpoint in agent. The topic to discuss is that is can be safer/better to have the identifiers of endpoints in object storage and use it in sm configuration.

Comment on lines 257 to 279
// SupportsScyllaBackupRestoreAPI returns whether node exposes backup/restore API
// that can be used instead of the Rclone API for backup/restore tasks.
func (ni *NodeInfo) SupportsScyllaBackupRestoreAPI() (bool, error) {
// Check master builds
if scyllaversion.MasterVersion(ni.ScyllaVersion) {
return true, nil
}
// Check OSS
supports, err := scyllaversion.CheckConstraint(ni.ScyllaVersion, ">= 6.3, < 2000")
if err != nil {
return false, errors.Errorf("Unsupported Scylla version: %s", ni.ScyllaVersion)
}
if supports {
return true, nil
}
// Check ENT
supports, err = scyllaversion.CheckConstraint(ni.ScyllaVersion, ">= 2024.3")
if err != nil {
return false, errors.Errorf("Unsupported Scylla version: %s", ni.ScyllaVersion)
}
return supports, nil
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, it's basing on the release version.
It's confirmed that Scylla exposes wtih appi beginning with 6.3 and 2024.3 ?
Is there any other possibility of checking the availablity of this endpoint ? Maybe simple HEAD request to the endpoint ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not confirmed, we would need to come back to this part after an actual Scylla release supports native backup/restore API.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we assume that if endpoint is available then it means scylla supports it ? If so, then I suggest to probe api with HEAD request.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I like the idea of not relying on version checks, but proposed approach requires Scylla to handle HEAD requests and I don't think that's the case. At least I can't see them in the swagger definitions or swagger UI.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, they don't support HEAD requests, at least this is what I see trying to curl for it from localhost.
GET responds OK, HEAD gives 404 (not found).

Then maybe, there is some configuration parameter available to GET which can answer whether the feature is available ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this suggestion. Wouldn't it be difficult to distinguish potential errors coming from API not being exposed and connectivity issues?
Also, the initial idea was to check Scylla swagger definitions, but it's currently not possible due to scylladb/scylladb#16424.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have connectivity issue, then you got network level error. You won't get response with HTTP code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then maybe, there is some configuration parameter available to GET which can answer whether the feature is available ?

Could you elaborate?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If core uses some configuration parameter related to the backup API in scylla that is exposed together with this API, there could be simple GET performed on this config param.
404 code means that the API is not available.

But, if the backup API means that there are only POST methods exposed that are changing the state then I think you need to stay with the current approach where you basically checks the versions number.

Comment on lines +680 to +684
nodeConfig, err := s.configCache.ReadAll(clusterID)
if err != nil {
return errors.Wrap(err, "read all nodes config")
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's safer and more accurate to force config cache to update cluster configuration first ? And then call to Read All.

func (w *worker) useScyllaBackupAPI(ctx context.Context, d snapshotDir, hi hostInfo) (bool, error) {
// Scylla backup API does not handle creation of versioned files.
if d.willCreateVersioned {
return false, nil
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I miss more debug information like why ScyllaBackupAPI is not going to be used.
My proposal is to either replace (bool, error) returns with (error) return, where error!=nil means not supported. Error message can be logged (info) as a reasoning then.
I see you log which API is going to be used, so RClone api log info can be extended with reasoning,.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like mixing actual errors with checks, so I just added logging on failed checks.

Comment on lines 614 to 616
Print("Validate state after backup")
validateState(h.srcCluster, "repair", true, 3, 88, pinnedCPU)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why it's removed ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When using Scylla backup API we don't need to alter transfers, rate limit, cpu pinning etc, so I deleted this check out of laziness. But I can bring it back for testing scenarios using Rclone API.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, please

Copy link
Collaborator

@karol-kokoszka karol-kokoszka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Michal-Leszczynski Thanks for the PR! Looks nice. There are few comments to discuss.
The most important is to figure out what to do with these endpoints configured for object storage.
I'm setting "Request changes" to block it until it's confirmed if core can use IDs for example.

@Michal-Leszczynski Michal-Leszczynski merged commit a7f380c into ml/scylla-api Jan 15, 2025
51 checks passed
@Michal-Leszczynski Michal-Leszczynski deleted the ml/backup-scylla-api branch January 15, 2025 11:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants