Use Scylla API for backup #4169

Michal-Leszczynski · 2024-12-16T11:51:40Z

This PR starts using Scylla backup API in SM backup task!
It is mostly complete and can be tested, but there are 3 issues that were discovered during development:

- Extend --location to accept raw endpoint instead of just the provider #4161: right now SM tries its best to resolve the endpoint from scylla-manager-agent.yaml and object_storage.yaml. This should be enough for default setups and tests, but we will be on the safe side once this issue is fixed. Take a look at bd0cf1a for more info.
- Don't identify host by IP string in config cache svc #4181: this can create issues in other places in the code and forced a workaround in the backup code.
- Run SM tests against SSTables with UUID ID #4182: this makes it really annoying to test Rclone/Scylla API against mixed integer/UUID sstables.

In terms of the general overview of this PR, the main objective was to replace the /agent/rclone/sync/movedir Rclone API with the /storage_service/backup Scylla API, nothing more.
Scylla API can be used when:

- node exposes Scylla backup API
- s3 is the used provider
- backup won't create versioned files

Checking whether Scylla API can be used is done separately per node/snapshot_dir.

Luckily, things like pause/resume/progress do not seem to need additional work in the scope of this issue.

Also, for now Scylla versions which are supposed to support Scylla backup/restore API are:

- master
- 6.3
- 2024.3

Fixes #4143
Fixes #4138
Fixes #4141

Michal-Leszczynski · 2024-12-18T15:34:54Z

@karol-kokoszka @VAveryanov8 so the idea is that the ml/scylla-api will be the branch for the scylla api milestone.
Please take a look!

* fix(backup_test): add missing 'Integration' suffix to tests
  Some tests were missing the Integration suffix in their names. This resulted in not including them in the 'make pkg-integration-test' command used when running tests on gh actions.
* refactor(testutils): export CheckAnyConstraint
  It is also useful for backup svc tests.
* fix(backup_test): skip TestBackupSkipSchemaIntegration for older Scylla versions


VAveryanov8 · 2025-01-08T16:08:51Z

Nice! Looks good to me!
Just a few questions/nits

karol-kokoszka

I will continue with the review, but I'm leaving one comment now. I need to understand why SM must pass the s3 endpoint.
I feel this is a security threat (but maybe I'm wrong, I'm not a security expert, so convince me that it is not a threat).

karol-kokoszka · 2025-01-08T17:51:22Z

pkg/cmd/agent/router.go

+		default:
+			// Endpoint has already been resolved on SM side
+			resolvedEndpoint = provider


Does it mean that if I call the agent with the endpoint query param = "http://169.254.169.254/storage_service/backup", then "169.254.169.254" is going to be passed to the Scylla server as the AWS endpoint?

Then this IP is used by the Scylla server to query the S3 API, right?

I think it's a threat called SSRF.

This default must either be changed to return an explicit error, or Scylla Manager Agent must be aware of whitelisted IP addresses (passed with the "endpoint" query param).

OK, I see it's actually the main route. The input must be validated then. Is it possible to provide whitelisted IPs/hosts?

BTW, does it mean that the S3 endpoint URL is not known by Scylla?
Why?
Isn't it stored in some Scylla configuration?

@regevran why must the Scylla API be informed about the endpoint? Can't it be read by Scylla directly, using this info: https://github.com/scylladb/scylladb/blob/92db2eca0b8ab0a4fa2571666a7fe2d2b07c697b/docs/dev/object_storage.md?plain=1#L29-L39 ?

I think that we covered those points during the meeting, right?

Right. These are comments from before the sync. Waiting for the output from the meeting then.
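
To make the allowlist idea from this thread concrete, here is a minimal sketch in Go. It is not code from this PR: the function name checkEndpointHost and the allowed set are hypothetical, and the sketch assumes the agent could build that set from hosts the operator already configured for object storage. The link-local guard is an extra assumption added for illustration.

```go
package main

import (
	"fmt"
	"net"
)

// checkEndpointHost reports an error unless host may be forwarded to Scylla.
// allowed is a hypothetical set of host names known to the agent, for
// example the endpoints an operator listed in the object storage config.
func checkEndpointHost(host string, allowed map[string]struct{}) error {
	// Defense in depth (assumption): never forward link-local addresses
	// such as the cloud metadata service at 169.254.169.254, even if
	// they were added to the allowlist by mistake.
	if ip := net.ParseIP(host); ip != nil && ip.IsLinkLocalUnicast() {
		return fmt.Errorf("endpoint host %q is a link-local address", host)
	}
	if _, ok := allowed[host]; !ok {
		return fmt.Errorf("endpoint host %q is not in the allowlist", host)
	}
	return nil
}
```

With such a check in the agent, a request carrying an unexpected "endpoint" value would be rejected before anything is proxied to the Scylla server.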

karol-kokoszka · 2025-01-08T18:15:14Z

@VAveryanov8 did you leave the comments/questions ? I don't see them.

pkg/cmd/agent/router.go

When working with Rclone, SM specifies just the provider name, and Rclone (with agent config) resolves it internally to the correct endpoint. Because of that, the user didn't need to specify the exact endpoint when running SM backup/restore tasks.

When working with Scylla, SM needs to specify the resolved host name on its own. This should be the same name as specified in 'object_storage.yaml' (see https://github.com/scylladb/scylladb/blob/92db2eca0b8ab0a4fa2571666a7fe2d2b07c697b/docs/dev/object_storage.md?plain=1#L29-L39).

In order to maximize compatibility and UX, we still want it to be possible to specify just the provider name when running backup/restore. In such a case, SM sends the provider name as the "endpoint" query param, which is resolved by the agent to the proper host name when forwarding the request to Scylla. Other values of the "endpoint" query param are not resolved.

Note that resolving the "endpoint" query param in the proxy is just for the UX, so it might not work correctly in all cases. In order to ensure correctness, "endpoint" should be specified directly by the SM user so that no resolving is needed.
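
The last step of that resolution, turning an endpoint into the bare host name that Scylla expects, can be sketched as follows. This is only an illustration: hostFromEndpoint is a hypothetical helper, not the endpointToHostName function from the PR, and it assumes the endpoint is either a full URL or a plain host[:port] string.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// hostFromEndpoint extracts the host name (without scheme, port or path)
// from an endpoint such as "http://192.168.200.99:9000".
// Hypothetical helper written for illustration only.
func hostFromEndpoint(endpoint string) (string, error) {
	// url.Parse treats "host:port" without a scheme as a path or scheme,
	// so prefix scheme-less input with "//" to force authority parsing.
	if !strings.Contains(endpoint, "://") {
		endpoint = "//" + endpoint
	}
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", fmt.Errorf("parse endpoint: %w", err)
	}
	host := u.Hostname()
	if host == "" {
		return "", fmt.Errorf("endpoint %q has no host", endpoint)
	}
	return host, nil
}
```

The returned host name is what would have to match the entry in 'object_storage.yaml'; whether ports or schemes also need to match is not stated in this PR.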


karol-kokoszka · 2025-01-09T12:12:42Z

pkg/cmd/agent/router.go

+		if !q.Has(endpointQueryKey) {
+			logger.Error(r.Context(), "Expected endpoint query param, but didn't receive it",
+				"query", r.URL.RawQuery)
+			return
+		}
+
+		// Resolve provider to the proper endpoint
+		provider := q.Get(endpointQueryKey)
+		resolvedEndpoint, err := resolver(provider)
+		if err != nil {
+			logger.Error(r.Context(), "Failed to resolve provider to endpoint", "provider", provider, "err", err)
+			return
+		}
+
+		resolvedHost, err := endpointToHostName(resolvedEndpoint)
+		if err != nil {
+			logger.Error(r.Context(), "Failed to convert endpoint to host name",
+				"endpoint", resolvedEndpoint,
+				"err", err)
+			return
+		}
+


What happens when an error is raised in the Director, which is expected to validate and modify the incoming request?
As I understand it, with the current implementation the error is only logged and the call still proceeds to the Scylla server.
This means that the improper request is still proxied to the Scylla server.

It's much more efficient to stop processing the invalid request in the proxy itself instead of forwarding it.
To do that, you can introduce a middleware that is responsible for validation and rejects the request with:

http.Error(w, "Missing required query parameter: endpoint", http.StatusBadRequest)
return

The middleware should evaluate the correct endpoint and save it to the context.
Then the Director just checks the context to see if there is something to change.

The Director can then skip the validation.
This comment is only valid if you proceed with resolving the endpoint in the agent. The topic to discuss is that it can be safer/better to have identifiers of endpoints in object storage and use them in the SM configuration.
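
The middleware suggested in this comment could look roughly like the sketch below. It is an assumption-laden illustration, not the agent's actual router code: resolveFunc, resolveEndpoint, endpointMiddleware, resolvedHost and ctxKey are all hypothetical names, and the resolver is passed in rather than taken from the agent config.

```go
package main

import (
	"context"
	"errors"
	"net/http"
)

// ctxKey is a private context key type, so other packages cannot collide.
type ctxKey struct{}

// resolveFunc maps the raw "endpoint" query value to a host name.
type resolveFunc func(raw string) (string, error)

// resolveEndpoint is the pure validation step used by the middleware.
func resolveEndpoint(q map[string][]string, resolve resolveFunc) (string, error) {
	vals := q["endpoint"]
	if len(vals) == 0 || vals[0] == "" {
		return "", errors.New("missing required query parameter: endpoint")
	}
	return resolve(vals[0])
}

// endpointMiddleware rejects invalid requests with 400 before they reach
// the reverse proxy, and stores the resolved host in the request context.
func endpointMiddleware(resolve resolveFunc, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		host, err := resolveEndpoint(r.URL.Query(), resolve)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		ctx := context.WithValue(r.Context(), ctxKey{}, host)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// resolvedHost is what the Director would call instead of validating again.
func resolvedHost(ctx context.Context) (string, bool) {
	host, ok := ctx.Value(ctxKey{}).(string)
	return host, ok
}
```

The design point is that the Director cannot return an error to the client, while a middleware in front of it can, so validation failures stop at the agent.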

karol-kokoszka · 2025-01-09T12:24:21Z

pkg/scyllaclient/client_agent.go

+// SupportsScyllaBackupRestoreAPI returns whether node exposes backup/restore API
+// that can be used instead of the Rclone API for backup/restore tasks.
+func (ni *NodeInfo) SupportsScyllaBackupRestoreAPI() (bool, error) {
+	// Check master builds
+	if scyllaversion.MasterVersion(ni.ScyllaVersion) {
+		return true, nil
+	}
+	// Check OSS
+	supports, err := scyllaversion.CheckConstraint(ni.ScyllaVersion, ">= 6.3, < 2000")
+	if err != nil {
+		return false, errors.Errorf("Unsupported Scylla version: %s", ni.ScyllaVersion)
+	}
+	if supports {
+		return true, nil
+	}
+	// Check ENT
+	supports, err = scyllaversion.CheckConstraint(ni.ScyllaVersion, ">= 2024.3")
+	if err != nil {
+		return false, errors.Errorf("Unsupported Scylla version: %s", ni.ScyllaVersion)
+	}
+	return supports, nil
+}
+


So it's based on the release version.
Is it confirmed that Scylla exposes this API beginning with 6.3 and 2024.3?
Is there any other way of checking the availability of this endpoint? Maybe a simple HEAD request to the endpoint?

It's not confirmed, we would need to come back to this part after an actual Scylla release supports native backup/restore API.

Can we assume that if endpoint is available then it means scylla supports it ? If so, then I suggest to probe api with HEAD request.

In general I like the idea of not relying on version checks, but proposed approach requires Scylla to handle HEAD requests and I don't think that's the case. At least I can't see them in the swagger definitions or swagger UI.

OK, they don't support HEAD requests, at least this is what I see trying to curl for it from localhost.
GET responds OK, HEAD gives 404 (not found).

Then maybe, there is some configuration parameter available to GET which can answer whether the feature is available ?

Not sure about this suggestion. Wouldn't it be difficult to distinguish potential errors coming from API not being exposed and connectivity issues?
Also, the initial idea was to check Scylla swagger definitions, but it's currently not possible due to scylladb/scylladb#16424.

If you have a connectivity issue, then you get a network-level error. You won't get a response with an HTTP status code.

Then maybe, there is some configuration parameter available to GET which can answer whether the feature is available ?

Could you elaborate?

If core uses some configuration parameter related to the backup API in scylla that is exposed together with this API, there could be simple GET performed on this config param.
404 code means that the API is not available.

But if the backup API only exposes POST methods that change the state, then I think you need to stay with the current approach, where you basically check the version number.
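
The distinction discussed in this thread (a 404 means the API is missing, a transport error means nothing can be concluded) can be sketched as below. This assumes a side-effect-free GET endpoint exists to probe, which the thread does not confirm; the names apiAvailability, classifyProbe, probe and errNoResponse are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

type apiAvailability int

const (
	apiUnknown   apiAvailability = iota // no response or unexpected status
	apiAvailable                        // 2xx response
	apiMissing                          // 404 response
)

// errNoResponse marks probes that failed before any HTTP status arrived.
var errNoResponse = errors.New("no HTTP response")

// classifyProbe interprets the outcome of a GET request to the probed path.
func classifyProbe(status int, err error) (apiAvailability, error) {
	if err != nil {
		// Network-level failure: availability cannot be decided.
		return apiUnknown, fmt.Errorf("%w: %v", errNoResponse, err)
	}
	switch {
	case status == http.StatusNotFound:
		return apiMissing, nil
	case status >= 200 && status < 300:
		return apiAvailable, nil
	default:
		return apiUnknown, nil
	}
}

// probe performs the GET request and classifies the result.
func probe(client *http.Client, url string) (apiAvailability, error) {
	resp, err := client.Get(url)
	if err != nil {
		return classifyProbe(0, err)
	}
	defer resp.Body.Close()
	return classifyProbe(resp.StatusCode, nil)
}
```

If no such read-only endpoint is exposed, this approach does not apply and the version check remains the only option, as noted above.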

karol-kokoszka · 2025-01-09T12:38:14Z

pkg/service/backup/service.go

+	nodeConfig, err := s.configCache.ReadAll(clusterID)
+	if err != nil {
+		return errors.Wrap(err, "read all nodes config")
+	}
+


Maybe it's safer and more accurate to force the config cache to update the cluster configuration first, and then call ReadAll?

pkg/service/backup/service_backup_integration_test.go

pkg/service/backup/worker_deduplicate.go

karol-kokoszka · 2025-01-09T13:02:46Z

pkg/service/backup/worker_scylla_upload.go

+func (w *worker) useScyllaBackupAPI(ctx context.Context, d snapshotDir, hi hostInfo) (bool, error) {
+	// Scylla backup API does not handle creation of versioned files.
+	if d.willCreateVersioned {
+		return false, nil


I'm missing debug information explaining why the Scylla backup API is not going to be used.
My proposal is to replace the (bool, error) return with a single (error) return, where error != nil means not supported. The error message can then be logged (info) as the reasoning.
I see you log which API is going to be used, so the Rclone API log info can be extended with the reasoning.

I don't like mixing actual errors with checks, so I just added logging on failed checks.
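
One way to keep checks separate from real errors while still exposing the reasoning is to return a reason string next to the boolean. The sketch below is hypothetical and does not mirror the SM code: backupTarget and scyllaAPICheck are invented names, and the three conditions are the ones listed in the PR description.

```go
package main

import "fmt"

// backupTarget gathers the inputs of the three checks from the PR
// description. All field names are hypothetical.
type backupTarget struct {
	nodeSupportsAPI     bool   // node exposes the Scylla backup API
	provider            string // e.g. "s3", "gcs", "azure"
	willCreateVersioned bool   // backup would create versioned files
}

// scyllaAPICheck reports whether the Scylla backup API can be used and,
// if not, a human-readable reason that the caller can log at info level.
// A failed check is not an error, so no error value is returned.
func scyllaAPICheck(t backupTarget) (ok bool, reason string) {
	switch {
	case t.willCreateVersioned:
		return false, "backup would create versioned files"
	case t.provider != "s3":
		return false, fmt.Sprintf("provider %q is not s3", t.provider)
	case !t.nodeSupportsAPI:
		return false, "node does not expose Scylla backup API"
	}
	return true, ""
}
```

The caller would then log the reason whenever it falls back to the Rclone API for a given node/snapshot_dir.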

pkg/service/backup/worker_upload.go

karol-kokoszka · 2025-01-09T13:06:42Z

pkg/service/restore/restore_integration_test.go

-	Print("Validate state after backup")
-	validateState(h.srcCluster, "repair", true, 3, 88, pinnedCPU)
-


Why was it removed?

When using Scylla backup API we don't need to alter transfers, rate limit, cpu pinning etc, so I deleted this check out of laziness. But I can bring it back for testing scenarios using Rclone API.

yes, please

schema/v3.5.0.cql

v3/swagger/agent.json

karol-kokoszka

@Michal-Leszczynski Thanks for the PR! Looks nice. There are a few comments to discuss.
The most important is to figure out what to do with these endpoints configured for object storage.
I'm setting "Request changes" to block it until it's confirmed if core can use IDs for example.

… config

Michal-Leszczynski force-pushed the ml/backup-scylla-api branch 5 times, most recently from 7022961 to 6a37901 Compare December 18, 2024 14:43

Michal-Leszczynski marked this pull request as ready for review December 18, 2024 15:31

Michal-Leszczynski requested a review from karol-kokoszka as a code owner December 18, 2024 15:31

Michal-Leszczynski changed the base branch from master to ml/scylla-api December 18, 2024 15:32

Michal-Leszczynski requested a review from VAveryanov8 December 18, 2024 15:33

Michal-Leszczynski and others added 7 commits January 2, 2025 13:03

feat(swagger): adds /cloud/metadata to agent api definition. (#4186)

ef3b968

This adds /cloud/metadata api call to agent which should return cloud instance metadata, such as instance_type and cloud_provider. Refs: #4130

refactor(scyllaclient): reduce info to debug on closest DC (#4189)

5cfe08c

This log does not contain any useful information, but it clogs the log files since checking for closest DC is done during every fresh scyllaclient creation, which is done by the config cache service every minute.

fix(scyllaclient): don't query raft read barrier API on Scylla 2024.2 (…

d2c391c

…#4185) It turns out that Scylla 2024.2 does not expose this API. For now, it's not known which enterprise release will contain it, so we need to fall back to the CQL workaround. Fixes #4183

feat(testing): configure Scylla object storage (minio)

2def7c1

For Scylla to access object storage, it needs to be configured in the 'object_storage.yaml' config file.

feat(schema): add scylla_task_id to backup_run_progress

3e04748

A separate column for Scylla task ID is needed because:
- it has a different type from agent job ID
- it makes it clear which API was used

feat(scyllaclient): add methods for managing Scylla backup API

c55fb1d

Those methods consist of both:
- direct Scylla backup API call
- helper Scylla Task Manager API calls

Michal-Leszczynski force-pushed the ml/backup-scylla-api branch from f8a0823 to c239007 Compare January 8, 2025 08:52

karol-kokoszka reviewed Jan 8, 2025

VAveryanov8 reviewed Jan 8, 2025

Michal-Leszczynski added 6 commits January 9, 2025 12:24

feat(backup): check when Scylla backup API can be used

dc6acb8

Scylla backup API can be used when:
- node exposes Scylla backup API
- s3 is the used provider
- backup won't create versioned files

feat(backup): use Scylla backup API

19b00bf

This commit adds code for using Scylla backup API. Luckily for us, handling pause/resume and progress is analogous to the Rclone API handling. Fixes #4143 Fixes #4138 Fixes #4141

chore(backup_test): adjust tests to Scylla backup API

4d5bee4

Some tests used interceptor for given paths in order to wait/block/check some API calls. Those interceptors were updated to also look for Scylla backup API paths.

chore(restore_test): adjust tests to Scylla backup API

007313b

Using Scylla backup API does not result in changes to Rclone transfers, rate limiting or cpu pinning, so it shouldn't be checked as a part of the restore test.

feat(backup_test): add TestBackupCorrectAPIIntegration

ecfeb80

This is a simple test for checking whether the correct API is used during the backup.

Michal-Leszczynski force-pushed the ml/backup-scylla-api branch from c239007 to ecfeb80 Compare January 9, 2025 11:24