Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't add cluster to SM DB before validation finishes #4029

Merged
merged 5 commits into from
Sep 11, 2024

Conversation

Michal-Leszczynski
Copy link
Collaborator

@Michal-Leszczynski Michal-Leszczynski commented Sep 10, 2024

Commit 5bf6b35 introduced a bug that when adding cluster to SM finished with error,
cluster was still saved in SM DB, but only with the ID and known hosts fields.
That's because method validateHostsConnectivity used discoverAndSetClusterHosts
instead of discoverClusterHosts, which happened before inserting cluster to SM DB.

In case client was mistakenly initialized with "" host, getting host from the pool would panic.
This should be replaced with a simple error. This PR adds additional validation to client config,
which prohibits specifying empty host there. It also makes host pool return an error instead of panic
in case of invalid host pool response.

This PR also includes testes for mentioned issues (validated that those tests weren't passing without the fixes).

Fixes #4028

Commit 5bf6b35 introduced a bug that when adding cluster to SM finished with error,
cluster was still saved in SM DB, but only with the ID and known hosts fields.
That's because method validateHostsConnectivity used discoverAndSetClusterHosts
instead of discoverClusterHosts, which happened before inserting cluster to SM DB.

Fixes #4028
E.g. clusters created before SM 3.2.6 can have empty --host SM DB column.
In case client was mistakenly initialized with "" host, getting host from the pool would panic:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x11da5ac]
goroutine 253 [running]:
github.com/scylladb/scylla-manager/v3/pkg/scyllaclient.NewClient.hostPool.func3(0xc0004345a0)
github.com/scylladb/scylla-manager/v3/pkg/scyllaclient/hostpool.go:32 +0xac

This should be replaced with a simple error.
This commit adds additional validation to client config,
which prohibits specifying empty host there.
It also makes host pool return an error instead of panic
in case of invalid host pool response.

Fixes scylladb/scylla-enterprise#4692
As it is possible that the --host field in SM DB is empty,
we should test that this scenario does not break anything in SM.

Tests scylladb/scylla-enterprise#4692
@Michal-Leszczynski
Copy link
Collaborator Author

@karol-kokoszka This PR is ready for review! Failures are connected to #4032.

@karol-kokoszka
Copy link
Collaborator

Thanks !
I tested it by removing the host value in DB directly:

➜  scylla-manager git:(ml/fix-4028) cqlsh                                                                                                                                                                                                                       <aws:dbaas-lab-developer>
Connected to SM Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select * from scylla_manager.cluster;

 id                                   | auth_token | force_non_ssl_session_port | force_tls_disabled | host           | known_hosts                                                                                                  | labels | name | port
--------------------------------------+------------+----------------------------+--------------------+----------------+--------------------------------------------------------------------------------------------------------------+--------+------+------
 f25492d6-be27-41c8-b379-68182411381b |      token |                      False |              False | 192.168.200.11 | {'192.168.200.11', '192.168.200.12', '192.168.200.13', '192.168.200.21', '192.168.200.22', '192.168.200.23'} |   null |      |    0

(1 rows)
cqlsh> update scylla_manager.cluster set host = '';
SyntaxException: line 1:43 : Syntax error
cqlsh> update scylla_manager.cluster set host = '' where id = 
cqlsh> update scylla_manager.cluster set host = '' where id = f25492d6-be27-41c8-b379-68182411381b;
cqlsh> select * from scylla_manager.cluster;

 id                                   | auth_token | force_non_ssl_session_port | force_tls_disabled | host | known_hosts                                                                                                  | labels | name | port
--------------------------------------+------------+----------------------------+--------------------+------+--------------------------------------------------------------------------------------------------------------+--------+------+------
 f25492d6-be27-41c8-b379-68182411381b |      token |                      False |              False |      | {'192.168.200.11', '192.168.200.12', '192.168.200.13', '192.168.200.21', '192.168.200.22', '192.168.200.23'} |   null |      |    0

(1 rows)
cqlsh> exit
➜  scylla-manager git:(ml/fix-4028) cqlsh                                                                                                                                                                                                                       <aws:dbaas-lab-developer>
Connected to SM Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select * from scylla_manager.cluster;

 id                                   | auth_token | force_non_ssl_session_port | force_tls_disabled | host | known_hosts                                                                                                  | labels | name | port
--------------------------------------+------------+----------------------------+--------------------+------+--------------------------------------------------------------------------------------------------------------+--------+------+------
 f25492d6-be27-41c8-b379-68182411381b |      token |                      False |              False |      | {'192.168.200.11', '192.168.200.12', '192.168.200.13', '192.168.200.21', '192.168.200.22', '192.168.200.23'} |   null |      |    0

(1 rows)
cqlsh> exit
➜  scylla-manager git:(ml/fix-4028) ./sctool.dev cluster update -c f25492d6-be27-41c8-b379-68182411381b --host 192.168.200.13  

Scylla-manager started with

13:57:51.145    ERROR   cluster Missing --host flag. Using only previously discovered hosts instead     {"cluster ID": "f25492d6-be27-41c8-b379-68182411381b", "_trace_id": "TqpEHNhfRn-_vRguMF6k8A"}
github.com/scylladb/go-log.Logger.log
        /home/karkok/dev/scylla-manager/vendor/github.com/scylladb/go-log/logger.go:101
github.com/scylladb/go-log.Logger.Error
        /home/karkok/dev/scylla-manager/vendor/github.com/scylladb/go-log/logger.go:84
github.com/scylladb/scylla-manager/v3/pkg/service/cluster.(*Service).discoverClusterHosts
        /home/karkok/dev/scylla-manager/pkg/service/cluster/service.go:179
github.com/scylladb/scylla-manager/v3/pkg/service/cluster.(*Service).discoverAndSetClusterHosts
        /home/karkok/dev/scylla-manager/pkg/service/cluster/service.go:159
github.com/scylladb/scylla-manager/v3/pkg/service/cluster.(*Service).CreateClientNoCache
        /home/karkok/dev/scylla-manager/pkg/service/cluster/service.go:140
github.com/scylladb/scylla-manager/v3/pkg/service/configcache.(*Service).updateSingle
        /home/karkok/dev/scylla-manager/pkg/service/configcache/service.go:157
github.com/scylladb/scylla-manager/v3/pkg/service/configcache.(*Service).updateAll.func1
        /home/karkok/dev/scylla-manager/pkg/service/configcache/service.go:208

... what is fine.

Eventually, it is possible to update the host with

➜  scylla-manager git:(ml/fix-4028) ./sctool.dev cluster update -c f25492d6-be27-41c8-b379-68182411381b --host 192.168.200.13                                                                                                                                   <aws:dbaas-lab-developer>
➜  scylla-manager git:(ml/fix-4028) cqlsh                                                                                                                                                                                                                       <aws:dbaas-lab-developer>
Connected to SM Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.0.8 | CQL spec 3.3.1 | Native protocol v4]
Use HELP for help.
cqlsh> select * from scylla_manager.cluster;

 id                                   | auth_token | force_non_ssl_session_port | force_tls_disabled | host           | known_hosts                                                                                                  | labels | name | port
--------------------------------------+------------+----------------------------+--------------------+----------------+--------------------------------------------------------------------------------------------------------------+--------+------+------
 f25492d6-be27-41c8-b379-68182411381b |      token |                      False |              False | 192.168.200.13 | {'192.168.200.11', '192.168.200.12', '192.168.200.13', '192.168.200.21', '192.168.200.22', '192.168.200.23'} |   null |      |    0

(1 rows)

@karol-kokoszka
Copy link
Collaborator

@Michal-Leszczynski can you run operator tests on that ?

Copy link
Collaborator

@karol-kokoszka karol-kokoszka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR is OK, let's merge after getting the results from operator tests.

@Michal-Leszczynski
Copy link
Collaborator Author

Operator tests passed, but since this time I added some more safety nets and tests, so I will re-trigger them.

@Michal-Leszczynski
Copy link
Collaborator Author

Operator tests started here: scylladb/scylla-operator#2105 (comment).

@Michal-Leszczynski
Copy link
Collaborator Author

@karol-kokoszka operator tests passed: scylladb/scylla-operator#2105 (comment)!

@karol-kokoszka
Copy link
Collaborator

OK, so merging to master and backporting to branch-3.3
let's start the operator tests on the backport PR as well

@karol-kokoszka karol-kokoszka merged commit 4ef2a82 into master Sep 11, 2024
45 of 51 checks passed
@karol-kokoszka karol-kokoszka deleted the ml/fix-4028 branch September 11, 2024 13:21
@karol-kokoszka karol-kokoszka mentioned this pull request Sep 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Scylla Manager regression in 3.3.2: segmentation fault
2 participants