disrupt_replace_service_level_using_detach_during_load validate_role_service_level_attributes_against_db fails with: IndexError: list index out of range #9671

Open
2 tasks
yarongilor opened this issue Jan 7, 2025 · 8 comments

@yarongilor
Contributor

Packages

Scylla version: 2024.3.0~dev-20241220.e8463f6b719f with build-id bb749b8a21072b70e631576259d767ecfe654583

Kernel Version: 6.8.0-1021-aws

Issue description

  • This issue is a regression.
  • It is unknown if this issue is a regression.

Describe your issue in detail and steps it took to produce it.

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

disrupt_replace_service_level_using_detach_during_load	elasticity-test-nemesis-master-db-node-2556bfba-2	Failed	2024-12-23 19:55:30	2024-12-23 20:31:03
Nemesis Information
Class: Sisyphus
Name: disrupt_replace_service_level_using_detach_during_load
Status: Failed
Failure reason
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5452, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4938, in disrupt_replace_service_level_using_detach_during_load
    self.format_error_for_sla_test_and_raise(error_events=error_events)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5031, in format_error_for_sla_test_and_raise
    raise NemesisSubTestFailure("\n".join(f"Step: {error.step}. Error:\n - {error}"
sdcm.exceptions.NemesisSubTestFailure: Step: Run stress command and validate io_queue_operations during load. Error:
 - (TestStepEvent Severity.ERROR) period_type=end event_id=afcdaccf-4177-4637-80ab-916770064553 during_nemesis=ReplaceServiceLevelUsingDetachDuringLoad duration=10m20s: step=Run stress command and validate io_queue_operations during load  errors=list index out of range
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/sla_tests.py", line 31, in run_stress_and_validate_scheduler_io_queue_operations_during_load
    self.validate_io_queue_operations(start_time=start_time,
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/libs/sla_utils.py", line 157, in validate_io_queue_operations
    user['role'].validate_role_service_level_attributes_against_db()
  File "/home/ubuntu/scylla-cluster-tests/test_lib/sla.py", line 386, in validate_role_service_level_attributes_against_db
    LOGGER.debug("Service level from LIST: %s", service_level[0].service_level)
IndexError: list index out of range

Step: Attach service level 'sl50_dda50134' with 50 shares to role500_dda50134. Validate io_queue_operations during load. Error:
 - (TestStepEvent Severity.ERROR) period_type=end event_id=b6287188-84d6-4586-b294-aed18dedd245 during_nemesis=ReplaceServiceLevelUsingDetachDuringLoad duration=0s: step=Attach service level 'sl50_dda50134' with 50 shares to role500_dda50134. Validate io_queue_operations during load  errors=list index out of range
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/sla_tests.py", line 121, in attach_sl_and_validate_io_queue_operations
    role_for_attach.validate_role_service_level_attributes_against_db()
  File "/home/ubuntu/scylla-cluster-tests/test_lib/sla.py", line 386, in validate_role_service_level_attributes_against_db
    LOGGER.debug("Service level from LIST: %s", service_level[0].service_level)
IndexError: list index out of range

Installation details

Cluster size: 3 nodes (i4i.large)

Scylla Nodes used in this run:

  • elasticity-test-nemesis-master-db-node-2556bfba-3 (52.210.179.65 | 10.4.14.61) (shards: 2)
  • elasticity-test-nemesis-master-db-node-2556bfba-2 (54.216.17.80 | 10.4.15.159) (shards: 2)
  • elasticity-test-nemesis-master-db-node-2556bfba-1 (54.220.142.116 | 10.4.15.152) (shards: 2)

OS / Image: ami-0af6c6b0a814f34a7 (aws: undefined_region)

Test: byo-longevity-test-yg2
Test id: 2556bfba-bff7-4ec5-833d-312330270ab4
Test name: scylla-staging/yarongilor/byo-longevity-test-yg2
Test method: longevity_sla_test.LongevitySlaTest.test_custom_time
Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 2556bfba-bff7-4ec5-833d-312330270ab4
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 2556bfba-bff7-4ec5-833d-312330270ab4

Logs:

Jenkins job URL
Argus

@fruch
Contributor

fruch commented Jan 13, 2025

@yarongilor

you are not running it with the needed configuration for those nemesis

< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > config_files:
< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > - test-cases/performance/perf-regression-latency-i4i_2xlarge-elasticity-90-percent-with-nemesis.yaml
< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > - configurations/disable_kms.yaml
< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > - configurations/network_config/two_interfaces.yaml
< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > - configurations/ldap-authorization.yaml

'configurations/nemesis/additional_configs/sla_config.yaml' is missing

@fruch fruch closed this as completed Jan 13, 2025
@yarongilor
Contributor Author

@yarongilor

you are not running it with the needed configuration for those nemesis

< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > config_files:
< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > - test-cases/performance/perf-regression-latency-i4i_2xlarge-elasticity-90-percent-with-nemesis.yaml
< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > - configurations/disable_kms.yaml
< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > - configurations/network_config/two_interfaces.yaml
< t:2024-12-23 17:15:03,765 f:sct_config.py   l:2144 c:sdcm.sct_config      p:INFO  > - configurations/ldap-authorization.yaml

'configurations/nemesis/additional_configs/sla_config.yaml' is missing

@fruch, I wouldn't have guessed that from an "IndexError: list index out of range" error,
so I think it is still an SCT issue.

@yarongilor yarongilor reopened this Jan 13, 2025
@pehala
Contributor

pehala commented Jan 13, 2025

@fruch, I wouldn't have guessed that from an "IndexError: list index out of range" error,
so I think it is still an SCT issue.

In this case, it is either a documentation issue or an enhancement to the existing nemesis, so I would create a new issue for this with a more focused description.
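One possible shape for such an enhancement, as a minimal sketch only: check up front that the config file named earlier in this thread was loaded, and fail with a message that names it. The helper names (`missing_config_files`, `ensure_sla_config_loaded`) and the idea of passing the loaded file list explicitly are assumptions for illustration, not existing SCT code.

```python
from typing import Iterable, List

# The config file reported as missing earlier in this thread.
SLA_CONFIG = "configurations/nemesis/additional_configs/sla_config.yaml"


def missing_config_files(loaded: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required config files that are absent from the loaded list."""
    loaded_set = set(loaded)
    return [path for path in required if path not in loaded_set]


def ensure_sla_config_loaded(loaded: Iterable[str]) -> None:
    """Hypothetical pre-check: fail early with an actionable message."""
    missing = missing_config_files(loaded, [SLA_CONFIG])
    if missing:
        raise ValueError(
            "SLA nemesis requires config files that were not loaded: "
            + ", ".join(missing)
        )
```

With the config list from the log above, such a check would name the missing file at the start of the nemesis rather than surfacing later as an IndexError.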

@fruch
Contributor

fruch commented Jan 13, 2025

@yarongilor
how did you know to add configurations/network_config/two_interfaces.yaml and configurations/ldap-authorization.yaml ?
it's documented exactly the same way

also seems you added the following boolean in your configuration :

        # Temporary solution. We do not want to run SLA nemeses during not-SLA test until the feature is stable
        dict(name="sla", env="SCT_SLA", type=boolean,
             help="run SLA nemeses if the test is SLA only"),

So what documentation do you expect? (If you expect it, please write it.)

@yarongilor
Contributor Author

@yarongilor
how did you know to add configurations/network_config/two_interfaces.yaml and configurations/ldap-authorization.yaml ?
it's documented exactly the same way

also seems you added the following boolean in your configuration :

        # Temporary solution. We do not want to run SLA nemeses during not-SLA test until the feature is stable
        dict(name="sla", env="SCT_SLA", type=boolean,
             help="run SLA nemeses if the test is SLA only"),

So what documentation do you expect? (If you expect it, please write it.)

@fruch, as can be seen above, this parameter's help text is not detailed enough to understand what's missing. I actually still have no idea what's missing.
As for configurations/ldap-authorization.yaml, I know to add it because the LDAP nemesis specifically reports what's missing when it skips.
I have a vague feeling about the root cause here. You mentioned that configurations/nemesis/additional_configs/sla_config.yaml is missing, but which specific parameter is it? The nemesis didn't skip for missing parameters. That is because the test used configurations/ldap-authorization.yaml, which has:

use_ldap: true
ldap_server_type: 'openldap'
use_ldap_authorization: true
authenticator: 'PasswordAuthenticator'
authenticator_user: cassandra
authenticator_password: cassandra
authorizer: 'CassandraAuthorizer'

Could it be that the combination of LDAP and SLA is somehow the root of the problem?
If so, could we consider adding something like this to the SLA nemeses:

        if self.cluster.params.get('use_ldap'):
            raise UnsupportedNemesis("SLA feature can't work with Ldap")

Or should we consider adjusting the SLA nemesis code to work with LDAP somehow?
I'm not too familiar with SLA, so perhaps @juliayakovlev could advise here.
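To make the idea concrete, here is a rough, self-contained sketch of a pre-check that collects every reason to skip and reports them together, the way the LDAP nemesis reports what is missing. `UnsupportedNemesis` is redefined locally as a stand-in, `sla_skip_reasons` and `ensure_sla_nemesis_supported` are hypothetical names, and the LDAP rule is only the unconfirmed guess from this comment, not an established limitation.

```python
from typing import List, Mapping


class UnsupportedNemesis(Exception):
    """Local stand-in for the exception named in the snippet above."""


def sla_skip_reasons(params: Mapping[str, object]) -> List[str]:
    """Collect human-readable reasons why an SLA nemesis should be skipped."""
    reasons = []
    if not params.get("sla"):
        reasons.append("'sla' (SCT_SLA) is not enabled")
    if params.get("use_ldap"):
        # Assumption: an SLA/LDAP conflict is a hypothesis, not a confirmed fact.
        reasons.append("'use_ldap' is enabled; SLA with LDAP is assumed unsupported")
    return reasons


def ensure_sla_nemesis_supported(params: Mapping[str, object]) -> None:
    """Raise UnsupportedNemesis listing every reason, or return silently."""
    reasons = sla_skip_reasons(params)
    if reasons:
        raise UnsupportedNemesis("SLA nemesis skipped: " + "; ".join(reasons))
```

Reporting all reasons at once means a skipped run shows exactly which parameters need to change.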

@juliayakovlev
Contributor

From the SCT log I see that service level 'sl250_dda50134' was created and attached to role role250_dda50134.

< t:2024-12-23 19:55:30,348 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-12-23 19:55:30.345: (DisruptionEvent Severity.NORMAL) period_type=begin event_id=48401a81-6ddc-4e7b-a04d-082f6ec94a1b: nemesis_name=ReplaceServiceLevelUsingDetachDuringLoad target_node=Node elasticity-test-nemesis-master-db-node-2556bfba-2 [54.216.17.80 | 10.4.15.159]

< t:2024-12-23 19:55:31,641 f:sla.py          l:361  c:test_lib.sla         p:DEBUG > CREATE role query: CREATE ROLE IF NOT EXISTS role250_dda50134 WITH password = 'rolep250' AND login = True AND superuser = True
< t:2024-12-23 19:55:31,641 f:common.py       l:1317 c:utils                p:DEBUG > Executing CQL 'CREATE ROLE IF NOT EXISTS role250_dda50134 WITH password = 'rolep250' AND login = True AND superuser = True' ...
< t:2024-12-23 19:55:31,737 f:sla.py          l:150  c:test_lib.sla         p:DEBUG > Create service level query: CREATE SERVICE_LEVEL IF NOT EXISTS 'sl250_dda50134' WITH shares = 250
< t:2024-12-23 19:55:31,737 f:common.py       l:1317 c:utils                p:DEBUG > Executing CQL 'CREATE SERVICE_LEVEL IF NOT EXISTS 'sl250_dda50134' WITH shares = 250' ...
< t:2024-12-23 19:55:31,737 f:common.py       l:1325 c:utils                p:DEBUG > Executing CQL 'CREATE SERVICE_LEVEL IF NOT EXISTS 'sl250_dda50134' WITH shares = 250' ...
< t:2024-12-23 19:55:31,752 f:sla.py          l:152  c:test_lib.sla         p:DEBUG > Service level 'sl250_dda50134' has been created
< t:2024-12-23 19:55:31,752 f:sla.py          l:273  c:test_lib.sla         p:DEBUG > Attach service level query: ATTACH SERVICE_LEVEL 'sl250_dda50134' TO role250_dda50134
< t:2024-12-23 19:55:31,752 f:common.py       l:1317 c:utils                p:DEBUG > Executing CQL 'ATTACH SERVICE_LEVEL 'sl250_dda50134' TO role250_dda50134' ...
< t:2024-12-23 19:55:32,208 f:sla_utils.py    l:405  c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl250_dda50134' on the node 'elasticity-test-nemesis-master-db-node-2556bfba-1'
< t:2024-12-23 19:55:33,361 f:sla_utils.py    l:425  c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl250_dda50134' on the node 'elasticity-test-nemesis-master-db-node-2556bfba-1'
< t:2024-12-23 19:55:33,361 f:sla_utils.py    l:405  c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl250_dda50134' on the node 'elasticity-test-nemesis-master-db-node-2556bfba-2'
< t:2024-12-23 19:55:34,477 f:sla_utils.py    l:425  c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl250_dda50134' on the node 'elasticity-test-nemesis-master-db-node-2556bfba-2'
< t:2024-12-23 19:55:34,477 f:sla_utils.py    l:405  c:sdcm.sla.libs.sla_utils p:DEBUG > Start wait for service level propagated for service_level 'sl250_dda50134' on the node 'elasticity-test-nemesis-master-db-node-2556bfba-3'
< t:2024-12-23 19:55:35,596 f:sla_utils.py    l:425  c:sdcm.sla.libs.sla_utils p:DEBUG > Finish wait for service level propagated for service_level 'sl250_dda50134' on the node 'elasticity-test-nemesis-master-db-node-2556bfba-3'

But 'LIST ATTACHED SERVICE_LEVEL OF role250_dda50134' returned an empty list, which is very weird: it looks as if no SL is attached to the role, although one should be. This caused the IndexError:

< t:2024-12-23 20:06:00,243 f:sla.py          l:317  c:test_lib.sla         p:DEBUG > List attached service level(s) query: LIST ATTACHED SERVICE_LEVEL OF role250_dda50134
< t:2024-12-23 20:06:00,243 f:common.py       l:1325 c:utils                p:DEBUG > Executing CQL 'LIST ATTACHED SERVICE_LEVEL OF role250_dda50134' ...
< t:2024-12-23 20:06:00,246 f:sla.py          l:374  c:test_lib.sla         p:DEBUG > List of service levels for role role250_dda50134: []
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2024-12-23 20:06:00.247: (TestStepEvent Severity.ERROR) period_type=end event_id=afcdaccf-4177-4637-80ab-916770064553 during_nemesis=ReplaceServiceLevelUsingDetachDuringLoad duration=10m20s: step=Run stress command and validate io_queue_operations during load  errors=list index out of range
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > Traceback (most recent call last):
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/sla_tests.py", line 31, in run_stress_and_validate_scheduler_io_queue_operations_during_load
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     self.validate_io_queue_operations(start_time=start_time,
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/sdcm/sla/libs/sla_utils.py", line 157, in validate_io_queue_operations
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     user['role'].validate_role_service_level_attributes_against_db()
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >   File "/home/ubuntu/scylla-cluster-tests/test_lib/sla.py", line 386, in validate_role_service_level_attributes_against_db
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  >     LOGGER.debug("Service level from LIST: %s", service_level[0].service_level)
< t:2024-12-23 20:06:00,249 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > IndexError: list index out of range

According to the SCT code, the failing line LOGGER.debug("Service level from LIST: %s", service_level[0].service_level) should not have been reached.
The function was expected to raise ValueError because the service_level list is empty, or at least to print "No attached Service level to the role %s" and return:

    def validate_role_service_level_attributes_against_db(self):
        service_level = self.list_user_role_attached_service_levels()
        LOGGER.debug("List of service levels for role %s: %s", self.name, service_level)
        if not service_level and self.attached_service_level:
            ValueError(f"No Service Level attached to the role '{self.name}'. But it is expected that Service Level "
                       f"'{self._attached_service_level_name}' is attached to this role. Validate if it is test or "
                       "Scylla issue")
        elif not self.attached_service_level and service_level:
            ValueError(f"Found attached Service Level '{service_level[0].service_level}' to the role '{self.name}'. "
                       "But it is expected that no attached Service Level. Validate if it is test or Scylla issue")
        elif not service_level and not self.attached_service_level:
            LOGGER.debug("No attached Service level to the role %s", self.name)
            return

        LOGGER.debug("Service level from LIST: %s", service_level[0].service_level)

So there are 2 problems here:

  1. 'LIST ATTACHED SERVICE_LEVEL OF role250_dda50134' returned an empty list unexpectedly
  2. ValueError was not raised (the code above constructs the exception but has no `raise` in front of it)
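For the second problem, a minimal sketch of the same checks with `raise` added. This is a standalone illustration, not necessarily what the eventual fix looks like: `AttachedRow` and the function `validate_attached_service_level` are hypothetical stand-ins, and the rows are passed in directly instead of being queried from the DB.

```python
import logging
from collections import namedtuple

LOGGER = logging.getLogger(__name__)

# Hypothetical stand-in for a row returned by LIST ATTACHED SERVICE_LEVEL.
AttachedRow = namedtuple("AttachedRow", ["role", "service_level"])


def validate_attached_service_level(role_name, expected_sl_name, listed_rows):
    """Compare the expected attachment with the rows listed by the DB.

    Returns the listed service level name, or None when nothing is expected
    and nothing is listed. Raises ValueError on a mismatch.
    """
    LOGGER.debug("List of service levels for role %s: %s", role_name, listed_rows)
    if not listed_rows and expected_sl_name:
        # Expected an attachment, but the DB lists none.
        raise ValueError(
            f"No Service Level attached to the role '{role_name}', but "
            f"'{expected_sl_name}' is expected to be attached"
        )
    if listed_rows and not expected_sl_name:
        # DB lists an attachment that the test does not expect.
        raise ValueError(
            f"Found attached Service Level '{listed_rows[0].service_level}' "
            f"on the role '{role_name}', but none is expected"
        )
    if not listed_rows:
        LOGGER.debug("No attached Service level to the role %s", role_name)
        return None
    LOGGER.debug("Service level from LIST: %s", listed_rows[0].service_level)
    return listed_rows[0].service_level
```

With the inputs from this run (expected 'sl250_dda50134', empty list from the DB), this version raises ValueError with a descriptive message instead of reaching the indexing line.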

About LDAP and SLA: I do not know how LDAP might break the test.

@temichus
Contributor

Problem 2, at least, can be easily solved: #9829

@temichus
Contributor

temichus commented Jan 22, 2025

@yarongilor, does this issue still reproduce after merging #9829?

I mean:

'LIST ATTACHED SERVICE_LEVEL OF role250_dda50134'  returned empty list unexpectedly

cc @pehala
