SSH Timeout Error on first backup-cluster or restore #689

Closed
lopesjp opened this issue Nov 29, 2023 · 1 comment · Fixed by #804
Labels
assess Issues in the state 'assess' configuration help-wanted Issues in the state 'help-wanted'

lopesjp commented Nov 29, 2023

Hi 👋

We have been trying Medusa on a Cassandra cluster with 10 nodes and around 1 TB of data.

To perform the first backup of the cluster, we run medusa backup-cluster --backup-name <name> --mode differential.

The backups go to an S3 bucket that does not hold any backups yet, so I assume that, although the mode is differential, this first run is effectively a full backup?
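To make that assumption concrete, here is a toy model of a differential upload. It is not Medusa's implementation, only a sketch under the assumption that "differential" means "upload the files the bucket does not already hold". The function name `files_to_upload` is hypothetical, and the two file names are borrowed from the logs further down purely as sample data.

```python
def files_to_upload(snapshot_files, bucket_files):
    """Return the snapshot files that are not yet in the bucket, sorted.

    Toy model only: real tools may also compare sizes or checksums.
    """
    return sorted(set(snapshot_files) - set(bucket_files))


# Sample file names taken from the logs in this issue.
snapshot = ["mc-455-big-TOC.txt", "mc-6305-big-Data.db"]

# First run against an empty bucket: every file is selected.
first_run = files_to_upload(snapshot, [])

# Later run: only the file that is missing from the bucket is selected.
later_run = files_to_upload(snapshot, ["mc-455-big-TOC.txt"])

if __name__ == "__main__":
    print("first run :", first_run)
    print("later run :", later_run)
```

Under this model, a differential run against an empty bucket selects the same files a full backup would, which is consistent with the long runtime described next.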

Nonetheless, the curious thing we found was that the backup was incomplete for the last nodes.
We ran the command on node 01 at 17:28 and it ran until 19:15. Checking the process logs, we noticed this error:

[2023-11-27 19:15:38,899] ERROR: This error happened during the cluster backup: (-1, b'Socket error: disconnected')
Traceback (most recent call last):
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/backup_cluster.py", line 64, in orchestrate
    backup.execute(cql_session_provider)
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/backup_cluster.py", line 150, in execute
    self._upload_backup()
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/backup_cluster.py", line 172, in _upload_backup
    hosts_variables={})
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/orchestration.py", line 83, in pssh_run
    display_output(error)
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/orchestration.py", line 26, in display_output
    for line in host_out.stdout:
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/pssh/clients/base/single.py", line 246, in read_output_buffer
    for line in output_buffer:
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/pssh/clients/ssh/single.py", line 301, in _read_output
    _reader.get(timeout=timeout)
  File "src/gevent/greenlet.py", line 805, in gevent._gevent_cgreenlet.Greenlet.get
  File "src/gevent/greenlet.py", line 373, in gevent._gevent_cgreenlet.Greenlet._raise_exception
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/gevent/_compat.py", line 66, in reraise
    raise value.with_traceback(tb)
  File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/pssh/clients/ssh/single.py", line 324, in _read_output_to_buffer
    size, data = channel.read_nonblocking(is_stderr=is_stderr)
  File "ssh/channel.pyx", line 189, in ssh.channel.Channel.read_nonblocking
  File "ssh/utils.pyx", line 94, in ssh.utils.handle_error_codes
ssh.exceptions.SSHError: (-1, b'Socket error: disconnected')

This error causes the backup-cluster process to crash and triggers the clean-up of the snapshots:

[2023-11-27 19:15:38,898] ERROR: Job executing "mkdir -p /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef; cd /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef && medusa-wrapper sudo medusa  -vvv backup-node --backup-name test-all-cluster-drk   --mode differential" ran and finished with errors on following nodes...
[2023-11-27 19:15:38,898] DEBUG: Waiting for stdout reader
[2023-11-27 19:15:38,899] ERROR: This error happened during the cluster backup: (-1, b'Socket error: disconnected')
...
[2023-11-27 19:15:38,907] ERROR: Something went wrong! Attempting to clean snapshots and exit.
...
[2023-11-27 19:15:54,756] INFO: Job executing "nodetool -Dcom.sun.jndi.rmiURLParsing=legacy -u <user> -pw <password> clearsnapshot -t medusa-test-all-cluster-drk" ran and finished Successfully on all nodes.
[2023-11-27 19:15:54,757] INFO: All nodes successfully cleared their snapshot.

Meanwhile, the log of one of the incomplete nodes shows that its upload was still in progress; it then failed because the snapshot was no longer there.

[2023-11-27 19:15:11,897] DEBUG: https://s3-eu-west-1.amazonaws.com:443 "PUT /sbk-medusa-prd/data/fbp_markets_journal_store/messages-fd3f4700de1a11eabb225d971d2361fe/mc-455-big-TOC.txt HTTP/1.1" 200 0
...
[2023-11-27 19:20:21,347] ERROR: Error occurred during backup: awscli cp failed. Max attempts exceeded. Check /tmp/awscli_50571839-2d02-4c4d-8c52-afdad2e56732.output for more informations.
...
The user-provided path /var/lib/cassandra/data/fbp_markets_journal_store/messages-fd3f4700de1a11eabb225d971d2361fe/snapshots/medusa-test-all-cluster-drk/mc-6305-big-Data.db does not exist.
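The ordering of these events can be checked with a small script. This is only a sketch: the timestamps are copied from the log excerpts above, the event labels are my own wording, and it assumes the clocks on the orchestrating node and the backed-up node are reasonably in sync.

```python
from datetime import datetime

# Timestamps copied from the log excerpts in this issue; labels are descriptive only.
EVENTS = {
    "last successful PUT seen on the node": "2023-11-27 19:15:11,897",
    "SSH error on the orchestrating node": "2023-11-27 19:15:38,899",
    "snapshots cleared on all nodes": "2023-11-27 19:15:54,757",
    "awscli cp failure on the node": "2023-11-27 19:20:21,347",
}


def parse(ts):
    """Parse a Medusa-style log timestamp such as '2023-11-27 19:15:38,899'."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


def timeline(events):
    """Return (label, seconds since the earliest event) in chronological order."""
    parsed = sorted((parse(ts), label) for label, ts in events.items())
    start = parsed[0][0]
    return [(label, (when - start).total_seconds()) for when, label in parsed]


if __name__ == "__main__":
    for label, offset in timeline(EVENTS):
        print(f"+{offset:8.3f}s  {label}")
```

The snapshots were cleared about 16 seconds after the SSH error, while the node kept retrying its upload for roughly four and a half more minutes before giving up on files that no longer existed.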

I assume this happens because backing up all nodes and all data takes a long time, and the SSH connection times out.

But what is this SSH connection?
Does the first node open a connection to every node to execute the command mkdir -p /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef; cd /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef && medusa-wrapper sudo medusa -vvv backup-node --backup-name test-all-cluster-drk --mode differential?
Is the connection opened at the beginning and kept open until all nodes have finished the backup?

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: MED-18

@lopesjp lopesjp changed the title from "SSH Error when running backup-cluster for a" to "SSH Error when running first backup-cluster" Nov 29, 2023

lopesjp commented Jan 3, 2024

As an update, this is clearly related to the SSH keep-alive.
It happens when performing the first backup of a large volume of data, or a restore.

Both actions establish an SSH connection that must stay open until the end. We hit this on a cluster of 10 nodes with more than 1 TB of data.

Is there a recommended approach for maintaining the Medusa SSH connection without resorting to constantly adjusting the SSH keep-alive interval settings on the node from which the command is executed?
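For context, this is roughly what client-side keep-alive tuning looks like with a stock OpenSSH client. It is a generic sketch, not Medusa's mechanism: the traceback above shows Medusa going through parallel-ssh rather than the `ssh` binary, so whether these options affect it is an assumption this thread does not confirm. `ServerAliveInterval` and `ServerAliveCountMax` are standard OpenSSH client options; the helper names and the host name are hypothetical.

```python
def ssh_keepalive_argv(host, remote_command, interval_s=60, max_missed=3):
    """Build an `ssh` command line that sends keep-alive probes on an idle connection.

    interval_s -> ServerAliveInterval: seconds of silence before a probe is sent.
    max_missed -> ServerAliveCountMax: unanswered probes tolerated before disconnecting.
    """
    if interval_s <= 0 or max_missed <= 0:
        raise ValueError("interval_s and max_missed must be positive")
    return [
        "ssh",
        "-o", f"ServerAliveInterval={interval_s}",
        "-o", f"ServerAliveCountMax={max_missed}",
        host,
        remote_command,
    ]


def unresponsive_timeout_s(interval_s, max_missed):
    """Approximate time before the client drops a server that has stopped answering."""
    return interval_s * max_missed


if __name__ == "__main__":
    # Hypothetical host name; the remote command mirrors the one in the logs above.
    argv = ssh_keepalive_argv(
        "cassandra-node-02",
        "medusa backup-node --backup-name <name> --mode differential",
    )
    print(argv)
    print(unresponsive_timeout_s(60, 3), "seconds")
```

The probes keep traffic flowing on a connection that would otherwise sit silent while the remote backup runs, which is the behaviour the keep-alive interval settings mentioned above are meant to provide.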

@lopesjp lopesjp changed the title from "SSH Error when running first backup-cluster" to "SSH Timeout Error on first backup-cluster or restore" Jan 3, 2024