S3: medusa getting stuck indefinitely while fetching uploaded object #697

Closed
sandeepmallik opened this issue Dec 18, 2023 · 11 comments · Fixed by #691
Labels
done Issues in the state 'done'

Comments

@sandeepmallik

sandeepmallik commented Dec 18, 2023

While uploading a backup, Medusa gets stuck indefinitely while fetching the uploaded object, and we're not sure what is causing this. We have a cluster of 96 nodes. We recently added 6 nodes, and the backup is failing on 4 of them with the same issue; on the older nodes, backups are working fine. Nothing has changed at the infrastructure level.

The object exists in S3 but Medusa is unable to fetch/recognise it. Does it have something to do with the checksum after uploading the object? Is there an option to retry getting the object?

[2023-12-18 08:15:17,023] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-TOC.txt (92.000B)
[2023-12-18 08:15:17,024] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-TOC.txt as single part
[2023-12-18 08:15:17,121] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-Statistics.db (11.435KiB)
[2023-12-18 08:15:17,121] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Statistics.db as single part
[2023-12-18 08:15:17,126] DEBUG: https://s3.amazonaws.com:443 "PUT /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-TOC.txt HTTP/1.1" 200 0
[2023-12-18 08:15:17,129] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-Summary.db (264.101KiB)
[2023-12-18 08:15:17,129] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Summary.db as single part
[2023-12-18 08:15:17,131] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-CompressionInfo.db (50.417KiB)
[2023-12-18 08:15:17,131] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-CompressionInfo.db as single part
[2023-12-18 08:15:17,147] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-Data.db (141.154MiB)
[2023-12-18 08:15:17,147] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Data.db as multi part
[2023-12-18 08:15:17,147] DEBUG: aws s3 cp /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-Data.db s3://ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Data.db
[2023-12-18 08:15:17,163] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-Index.db (30.050MiB)
[2023-12-18 08:15:17,163] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Index.db as single part
[2023-12-18 08:15:17,543] DEBUG: https://s3.amazonaws.com:443 "PUT /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Summary.db HTTP/1.1" 200 0
[2023-12-18 08:15:17,544] DEBUG: https://s3.amazonaws.com:443 "PUT /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-CompressionInfo.db HTTP/1.1" 200 0
[2023-12-18 08:15:17,545] DEBUG: https://s3.amazonaws.com:443 "PUT /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Statistics.db HTTP/1.1" 200 0
[2023-12-18 08:15:17,548] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-Digest.crc32 (9.000B)
[2023-12-18 08:15:17,548] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Digest.crc32 as single part
[2023-12-18 08:15:18,561] DEBUG: https://s3.amazonaws.com:443 "PUT /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Digest.crc32 HTTP/1.1" 200 0
[2023-12-18 08:15:18,654] DEBUG: https://s3.amazonaws.com:443 "PUT /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Index.db HTTP/1.1" 200 0
[2023-12-18 08:15:19,810] DEBUG: [Storage] Getting object 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Data.db
[2023-12-18 08:15:19,835] DEBUG: https://s3.amazonaws.com:443 "HEAD /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Data.db HTTP/1.1" 200 0

[root@ip-172-21-181-249 ~]# aws s3 ls s3://ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Data.db
2023-12-18 08:15:18 148010978 mc-6212-big-Data.db

@adejanovski
Contributor

adejanovski commented Dec 18, 2023

Hi @sandeepmallik, it seems like you're not running a recent version of Medusa. We had a lot of issues with previous versions, which used a combination of libcloud and awscli to handle communications with S3.
One of the things we identified but couldn't fix back then was that some files were successfully uploaded with the aws s3 cp command (which is what I can see in the logs you pasted) but couldn't be fetched afterwards using libcloud.
This is exactly the behavior your cluster is exhibiting, and I can only advise upgrading to the latest version, which fixes this.
The storage backend was revamped in v0.16 and now uses boto3 to handle all interactions with S3, providing a rock-solid implementation.
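
For illustration only (this is not Medusa's actual code), a minimal boto3 sketch of the kind of HEAD-based existence check that replaces the old libcloud fetch path could look like this; the bucket, key, and retry settings below are placeholders:

import time
import boto3
from botocore.exceptions import ClientError

def object_exists(s3_client, bucket, key, attempts=5, delay=2):
    """Return True once HEAD succeeds; retry a few times if the key is not visible yet."""
    for attempt in range(attempts):
        try:
            s3_client.head_object(Bucket=bucket, Key=key)
            return True
        except ClientError as e:
            # Re-raise anything that is not a plain "not found".
            if e.response["Error"]["Code"] not in ("404", "NoSuchKey"):
                raise
            time.sleep(delay * (attempt + 1))
    return False

# Placeholder bucket/key, for illustration only.
s3 = boto3.client("s3")
print(object_exists(s3, "ec4-backup-prod", "path/to/mc-6212-big-Data.db"))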

@sandeepmallik
Author

Hi @adejanovski, we will upgrade to 0.16.3 and see if it works.

@sandeepmallik
Author

sandeepmallik commented Dec 19, 2023

@adejanovski It didn't work. We are using python3.9/pip3.9. After upgrading Medusa to 0.16.3 on the new nodes, I had to install these packages to satisfy Medusa's requirements. The old nodes are on 0.15.0.

pip3.9 install click-aliases==1.0.1
pip3.9 install wheel==0.32.0

Medusa is having an issue fetching credentials from the IAM role.
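
As a diagnostic sketch (not part of Medusa), this is one way to check which identity and credential source boto3 resolves on a node, e.g. whether the instance IAM role is actually picked up by the default credential chain:

import boto3

session = boto3.Session()
creds = session.get_credentials()
if creds is None:
    print("No credentials resolved by the default chain")
else:
    # 'method' reports the source, e.g. 'iam-role' for an instance profile.
    print("credential source:", creds.method)

# Cross-check the effective identity with STS.
print(boto3.client("sts").get_caller_identity()["Arn"])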

[2023-12-18 23:00:03,944] INFO: Registered backup id s3_prod_12_18_2023_23
[2023-12-18 23:00:03,945] INFO: Monitoring provider is noop
[2023-12-18 23:00:03,975] INFO: Using credentials CensoredCredentials(access_key_id=A..A, secret_access_key=*****, region=us-east-1)
[2023-12-18 23:00:03,975] INFO: Using S3 URL https://ec4-backup-prod.s3.amazonaws.com
[2023-12-18 23:00:04,329] ERROR: Error getting object from s3://ec4-backup-prod/152.2.82.199/s3_prod_12_18_2023_23/meta/schema.cql
[2023-12-18 23:00:04,329] INFO: Starting backup using Stagger: None Mode: differential Name: s3_prod_12_18_2023_23
[2023-12-18 23:00:04,329] INFO: Updated from existing status: -1 to new status: 0 for backup id: s3_prod_12_18_2023_23
[2023-12-18 23:00:04,330] INFO: Saving tokenmap and schema
[2023-12-18 23:04:05,842] INFO: Resolving ip address 152.2.82.199
[2023-12-18 23:04:05,843] INFO: ip address to resolve 152.2.82.199
.
.
.
[2023-12-18 23:08:02,027] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:08:22,581] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:09:02,706] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:10:22,840] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:12:22,967] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:14:23,133] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:16:23,260] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:16:23,261] ERROR: Issue occurred inside handle_backup Name: s3_prod_12_18_2023_23 Error: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:16:23,261] INFO: Updated from existing status: 0 to new status: 2 for backup id: s3_prod_12_18_2023_23
[2023-12-18 23:16:23,261] ERROR: Error occurred during backup: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:16:25,452] INFO: Using credentials CensoredCredentials(access_key_id=A..A, secret_access_key=*****, region=us-east-1)
[2023-12-18 23:16:25,453] INFO: Using S3 URL https://ec4-backup-prod.s3.amazonaws.com
[2023-12-18 23:24:47,640] INFO: Monitoring provider is noop
[2023-12-18 23:24:47,641] INFO: Starting purge
[2023-12-18 23:24:47,665] INFO: Using credentials CensoredCredentials(access_key_id=A..A, secret_access_key=*****, region=us-east-1)
[2023-12-18 23:24:47,665] INFO: Using S3 URL https://ec4-backup-prod.s3.amazonaws.com
[2023-12-18 23:24:47,749] INFO: Listing backups for 152.2.82.199
[2023-12-18 23:33:08,373] ERROR: This error happened during the purge: An error occurred (InvalidAccessKeyId) when calling the ListObjectsV2 operation: The AWS Access Key Id you provided does not exist in our records.

$ sudo medusa -vvv list-backups
[2023-12-19 07:20:16,770] DEBUG: Loading configuration from /etc/medusa/medusa.ini
[2023-12-19 07:20:16,773] INFO: Resolving ip address
[2023-12-19 07:20:16,774] INFO: ip address to resolve 172.21.182.99
[2023-12-19 07:20:16,774] DEBUG: Not resolving 172.21.182.99 as requested
[2023-12-19 07:20:16,774] DEBUG: Logging to file options: LoggingConfig(enabled='1', file='/var/log/medusa/medusa.log', format='[%(asctime)s] %(levelname)s: %(message)s', level='INFO', maxBytes='20000000', backupCount='50')
[2023-12-19 07:20:16,776] DEBUG: Loading storage_provider: s3
[2023-12-19 07:20:16,800] INFO: Using credentials CensoredCredentials(access_key_id=A..T, secret_access_key=*****, region=us-east-1)
[2023-12-19 07:20:16,800] INFO: Using S3 URL https://ec4-backup-prod.s3.amazonaws.com
[2023-12-19 07:20:16,800] DEBUG: Connecting to S3
[2023-12-19 07:20:16,901] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:20:16,901] DEBUG: Using selector: GeventSelector
[2023-12-19 07:20:36,956] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:21:17,018] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:22:37,117] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:24:37,244] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:26:37,383] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:28:37,517] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:28:37,588] DEBUG: Disconnecting from S3...
Traceback (most recent call last):
File "/usr/local/bin/medusa", line 8, in
sys.exit(cli())
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1157, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 92, in new_func
return ctx.invoke(f, obj, *args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.9/site-packages/medusa/medusacli.py", line 185, in list_backups
medusa.listing.list_backups(medusaconfig, show_all)
File "/usr/local/lib/python3.9/site-packages/medusa/listing.py", line 41, in list_backups
list_backups_w_storage(config, show_all, storage)
File "/usr/local/lib/python3.9/site-packages/medusa/listing.py", line 45, in list_backups_w_storage
cluster_backups = get_backups(storage, config, show_all)
File "/usr/local/lib/python3.9/site-packages/medusa/listing.py", line 26, in get_backups
cluster_backups = sorted(
File "/usr/local/lib/python3.9/site-packages/medusa/storage/init.py", line 358, in list_cluster_backups
node_backups = sorted(
File "/usr/local/lib/python3.9/site-packages/medusa/storage/init.py", line 179, in list_node_backups
backup_index_blobs = self.list_backup_index_blobs()
File "/usr/local/lib/python3.9/site-packages/medusa/storage/init.py", line 270, in list_backup_index_blobs
return self.storage_driver.list_objects(path)
File "/usr/local/lib/python3.9/site-packages/retrying.py", line 56, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/local/lib/python3.9/site-packages/retrying.py", line 266, in call
raise attempt.get()
File "/usr/local/lib/python3.9/site-packages/retrying.py", line 301, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/local/lib/python3.9/site-packages/six.py", line 719, in reraise
raise value
File "/usr/local/lib/python3.9/site-packages/retrying.py", line 251, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/usr/local/lib/python3.9/site-packages/medusa/storage/abstract_storage.py", line 71, in list_objects
objects = self.list_blobs(prefix=path)
File "/usr/local/lib/python3.9/site-packages/medusa/storage/abstract_storage.py", line 79, in list_blobs
objects = loop.run_until_complete(self._list_blobs(prefix))
File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/usr/local/lib/python3.9/site-packages/medusa/storage/s3_base_storage.py", line 202, in _list_blobs
response = self.s3_client.get_paginator('list_objects_v2').paginate(
File "/usr/local/lib/python3.9/site-packages/botocore/paginate.py", line 479, in build_full_result
for response in self:
File "/usr/local/lib/python3.9/site-packages/botocore/paginate.py", line 269, in iter
response = self._make_request(current_kwargs)
File "/usr/local/lib/python3.9/site-packages/botocore/paginate.py", line 357, in _make_request
return self._method(**current_kwargs)
File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 553, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 1009, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the ListObjectsV2 operation: The AWS Access Key Id you provided does not exist in our records.

$ aws s3 ls s3://ec4-backup-prod/index/backup_index/
PRE s3_prod_12_23_2023_11/
PRE s3_prod_12_23_2023_23/

@adejanovski
Contributor

Hi @sandeepmallik, we have a contributed fix for the role issue which you can track here.
Anything you can contribute to the discussion or in terms of testing would help a lot to get this merged in.
Thanks!

@sandeepmallik
Author

@adejanovski I have tested the fix on my cluster and it didn't work. I couldn't contribute at the code level as I am not familiar with Python.

@adejanovski adejanovski moved this to In Progress in K8ssandra Dec 20, 2023
@adejanovski adejanovski added the in-progress Issues in the state 'in-progress' label Dec 20, 2023
@adejanovski adejanovski linked a pull request Dec 20, 2023 that will close this issue
@adejanovski
Contributor

Thanks @sandeepmallik, we're having a conversation about using a slightly different fix for this, which would build the S3 client without passing the credentials.
I've linked #691 as a fix for this issue, which will update its status once we merge it.

@sandeepmallik
Author

Thanks @adejanovski. As a temporary fix, I will comment out these lines in s3_base_storage.py, since we only use an IAM role.

aws_access_key_id=credentials.access_key,
aws_secret_access_key=credentials.secret_key,
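
For reference, a minimal sketch of what building the client without explicit credentials looks like (this mirrors the idea discussed for #691, not its exact code; the region and retry settings are assumptions):

import boto3
from botocore.config import Config

# Sketch only: with no access key / secret key passed, boto3 falls back to its
# default credential chain (env vars, shared config, then the EC2 instance
# profile / IAM role), which is the behaviour wanted here.
s3_client = boto3.client(
    "s3",
    region_name="us-east-1",
    config=Config(retries={"max_attempts": 5, "mode": "standard"}),
)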

@adejanovski
Contributor

Sure, let us know how this works.

@sandeepmallik
Author

@adejanovski It didn't work. After the snapshot got created, it got stuck indefinitely. There is nothing in the logs, even with verbose enabled.

@sandeepmallik
Author

@adejanovski Tested a backup on ONE node with the new code (https://github.com/thelastpickle/cassandra-medusa/blob/082375b52e6a8f0e376586984444852cb7371c72/medusa/storage/s3_base_storage.py). I upgraded Medusa to 0.17.0 and replaced the above file. The backup worked.

Will test this procedure on the remaining nodes. If it works, then we are good to merge the change.

[2024-01-01 23:06:12,870] INFO: Creating snapshot
[2024-01-02 01:38:58,185] WARNING: Connection pool is full, discarding connection: backup.s3.amazonaws.com
[2024-01-02 01:38:58,267] WARNING: Connection pool is full, discarding connection: backup.s3.amazonaws.com
[2024-01-02 01:38:58,423] WARNING: Connection pool is full, discarding connection: backup.s3.amazonaws.com
[2024-01-02 03:45:54,869] INFO: Using credentials CensoredCredentials(access_key_id=None, secret_access_key=*****, region=us-east-1)
[2024-01-02 03:45:54,870] INFO: Connecting to s3 with args {}
[2024-01-02 03:46:09,490] INFO: Using credentials CensoredCredentials(access_key_id=None, secret_access_key=*****, region=us-east-1)
[2024-01-02 03:46:09,491] INFO: Connecting to s3 with args {}
[2024-01-02 04:23:31,587] INFO: Updating backup index
[2024-01-02 04:23:32,709] INFO: Backup done
[2024-01-02 04:23:32,710] INFO: - Started: 2024-01-01 23:00:04
- Started extracting data: 2024-01-01 23:06:12
- Finished: 2024-01-02 04:23:32
[2024-01-02 04:23:32,710] INFO: - Real duration: 5:17:19.884014 (excludes time waiting for other nodes)
[2024-01-02 04:23:32,815] INFO: - 66088 files, 1.65 TB
[2024-01-02 04:23:32,816] INFO: - 66088 files copied from host
[2024-01-02 04:23:32,918] INFO: Updated from existing status: 0 to new status: 1 for backup id: s3_backup_01_01_2024_23
[2024-01-02 04:23:35,403] INFO: Using credentials CensoredCredentials(access_key_id=None, secret_access_key=*****, region=us-east-1)
[2024-01-02 04:23:35,403] INFO: Connecting to s3 with args {}
[2024-01-02 04:24:50,479] INFO: Monitoring provider is noop
[2024-01-02 04:24:50,479] INFO: Starting purge
[2024-01-02 04:24:50,487] INFO: Using credentials CensoredCredentials(access_key_id=None, secret_access_key=*****, region=us-east-1)
[2024-01-02 04:24:50,488] INFO: Connecting to s3 with args {}
[2024-01-02 04:24:50,581] INFO: Listing backups for 153.22.11.25
[2024-01-02 04:24:53,486] INFO: 2 backups are candidate to be purged
[2024-01-02 04:24:53,486] INFO: Purging backup s3_backup_12_17_2023_23 from node 153.22.11.25...
[2024-01-02 04:24:53,725] INFO: Purging backup s3_backup_12_14_2023_23 from node 153.22.11.25...
[2024-01-02 04:24:54,003] INFO: Cleaning up orphaned files for 153.22.11.25...
[2024-01-02 04:46:11,195] INFO: Purged 27802 objects with a total size of 654.80 GB

@sandeepmallik
Author

@adejanovski Fix #691 worked. Tested it on 6 nodes.

@github-project-automation github-project-automation bot moved this from In Progress to Done in K8ssandra Jan 4, 2024
@adejanovski adejanovski added done Issues in the state 'done' and removed in-progress Issues in the state 'in-progress' labels Jan 4, 2024