fix(gossip): crds table contents are too small #69
Merged
… to epoch duration
Crds values are typically supposed to have a timeout of the epoch duration, unless the value is from a node with zero stake weight, in which case it should time out after 15 seconds. We don't track stake weight, so we can't do this conditional logic. The existing code compromises by using the 15-second timeout for all crds values. It's not practical to purge all crds values after 15 seconds; this causes the crds table to be emptied regularly. Instead, this change applies an epoch-long timeout to every value. This is not perfect because 0-stake nodes are overly trusted. For now we can live with the noise to ensure we get some meaningful signal. Later, we can apply more aggressive timeouts to filter out values received from nodes with 0 stake.
… a shred version. Also, always accept ContactInfo.
Got the idea from the solana labs client; see process_entrypoints.
…es in different threads
Slices in CrdsData are freed after processing the incoming message, which causes a segfault when the messages need to be serialized and sent to other peers. This change instead only "shallow" frees the protocol message, leaving pointers in the CrdsData intact. The CrdsData pointers are then freed only when removed from the CrdsTable. This is a quick and dirty solution for two reasons:
1. It is theoretically possible that a CrdsData could be copied from the CrdsTable (including its contained pointers), then the data could be removed, leading to the pointers being freed, and then the copied (now dangling) pointer could be dereferenced, resulting in a segfault. It does not appear that this happens in the code currently, but it would be very easy to accidentally introduce this bug. It would be preferable if these data structures ensured the pointer is never dangling.
2. It is an extra maintenance burden to be sure that shallowFree is properly implemented at all times. It would be very easy to overlook this method when making changes to the contained data structures.
dnut changed the title from "gossip crds table contents" to "gossip crds table contents are too small" on Feb 7, 2024
dnut changed the title from "gossip crds table contents are too small" to "fix(gossip): crds table contents are too small" on Feb 8, 2024
0xNineteen requested changes on Feb 20, 2024:
- "can we push this lock up in the top level of the fcn so we only have one lock/unlock?" #69 (comment)
- …pointsFromCrdsTable #69 (comment)
- "also include the source of the shred version for both the newly added log message and the existing log message for ip echo." #69 (comment)
0xNineteen requested changes on Feb 22, 2024
0xNineteen previously approved these changes on Feb 23, 2024
0xNineteen approved these changes on Feb 27, 2024
ultd approved these changes on Feb 29, 2024
fixes #70
This fixes a bug where Sig's crds table does not store as many crds values as it should be getting from the cluster. On devnet, solana's crds table holds about 7,000 crds values, while Sig's crds table never reached even 100 records in my testing. Sig's crds table only contained LegacyContactInfo and NodeInstance, missing the following value types, which are all present in solana's crds table: AccountsHashes, ContactInfo, DuplicateShred, EpochSlots, LegacySnapshotHashes, LowestSlot, SnapshotHashes, Version, and Vote.
I used the code in dnut/crds-hash-dump to dump the crds table contents to CSV files every ten seconds.
After fixing this bug, I encountered a segfault that occurs very soon after starting gossip. This is a pre-existing bug that does not manifest unless the crds table has all the expected items.
With this fix, Sig's crds table is populated with the same values as solana. Gradually over time, it accumulates some extra values due to its inability to filter by stake weight. This is unavoidable for now since we don't track stake weights.
Cause
Crds Table Discrepancy
I identified three causes for the crds table size discrepancy, each addressed under Fix below: crds values being evicted too early, the shred version filter, and the handling of ContactInfo values.
Segfault
The segfault was caused by a use-after-free bug.
When gossip messages are received by Sig, the deserialization process involves some allocations for dynamically sized slices. The deserialized message structs include pointers to this allocated memory. As soon as gossip handles the incoming messages, it immediately calls bincode.free on the protocol message.
The problem is that the protocol message being freed still contains a struct for the crds value, and some of these crds values contain pointers to allocated memory (specifically, ContactInfo and EpochSlots contain slices). Meanwhile, the same crds value has been inserted into the crds table with the same pointer. So after bincode.free is called, the crds table contains values with dangling pointers.
When sending these crds values to other nodes in a new protocol message, the crds value is serialized. The serialization fails with a segfault when attempting to dereference the dangling pointer.
This problem was never encountered until after the crds discrepancy was fixed because the only values in the crds table were LegacyContactInfo and NodeInstance. Neither of these values contain pointers to allocated memory.
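The shared-allocation pattern described above can be sketched as follows. This is an illustrative C sketch, not Sig's actual Zig code; the struct and function names are hypothetical. The key point is that inserting into the table copies the struct, including its raw pointer, so freeing the message's slice also invalidates the table's copy.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a deserialized crds value that owns a heap
 * slice produced during deserialization (like EpochSlots in Sig). */
typedef struct {
    char *slots;       /* heap-allocated slice */
    size_t slots_len;
} CrdsValue;

/* "Insert" copies the struct, including the raw pointer -- no deep copy.
 * The table entry and the message now share one allocation. */
CrdsValue table_insert(const CrdsValue *v) {
    return *v;
}

/* Simulates deserialization: allocates a slice and returns a value
 * pointing at it. */
CrdsValue deserialize_message(void) {
    CrdsValue v;
    v.slots_len = 5;
    v.slots = malloc(v.slots_len);
    memcpy(v.slots, "abcde", v.slots_len);
    return v;
}
```

After `table_insert(&msg)`, calling `free(msg.slots)` (the analogue of bincode.free on the protocol message) leaves the table copy's `slots` pointer dangling, and any later serialization of the table entry dereferences freed memory.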
Fix
Crds Table Discrepancy
Early Eviction
The solana labs client checks the stake weight of the node that a crds value came from. If the stake weight is 0, the crds value is pruned after 15 seconds. Otherwise, the crds value is pruned after the duration of one epoch.
Since we don't have stake weights yet, we have to choose a single prune timeout for all crds values. It's not practical to purge all crds values after 15 seconds. This causes the crds table to regularly be emptied. Instead, this change applies an epoch-long timeout to every value. This is not perfect because 0-stake nodes are overly trusted. For now, we can live with the noise since it's the only way to get a meaningful signal. Once we have access to stake weights, we can apply more aggressive timeouts to filter out values received from nodes with 0 stake.
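The stake-conditional timeout that the solana labs client applies (and that Sig cannot yet replicate) amounts to a simple branch. This is a minimal sketch; the 15-second constant comes from the description above, while the epoch length shown (roughly two days, as on mainnet) is an assumption used as a placeholder.

```c
#include <stdint.h>

/* 15-second timeout for values from unstaked nodes (per the solana labs
 * client's behavior described above). */
#define UNSTAKED_TIMEOUT_MS (15ULL * 1000)

/* Placeholder epoch duration (~2 days in ms); the real value would come
 * from cluster configuration. */
#define EPOCH_TIMEOUT_MS (2ULL * 24 * 3600 * 1000)

/* Pick a prune timeout for a crds value based on the origin node's stake.
 * Sig cannot look up stake yet, so it effectively always takes the
 * epoch-length branch. */
uint64_t crds_value_timeout_ms(uint64_t origin_stake) {
    if (origin_stake == 0)
        return UNSTAKED_TIMEOUT_MS;
    return EPOCH_TIMEOUT_MS;
}
```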
Shred Version Filter & ContactInfo
Determining the shred version
First, we needed a solution just to figure out what the shred version is.
Solana Labs Approach
Normally, a solana validator executes gossip in multiple stages. First it runs gossip for only 10-20 seconds, just to get introduced to the cluster and start downloading an account snapshot. After receiving the snapshot, it extracts the shred version from the snapshot.
The snapshot's shred version is validated against shred versions from two other sources. If the validations succeed, gossip is restarted using this as its shred version. If the shred version could not be extracted from the cluster, it falls back to using the shred version from an entrypoint's LegacyContactInfo before restarting gossip.
New Sig Approach
Getting the shred version from the account snapshot is not currently viable since sig's accounts-db implementation is not yet merged. Instead, I've implemented an IP echo request, and the shred version from that message is used as the authoritative shred version for gossip.
I also implemented the same fallback mechanism as solana. It runs in buildMessages. This means I added some logic to associate each entrypoint with a LegacyContactInfo. If Sig's shred version is 0, it looks through the entrypoints' LegacyContactInfo structs for a shred version and uses the first one it finds. A side effect of this change was a simplification to buildPullRequests, which was previously executing the same logic to associate entrypoints with their contact info.
I made GossipService.my_shred_version an atomic since it is read from and written to in different threads.
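The fallback described above can be sketched as follows. This is an illustrative C sketch, not Sig's Zig implementation; the function name and the example shred version are hypothetical. It shows the two properties that matter: the shared shred version is accessed atomically, and the fallback only runs while the version is still 0.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Shared across threads, hence atomic (analogous to
 * GossipService.my_shred_version). */
static _Atomic uint16_t my_shred_version = 0;

/* Called from the message-building loop: if we still have no shred
 * version, adopt the first nonzero one advertised by an entrypoint's
 * contact info. */
void maybe_adopt_entrypoint_shred_version(const uint16_t *entrypoint_versions,
                                          size_t n) {
    if (atomic_load(&my_shred_version) != 0)
        return; /* already determined, e.g. via the ip echo request */
    for (size_t i = 0; i < n; i++) {
        if (entrypoint_versions[i] != 0) {
            atomic_store(&my_shred_version, entrypoint_versions[i]);
            return;
        }
    }
}
```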
Properly applying the filter
Now that we have the correct shred version, the existing shred version filter works as expected in most cases, with two exceptions: ContactInfo values need to be accepted regardless of shred version, and values from entrypoints need to be accepted before their shred version is known.
Segfault
I came up with a quick-and-dirty solution to this problem. It is an improvement because it avoids the dangling pointer issue in the current code. It is not ideal because there is nothing about the structs that encourages safe usage. Seemingly correct usages of existing methods could easily cause another segfault.
First, we stop calling bincode.free after processing messages. Instead, we call message.shallowFree, a new method that only frees the messaging-related data that needs to be freed while processing messages. It does not free the crds data pointers, which will be needed later. To avoid a memory leak when crds values are removed from the crds table, bincode.free is called on those structs when they are removed.
This solution is problematic for two reasons:
1. It is theoretically possible for a CrdsData to be copied out of the CrdsTable (including its contained pointers) and then removed, freeing the pointers and leaving the copy dangling.
2. It is an extra maintenance burden to be sure that shallowFree is properly implemented, especially when message structs are updated.
Reference counting would be a safer approach: all crds values could be reference counted, and the pointer would only actually be freed when the last instance is freed. This would introduce significant complexity to the codebase. I decided it is not in scope to address this safety issue in the current PR, but it deserves further discussion.
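The reference-counting alternative mentioned above can be sketched as follows. This is an illustrative C sketch with hypothetical names, not a proposed Sig API: each slice carries a count, every holder (the message, the table, any copy) retains it, and the backing memory is freed only when the last holder releases it, so no release can leave another holder dangling.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted slice. Note: this sketch is not
 * thread-safe; a real implementation would use atomic counts. */
typedef struct {
    size_t refs;
    size_t len;
    unsigned char *data;
} RcSlice;

/* Allocate a slice with an initial reference count of 1. */
RcSlice *rc_slice_create(const unsigned char *src, size_t len) {
    RcSlice *s = malloc(sizeof *s);
    s->refs = 1;
    s->len = len;
    s->data = malloc(len);
    memcpy(s->data, src, len);
    return s;
}

/* Each additional holder (e.g. the crds table) takes a reference. */
RcSlice *rc_slice_retain(RcSlice *s) {
    s->refs++;
    return s;
}

/* Returns 1 when the backing memory was actually freed, 0 otherwise. */
int rc_slice_release(RcSlice *s) {
    if (--s->refs == 0) {
        free(s->data);
        free(s);
        return 1;
    }
    return 0;
}
```

With this scheme, freeing the protocol message releases only its own reference; the crds table's reference keeps the data alive until the value is removed from the table.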
Misc changes