
fix(gossip): crds table contents are too small #69

Merged
merged 18 commits into main from dnut/fix/crds-remove-labels on Feb 29, 2024

Conversation


@dnut dnut commented Jan 30, 2024

fixes #70

This fixes the bug where Sig's crds table does not store as many crds values as it should be receiving from the cluster. On devnet, solana's crds table holds about 7,000 crds values, while Sig's crds table never reached even 100 records in my testing. Sig's crds table only contained LegacyContactInfo and NodeInstance, missing the following values, which are all present in solana's crds table: AccountsHashes, ContactInfo, DuplicateShred, EpochSlots, LegacySnapshotHashes, LowestSlot, SnapshotHashes, Version, and Vote.

I used the code in dnut/crds-hash-dump to dump the crds table contents to CSV files every ten seconds.

After fixing this bug, I encountered a segfault that occurs very soon after starting gossip. This is a pre-existing bug that does not manifest unless the crds table has all the expected items.

With this fix, Sig's crds table is populated with the same values as solana's. Over time it gradually accumulates some extra values because it cannot filter by stake weight, which is unavoidable for now since we don't track stake weights.

Cause

Crds Table Discrepancy

I identified three causes for the crds table size discrepancy:

  1. Early eviction: All records were being evicted from the crds table after 15 seconds when most of them should be kept for a full epoch.
  2. Shred version filter: Records were not included unless they had a shred version of 0. Zero is not a valid shred version, so all crds values were being excluded (except for LegacyContactInfo and NodeInstance, which are exempt from the shred version filter).
  3. ContactInfo was not exempted from the shred version filter.

Segfault

The segfault was caused by a use-after-free bug.

When gossip messages are received by sig, the deserialization process involves some allocations for dynamically sized slices, and the deserialized message structs include pointers to this allocated memory. As soon as gossip has handled the incoming messages, it calls bincode.free on the protocol message.

The problem is that the protocol message being freed still contains a struct for the crds value, and some of these crds values contain pointers to allocated memory (specifically, ContactInfo and EpochSlots contain slices). Meanwhile, the same crds value has been inserted into the crds table with the same pointer. So after bincode.free is called, the crds table contains values with dangling pointers.

When sending these crds values to other nodes in a new protocol message, the crds value is serialized. The serialization fails with a segfault when it attempts to dereference the dangling pointer.

This problem was never encountered until the crds discrepancy was fixed, because the only values in the crds table were LegacyContactInfo and NodeInstance, neither of which contains pointers to allocated memory.
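
To make the ownership problem concrete, here is a minimal stand-alone sketch of the same sequence of events (the types, names, and values are invented for illustration and use Zig 0.11-era std APIs; sig's real structs are more involved):

```zig
const std = @import("std");

// Stand-in for a crds value that owns a heap slice, the way ContactInfo
// and EpochSlots do after deserialization.
const FakeCrdsValue = struct {
    addrs: []u8,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    // "Deserialization" allocates a slice that the value points into.
    const value = FakeCrdsValue{ .addrs = try allocator.dupe(u8, "127.0.0.1:8001") };

    // The value is inserted into the table, copying the struct but not the
    // slice it points to.
    var table = std.ArrayList(FakeCrdsValue).init(allocator);
    defer table.deinit();
    try table.append(value);

    // The incoming message is then freed in full, which also frees the slice
    // the table entry still points to.
    allocator.free(value.addrs);

    // Later, serializing the table entry for an outgoing message dereferences
    // the dangling slice: undefined behavior, observed as a segfault in sig.
    std.debug.print("{s}\n", .{table.items[0].addrs});
}
```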

Fix

Crds Table Discrepancy

Early Eviction

The solana labs client checks the stake weight of the node that a crds value came from. If the stake weight is 0, the crds value is pruned after 15 seconds. Otherwise, the crds value is pruned after the duration of one epoch.

Since we don't have stake weights yet, we have to choose a single prune timeout for all crds values. Purging everything after 15 seconds is not practical because it regularly empties the crds table. Instead, this change applies an epoch-long timeout to every value. This is not perfect because 0-stake nodes are overly trusted, but for now we can live with the noise since it's the only way to get a meaningful signal. Once we have access to stake weights, we can apply more aggressive timeouts to filter out values received from nodes with 0 stake.
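
As a rough sketch of the trade-off (the constant and function names are illustrative, not sig's actual code; the epoch figure assumes the default 432,000 slots at ~400ms per slot):

```zig
// Illustrative constants; the real values live in sig's gossip code.
const PRUNE_TIMEOUT_ZERO_STAKE_MS: u64 = 15_000;
const EPOCH_DURATION_MS: u64 = 432_000 * 400; // ~48 hours with default settings

/// What the labs client does per value, and the compromise this PR makes
/// while stake weights are unavailable.
fn crdsValueTimeout(stake_weight: ?u64) u64 {
    if (stake_weight) |stake| {
        // With stake weights: values from 0-stake nodes get the short timeout.
        return if (stake == 0) PRUNE_TIMEOUT_ZERO_STAKE_MS else EPOCH_DURATION_MS;
    }
    // Without stake weights (sig today): keep everything for an epoch.
    return EPOCH_DURATION_MS;
}
```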

Shred Version Filter & ContactInfo

Determining the shred version

First, we needed a solution just to figure out what the shred version is.

Solana Labs Approach

Normally, a solana validator executes gossip in multiple stages. First it runs gossip for only 10-20 seconds, just to get introduced to the cluster and start downloading an account snapshot. After receiving the snapshot, it extracts the shred version from the snapshot.

The snapshot's shred version is validated against shred versions from two other sources:

  • The expected shred version passed as a command-line argument, if one was provided.
  • The shred version returned by an IP echo request sent to an entrypoint.

If the validations succeed, gossip is restarted using this as its shred version. If no shred version could be extracted this way, it falls back to the shred version advertised in an entrypoint's LegacyContactInfo before restarting gossip.
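
Roughly, the decision described above has this shape (transliterated into Zig purely for illustration; the real logic lives in the Rust validator and differs in details such as error handling):

```zig
fn resolveShredVersion(
    snapshot_shred_version: ?u16,
    cli_expected_shred_version: ?u16,
    ip_echo_shred_version: ?u16,
    entrypoint_shred_version: ?u16,
) ?u16 {
    if (snapshot_shred_version) |from_snapshot| {
        // Validate the snapshot's shred version against the optional CLI
        // argument and the IP echo response; treat a mismatch as failure.
        if (cli_expected_shred_version) |expected| {
            if (expected != from_snapshot) return null;
        }
        if (ip_echo_shred_version) |echoed| {
            if (echoed != from_snapshot) return null;
        }
        return from_snapshot;
    }
    // No shred version could be extracted: fall back to the one advertised
    // in an entrypoint's LegacyContactInfo.
    return entrypoint_shred_version;
}
```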

New Sig Approach

Getting the shred version from the account snapshot is not currently viable since sig's accounts-db implementation is not yet merged. Instead, I've implemented an IP echo request, and the shred version from that message is used as the authoritative shred version for gossip.

I also implemented the same fallback mechanism as solana; it runs in buildMessages. To support it, I added some logic to associate each entrypoint with a LegacyContactInfo. If sig's shred version is 0, it looks through the entrypoints' LegacyContactInfo structs for a shred version and uses the first one it finds. A side effect of this change was a simplification to buildPullRequests, which was previously executing the same logic to associate entrypoints with their contact info.

I made GossipService.my_shred_version an atomic since it is read from and written to in different threads.
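
In sketch form, the fallback plus the atomic shred version looks something like this (type and function names are illustrative, not sig's actual API, and the atomic orderings use Zig 0.11-style capitalization):

```zig
const std = @import("std");

const LegacyContactInfo = struct { shred_version: u16 };

const Entrypoint = struct {
    addr: std.net.Address,
    info: ?LegacyContactInfo = null, // filled in once gossip hears from the entrypoint
};

/// Runs in the message-building loop: if we still have no shred version,
/// adopt the first nonzero one advertised by an entrypoint.
fn maybeAdoptEntrypointShredVersion(my_shred_version: *u16, entrypoints: []const Entrypoint) void {
    if (@atomicLoad(u16, my_shred_version, .SeqCst) != 0) return;
    for (entrypoints) |ep| {
        const info = ep.info orelse continue;
        if (info.shred_version != 0) {
            @atomicStore(u16, my_shred_version, info.shred_version, .SeqCst);
            return;
        }
    }
}
```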

Properly applying the filter

Now that we have the correct shred version, the existing shred version filter would work as expected in most cases, with two exceptions:

  • If no shred version could be identified (my_shred_version == 0), we need to allow any crds values to be persisted into the crds table. I added an exception to bypass the filter in this case.
  • ContactInfo needs to be able to circumvent the filter so that nodes can update their shred versions. I added ContactInfo to the logic that was already allowing LegacyContactInfo and NodeInstance to bypass the filter. A sketch of the combined check follows this list.
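
Here is that sketch (illustrative names only; sig's actual types and functions differ):

```zig
// Illustrative tagged union; sig's real CrdsData has more variants and payloads.
const CrdsData = union(enum) {
    LegacyContactInfo: void,
    ContactInfo: void,
    NodeInstance: void,
    Vote: void,
    // ...remaining variants elided
};

/// `from_shred_version` is the shred version advertised by the origin node.
fn passesShredVersionFilter(data: CrdsData, my_shred_version: u16, from_shred_version: u16) bool {
    // No authoritative shred version yet: accept everything.
    if (my_shred_version == 0) return true;
    // These types bypass the filter so nodes can learn and update shred versions.
    switch (data) {
        .LegacyContactInfo, .ContactInfo, .NodeInstance => return true,
        else => {},
    }
    return from_shred_version == my_shred_version;
}
```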

Segfault

I came up with a quick-and-dirty solution to this problem. It is an improvement because it avoids the dangling pointer issue in the current code, but it is not ideal because nothing about the structs encourages safe usage: seemingly correct uses of existing methods could easily cause another segfault.

First, we stop calling bincode.free after processing messages. Instead, we call message.shallowFree, which is a new method that only frees the messaging-related data that needs to be freed while processing messages. It does not free the crds data pointers which will be needed later.

To avoid a memory leak when crds values are removed from the crds table, bincode.free is called on those structs when they are removed from the crds table.
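
To show the intended ownership split, here is a simplified stand-in for the message and value types (not sig's actual structs):

```zig
const std = @import("std");

// Each value's own heap data (e.g. a ContactInfo's addresses) is owned by
// the crds table entry once the value has been inserted.
const CrdsValue = struct {
    addrs: []u8,

    /// Called when the value is evicted from the crds table.
    fn deinit(self: *CrdsValue, allocator: std.mem.Allocator) void {
        allocator.free(self.addrs);
    }
};

// The outer message owns only the slice of values it carried.
const PullResponse = struct {
    from: [32]u8,
    values: []CrdsValue,

    /// Free only the message-level allocation; leave each value's inner
    /// pointers alive, because they now also live inside the crds table.
    fn shallowFree(self: *PullResponse, allocator: std.mem.Allocator) void {
        allocator.free(self.values);
    }
};
```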

This solution is problematic for two reasons:

  • It may not be obvious that you need to update shallowFree when message structs are updated.
  • Crds values may be extracted from the crds table and used anywhere in the codebase, and the crds table's RwLock even suggests that this kind of operation is thread safe. But those crds values may suddenly contain a dangling pointer once they are removed from the crds table, at which point they are no longer safe to use in other threads.

Reference counting would be a safer approach: every crds value could be reference counted, and the pointer would only actually be freed when the last instance is freed. However, this would introduce significant complexity to the codebase. I decided that addressing this safety issue is out of scope for the current PR, but it deserves further discussion.
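
For the sake of that discussion, here is one minimal shape such a wrapper could take (purely a sketch, not part of this PR; the names are invented and a real design would need to cover every crds variant):

```zig
const std = @import("std");

// A reference-counted wrapper around the heap data a crds value owns.
// Assumes the wrapper itself is heap-allocated with `allocator.create`.
const RcCrdsValue = struct {
    refs: usize = 1,
    addrs: []u8, // stand-in for the slices a ContactInfo/EpochSlots owns
    allocator: std.mem.Allocator,

    /// Call when handing the value to another holder (e.g. copying it out
    /// of the crds table or embedding it in an outgoing message).
    pub fn acquire(self: *RcCrdsValue) *RcCrdsValue {
        _ = @atomicRmw(usize, &self.refs, .Add, 1, .SeqCst);
        return self;
    }

    /// The heap data is freed only when the last holder releases it, so a
    /// value handed out by the crds table can never contain a dangling pointer.
    pub fn release(self: *RcCrdsValue) void {
        if (@atomicRmw(usize, &self.refs, .Sub, 1, .SeqCst) == 1) {
            self.allocator.free(self.addrs);
            self.allocator.destroy(self);
        }
    }
};
```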

Misc changes

  • change the entrypoint log message on startup to actually show the IP addresses of the entrypoints. Previously it just showed some unhelpful struct information.
  • remove some unnecessary newlines from log messages

dnut added 9 commits January 29, 2024 14:42
… to epoch duration

crds values are typically supposed to have a timeout of the epoch duration, unless the value is from a node with zero stake weight, in which case it should time out after 15 seconds

we don't track stake weight, so we can't do this conditional logic. the existing code compromises by using the 15 second timeout for all crds values.

it's not practical to purge all crds values after 15 seconds. this causes the crds table to regularly be emptied.

instead, this change applies an epoch-long timeout to every value. this is not perfect because 0-stake nodes are overly trusted. for now we can live with the noise to ensure we get some meaningful signal. later, we can apply more aggressive timeouts to filter out values received from nodes with 0 stake.
… a shred version. also always accept ContactInfo
got the idea from the solana labs client. see process_entrypoints
slices in CrdsData are freed after processing the incoming message, which causes a segfault when the messages need to be serialized and sent to other peers. this change instead only "shallow" frees the protocol message, leaving pointers in the CrdsData intact. The CrdsData pointers are then freed only when removed from the CrdsTable

this is a quick and dirty solution for two reasons:

1. it is theoretically possible that a CrdsData could be copied from the CrdsTable (including its contained pointers). then the data could be removed, leading to the pointers being freed. then the copied version of the pointer (now dangling) could attempt a dereference, resulting in a segfault. it does not appear that this happens in the code currently but it would be very easy to accidentally add this bug. it would be preferable if these data structures ensured the pointer is never dangling
2. it is extra maintenance burden to be sure that shallowFree is properly implemented at all times. it would be very easy to overlook this method when making changes to the contained data structures.
@dnut dnut changed the title from "gossip crds table contents" to "gossip crds table contents are too small" on Feb 7, 2024
@dnut dnut marked this pull request as ready for review February 8, 2024 17:06
@dnut dnut changed the title from "gossip crds table contents are too small" to "fix(gossip): crds table contents are too small" on Feb 8, 2024
dnut added 5 commits February 21, 2024 14:54
"can we push this lock up in the top level of the fcn so we only have one lock/unlock?"

#69 (comment)
also include the source of the shred version for both the newly added log message and the existing log message for ip echo.

#69 (comment)
@dnut dnut requested a review from 0xNineteen February 22, 2024 02:14
@dnut dnut requested a review from 0xNineteen February 23, 2024 00:54
0xNineteen
0xNineteen previously approved these changes Feb 23, 2024
@ultd ultd merged commit 53d6048 into main Feb 29, 2024
2 checks passed
@ultd ultd deleted the dnut/fix/crds-remove-labels branch February 29, 2024 15:37