
fix(gossip): crds table contents are too small #69

Merged
merged 18 commits into main from dnut/fix/crds-remove-labels on Feb 29, 2024

Conversation


@dnut dnut commented Jan 30, 2024

fixes #70

This fixes the bug where Sig's crds table does not store as many crds values as it should be receiving from the cluster. On devnet, solana's crds table holds about 7,000 crds values, while Sig's crds table never reached even 100 records in my testing. Sig's crds table only contained LegacyContactInfo and NodeInstance, missing the following values, which are all present in solana's crds table: AccountsHashes, ContactInfo, DuplicateShred, EpochSlots, LegacySnapshotHashes, LowestSlot, SnapshotHashes, Version, and Vote.

I used the code in dnut/crds-hash-dump to dump the crds table contents to CSV files every ten seconds.

After fixing this bug, I encountered a segfault that occurs very soon after starting gossip. This is a pre-existing bug that does not manifest unless the crds table has all the expected items.

With this fix, Sig's crds table is populated with the same values as solana's. Over time it gradually accumulates some extra values because it cannot filter by stake weight, which is unavoidable for now since we don't track stake weights.

Cause

Crds Table Discrepancy

I identified three causes for the crds table size discrepancy:

  1. Early eviction: All records were being evicted from the crds table after 15 seconds when most of them should be kept for a full epoch.
  2. Shred version filter: Records were not included unless they had a shred version of 0. Zero is not a valid shred version, so all crds values were being excluded (except for LegacyContactInfo and NodeInstance, which are exempt from the shred version filter).
  3. ContactInfo was not exempted from the shred version filter.

Segfault

The segfault was caused by a use-after-free bug.

When gossip messages are received by sig, the deserialization process involves some allocations for dynamically sized slices, and the deserialized message structs include pointers to this allocated memory. As soon as gossip has handled the incoming messages, it calls bincode.free on the protocol message.

The problem is that the protocol message being freed still contains a struct for the crds value, and some of these crds values contain pointers to allocated memory (specifically, ContactInfo and EpochSlots contain slices). Meanwhile, the same crds value has been inserted into the crds table with the same pointer. So after bincode.free is called, the crds table contains values with dangling pointers.

When sending these crds values to other nodes in a new protocol message, the crds value is serialized. The serialization fails with a segfault when it attempts to dereference the dangling pointer.

This problem was never encountered until the crds discrepancy was fixed, because the only values in the crds table were LegacyContactInfo and NodeInstance, neither of which contains pointers to allocated memory.
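
To make the ownership problem concrete, here is a minimal stand-alone sketch of the same sequence of events (the types, names, and values are invented for illustration and use Zig 0.11-era std APIs; sig's real structs are more involved):

```zig
const std = @import("std");

// Stand-in for a crds value that owns a heap slice, the way ContactInfo
// and EpochSlots do after deserialization.
const FakeCrdsValue = struct {
    addrs: []u8,
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();

    // "Deserialization" allocates a slice that the value points into.
    const value = FakeCrdsValue{ .addrs = try allocator.dupe(u8, "127.0.0.1:8001") };

    // The value is inserted into the table, copying the struct but not the
    // slice it points to.
    var table = std.ArrayList(FakeCrdsValue).init(allocator);
    defer table.deinit();
    try table.append(value);

    // The incoming message is then freed in full, which also frees the slice
    // the table entry still points to.
    allocator.free(value.addrs);

    // Later, serializing the table entry for an outgoing message dereferences
    // the dangling slice: undefined behavior, observed as a segfault in sig.
    std.debug.print("{s}\n", .{table.items[0].addrs});
}
```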

Fix

Crds Table Discrepancy

Early Eviction

The solana labs client checks the stake weight of the node that a crds value came from. If the stake weight is 0, the crds value is pruned after 15 seconds. Otherwise, the crds value is pruned after the duration of one epoch.

Since we don't have stake weights yet, we have to choose a single prune timeout for all crds values. Purging everything after 15 seconds is not practical because it regularly empties the crds table. Instead, this change applies an epoch-long timeout to every value. This is not perfect because 0-stake nodes are overly trusted, but for now we can live with the noise since it's the only way to get a meaningful signal. Once we have access to stake weights, we can apply more aggressive timeouts to filter out values received from nodes with 0 stake.
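
As a rough sketch of the trade-off (the constant and function names are illustrative, not sig's actual code; the epoch figure assumes the default 432,000 slots at ~400ms per slot):

```zig
// Illustrative constants; the real values live in sig's gossip code.
const PRUNE_TIMEOUT_ZERO_STAKE_MS: u64 = 15_000;
const EPOCH_DURATION_MS: u64 = 432_000 * 400; // ~48 hours with default settings

/// What the labs client does per value, and the compromise this PR makes
/// while stake weights are unavailable.
fn crdsValueTimeout(stake_weight: ?u64) u64 {
    if (stake_weight) |stake| {
        // With stake weights: values from 0-stake nodes get the short timeout.
        return if (stake == 0) PRUNE_TIMEOUT_ZERO_STAKE_MS else EPOCH_DURATION_MS;
    }
    // Without stake weights (sig today): keep everything for an epoch.
    return EPOCH_DURATION_MS;
}
```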

Shred Version Filter & ContactInfo

Determining the shred version

First, we needed a solution just to figure out what the shred version is.

Solana Labs Approach

Normally, a solana validator executes gossip in multiple stages. First it runs gossip for only 10-20 seconds, just to get introduced to the cluster and start downloading an account snapshot. After receiving the snapshot, it extracts the shred version from the snapshot.

The snapshot's shred version is validated against shred versions from two other sources:

  • The expected shred version passed as a command-line argument, if one was provided.
  • The shred version returned by an IP echo request sent to an entrypoint.

If the validations succeed, gossip is restarted using this as its shred version. If no shred version could be extracted this way, it falls back to the shred version advertised in an entrypoint's LegacyContactInfo before restarting gossip.
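
Roughly, the decision described above has this shape (transliterated into Zig purely for illustration; the real logic lives in the Rust validator and differs in details such as error handling):

```zig
fn resolveShredVersion(
    snapshot_shred_version: ?u16,
    cli_expected_shred_version: ?u16,
    ip_echo_shred_version: ?u16,
    entrypoint_shred_version: ?u16,
) ?u16 {
    if (snapshot_shred_version) |from_snapshot| {
        // Validate the snapshot's shred version against the optional CLI
        // argument and the IP echo response; treat a mismatch as failure.
        if (cli_expected_shred_version) |expected| {
            if (expected != from_snapshot) return null;
        }
        if (ip_echo_shred_version) |echoed| {
            if (echoed != from_snapshot) return null;
        }
        return from_snapshot;
    }
    // No shred version could be extracted: fall back to the one advertised
    // in an entrypoint's LegacyContactInfo.
    return entrypoint_shred_version;
}
```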

New Sig Approach

Getting the shred version from the account snapshot is not currently viable since sig's accounts-db implementation is not yet merged. Instead, I've implemented an IP echo request, and the shred version from that message is used as the authoritative shred version for gossip.

I also implemented the same fallback mechanism as solana; it runs in buildMessages. To support it, I added some logic to associate each entrypoint with a LegacyContactInfo. If sig's shred version is 0, it looks through the entrypoints' LegacyContactInfo structs for a shred version and uses the first one it finds. A side effect of this change was a simplification to buildPullRequests, which was previously executing the same logic to associate entrypoints with their contact info.

I made GossipService.my_shred_version an atomic since it is read from and written to in different threads.
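
In sketch form, the fallback plus the atomic shred version looks something like this (type and function names are illustrative, not sig's actual API, and the atomic orderings use Zig 0.11-style capitalization):

```zig
const std = @import("std");

const LegacyContactInfo = struct { shred_version: u16 };

const Entrypoint = struct {
    addr: std.net.Address,
    info: ?LegacyContactInfo = null, // filled in once gossip hears from the entrypoint
};

/// Runs in the message-building loop: if we still have no shred version,
/// adopt the first nonzero one advertised by an entrypoint.
fn maybeAdoptEntrypointShredVersion(my_shred_version: *u16, entrypoints: []const Entrypoint) void {
    if (@atomicLoad(u16, my_shred_version, .SeqCst) != 0) return;
    for (entrypoints) |ep| {
        const info = ep.info orelse continue;
        if (info.shred_version != 0) {
            @atomicStore(u16, my_shred_version, info.shred_version, .SeqCst);
            return;
        }
    }
}
```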

Properly applying the filter

Now that we have the correct shred version, the existing shred version filter would work as expected in most cases, with two exceptions:

  • If no shred version could be identified (my_shred_version == 0), we need to allow any crds values to be persisted into the crds table. I added an exception to bypass the filter in this case.
  • ContactInfo needs to be able to circumvent the filter so that nodes can update their shred versions. I added ContactInfo to the logic that was already allowing LegacyContactInfo and NodeInstance to bypass the filter. A sketch of the combined check follows this list.
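
Here is that sketch (illustrative names only; sig's actual types and functions differ):

```zig
// Illustrative tagged union; sig's real CrdsData has more variants and payloads.
const CrdsData = union(enum) {
    LegacyContactInfo: void,
    ContactInfo: void,
    NodeInstance: void,
    Vote: void,
    // ...remaining variants elided
};

/// `from_shred_version` is the shred version advertised by the origin node.
fn passesShredVersionFilter(data: CrdsData, my_shred_version: u16, from_shred_version: u16) bool {
    // No authoritative shred version yet: accept everything.
    if (my_shred_version == 0) return true;
    // These types bypass the filter so nodes can learn and update shred versions.
    switch (data) {
        .LegacyContactInfo, .ContactInfo, .NodeInstance => return true,
        else => {},
    }
    return from_shred_version == my_shred_version;
}
```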

Segfault

I came up with a quick-and-dirty solution to this problem. It is an improvement because it avoids the dangling pointer issue in the current code, but it is not ideal because nothing about the structs encourages safe usage: seemingly correct uses of existing methods could easily cause another segfault.

First, we stop calling bincode.free after processing messages. Instead, we call message.shallowFree, which is a new method that only frees the messaging-related data that needs to be freed while processing messages. It does not free the crds data pointers which will be needed later.

To avoid a memory leak when crds values are removed from the crds table, bincode.free is called on those structs when they are removed from the crds table.
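
To show the intended ownership split, here is a simplified stand-in for the message and value types (not sig's actual structs):

```zig
const std = @import("std");

// Each value's own heap data (e.g. a ContactInfo's addresses) is owned by
// the crds table entry once the value has been inserted.
const CrdsValue = struct {
    addrs: []u8,

    /// Called when the value is evicted from the crds table.
    fn deinit(self: *CrdsValue, allocator: std.mem.Allocator) void {
        allocator.free(self.addrs);
    }
};

// The outer message owns only the slice of values it carried.
const PullResponse = struct {
    from: [32]u8,
    values: []CrdsValue,

    /// Free only the message-level allocation; leave each value's inner
    /// pointers alive, because they now also live inside the crds table.
    fn shallowFree(self: *PullResponse, allocator: std.mem.Allocator) void {
        allocator.free(self.values);
    }
};
```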

This solution is problematic for two reasons:

  • It may not be obvious that you need to update shallowFree when message structs are updated.
  • Crds values may be extracted from the crds table and used anywhere in the codebase, and the crds table's RwLock even suggests that this kind of operation is thread safe. But those crds values may suddenly contain a dangling pointer once they are removed from the crds table, at which point they are no longer safe to use in other threads.

Reference counting would be a safer approach: every crds value could be reference counted, and the pointer would only actually be freed when the last instance is freed. However, this would introduce significant complexity to the codebase. I decided that addressing this safety issue is out of scope for the current PR, but it deserves further discussion.
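
For the sake of that discussion, here is one minimal shape such a wrapper could take (purely a sketch, not part of this PR; the names are invented and a real design would need to cover every crds variant):

```zig
const std = @import("std");

// A reference-counted wrapper around the heap data a crds value owns.
// Assumes the wrapper itself is heap-allocated with `allocator.create`.
const RcCrdsValue = struct {
    refs: usize = 1,
    addrs: []u8, // stand-in for the slices a ContactInfo/EpochSlots owns
    allocator: std.mem.Allocator,

    /// Call when handing the value to another holder (e.g. copying it out
    /// of the crds table or embedding it in an outgoing message).
    pub fn acquire(self: *RcCrdsValue) *RcCrdsValue {
        _ = @atomicRmw(usize, &self.refs, .Add, 1, .SeqCst);
        return self;
    }

    /// The heap data is freed only when the last holder releases it, so a
    /// value handed out by the crds table can never contain a dangling pointer.
    pub fn release(self: *RcCrdsValue) void {
        if (@atomicRmw(usize, &self.refs, .Sub, 1, .SeqCst) == 1) {
            self.allocator.free(self.addrs);
            self.allocator.destroy(self);
        }
    }
};
```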

Misc changes

  • change the entrypoint log message on startup to actually show the IP addresses of the entrypoints. Previously it just showed some unhelpful struct information.
  • remove some unnecessary newlines from log messages

dnut added 9 commits January 29, 2024 14:42
… to epoch duration

crds values are typically supposed to have a timeout of the epoch duration, unless the value is from a node with zero stake weight, in which case it should time out after 15 seconds

we don't track stake weight, so we can't do this conditional logic. the existing code compromises by using the 15 second timeout for all crds values.

it's not practical to purge all crds values after 15 seconds. this causes the crds table to regularly be emptied.

instead, this change applies an epoch-long timeout to every value. this is not perfect because 0-stake nodes are overly trusted. for now we can live with the noise to ensure we get some meaningful signal. later, we can apply more aggressive timeouts to filter out values received from nodes with 0 stake.
… a shred version. also always accept ContactInfo
got the idea from the solana labs client. see process_entrypoints
slices in CrdsData are freed after processing the incoming message, which causes a segfault when the messages need to be serialized and sent to other peers. this change instead only "shallow" frees the protocol message, leaving pointers in the CrdsData intact. The CrdsData pointers are then freed only when removed from the CrdsTable

this is a quick and dirty solution for two reasons:

1. it is theoretically possible that a CrdsData could be copied from the CrdsTable (including its contained pointers). then the data could be removed, leading to the pointers being freed. then the copied version of the pointer (now dangling) could attempt a dereference, resulting in a segfault. it does not appear that this happens in the code currently but it would be very easy to accidentally add this bug. it would be preferable if these data structures ensured the pointer is never dangling
2. it is extra maintenance burden to be sure that shallowFree is properly implemented at all times. it would be very easy to overlook this method when making changes to the contained data structures.
@dnut dnut changed the title from "gossip crds table contents" to "gossip crds table contents are too small" on Feb 7, 2024
@dnut dnut marked this pull request as ready for review February 8, 2024 17:06
@dnut dnut changed the title from "gossip crds table contents are too small" to "fix(gossip): crds table contents are too small" on Feb 8, 2024
dnut added 5 commits February 21, 2024 14:54
"can we push this lock up in the top level of the fcn so we only have one lock/unlock?"

#69 (comment)
also include the source of the shred version for both the newly added log message and the existing log message for ip echo.

#69 (comment)
@dnut dnut requested a review from 0xNineteen February 22, 2024 02:14
@dnut dnut requested a review from 0xNineteen February 23, 2024 00:54
0xNineteen
0xNineteen previously approved these changes Feb 23, 2024
@ultd ultd merged commit 53d6048 into main Feb 29, 2024
2 checks passed
@ultd ultd deleted the dnut/fix/crds-remove-labels branch February 29, 2024 15:37