
fix(shred-network): keep up with turbine #493

Open · wants to merge 9 commits into main from dnut/fix/shred-network/keepup
Conversation

@dnut (Contributor) commented Jan 14, 2025

⚠️ Note: most of the lines added in this PR are for the Grafana dashboard

This change is primarily a bugfix for an issue where the shred collector would consistently get stuck at a certain slot and stop progressing.

Problem

What this means is that the shred tracker would think it hadn't received all the shreds for a certain slot, and it would keep trying to repair that slot. Eventually it would lag so far behind that it would stop tracking any slots further ahead, because the shred tracker only keeps track of 1024 slots at a time. So the repair service would degrade into a state with poor performance, since it was looking at a ton of slots unnecessarily, and it would lose the ability to repair slots beyond a certain point. In Grafana, our metrics would make it look like the shred collector had halted.
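
For intuition, here is a minimal hypothetical sketch of that failure mode (illustrative only, not the actual BasicShredTracker API): the tracker only has room for a fixed window of slots above its bottom slot, so a slot that never completes eventually blocks it from following anything new.

const std = @import("std");

/// Hypothetical fixed-window tracker, for illustration only.
const WindowTracker = struct {
    /// Lowest slot still being tracked; only advances once that slot completes.
    bottom: u64,
    /// Completion state for slots bottom .. bottom + 1024.
    complete: [1024]bool = [_]bool{false} ** 1024,

    /// Fails with SlotOverflow when the slot is beyond the window,
    /// i.e. the tracker can no longer follow new slots.
    fn registerSlot(self: *WindowTracker, slot: u64) error{SlotOverflow}!void {
        if (slot < self.bottom) return; // already below the window
        if (slot >= self.bottom + self.complete.len) return error.SlotOverflow;
        self.complete[slot - self.bottom] = true;
    }
};

test "a slot that never completes eventually blocks tracking of new slots" {
    var tracker = WindowTracker{ .bottom = 100 };
    try tracker.registerSlot(101); // fine, inside the window
    // Slot 100 never completes, so bottom never advances, and the newest
    // slots from turbine eventually fall outside the window:
    try std.testing.expectError(error.SlotOverflow, tracker.registerSlot(100 + 1024));
}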

The general problem was that slots were skipped, but the shred tracker would not understand that they were skipped, so it would keep trying to repair them. There are a lot of complicated scenarios where slots can be skipped, and the repair service was not doing a good job of acquiring overall context about these scenarios.

Fix

The fix was to make the repair service more persistent about requesting high-level information about each slot. It uses HighestShred repairs to get general context about slots, and there were some situations where it wouldn't send those requests. For example, if it had already sent out a lot of Shred requests asking for specific shreds, it might stop sending HighestShred repairs for slots that came after them (which may have skipped them). It also did not send HighestShred repairs if it already thought a slot was complete. I changed it to always send HighestShred repairs for every slot it is aware of, rather than making assumptions about those slots or stopping due to reaching a limit.
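
Roughly, the new behavior amounts to the following hedged sketch (hypothetical names such as RepairRequest, MissingShred, and buildRepairRequests; not the actual RepairService code):

const std = @import("std");

const MissingShred = struct { slot: u64, index: u64 };

const RepairRequest = union(enum) {
    /// Request one specific missing shred.
    shred: MissingShred,
    /// Request the highest shred of a slot, which tells us how many shreds
    /// the slot has, or that the slot was skipped.
    highest_shred: u64,
};

/// Hypothetical sketch: queue one HighestShred request for every slot the
/// tracker knows about, plus Shred requests for specific missing indexes.
fn buildRepairRequests(
    tracked_slots: []const u64,
    missing: []const MissingShred,
    out: *std.ArrayList(RepairRequest),
) !void {
    for (tracked_slots) |slot| {
        // Always sent now, even if the slot currently looks complete and even
        // if many Shred requests were already queued for earlier slots.
        try out.append(.{ .highest_shred = slot });
    }
    for (missing) |m| {
        try out.append(.{ .shred = m });
    }
}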

Test

With these changes, the behavior is fixed. I no longer see the shred tracker getting stuck. It is now able to catch up from behind and remain caught up. I've tested this dozens of times starting from 1000 slots behind, and it catches up reliably within a minute or so and stays caught up. After about an hour it runs out of memory due to a leak, which is a separate issue that I'll fix in another PR.

Other changes

I made some other changes that are mainly for performance and simplification of the code.

In a way, I actually reduced the eagerness of sending HighestShred repairs for future slots. There's a difference between sending requests for slots we know about and are behind on (what I described increasing above) and sending requests for slots that don't even appear to exist yet (this is what I decreased). It's not necessary to constantly send hundreds of requests per second for future slots that haven't happened yet. It's good enough to send a few probing requests to see if we're behind. Otherwise, turbine will let us know where we are, and we can fill in the gaps with the logic described above.
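
As an illustration of the reduced eagerness, a hypothetical helper might probe only a handful of slots past the highest slot we have seen (names and offsets here are illustrative, not the actual code):

/// Hypothetical sketch: probe a few slots past the highest known slot instead
/// of requesting hundreds of future slots that may not exist yet.
fn futureProbeSlots(highest_known_slot: u64) [3]u64 {
    return .{
        highest_known_slot + 1,
        highest_known_slot + 2,
        highest_known_slot + 25, // one farther look-ahead probe
    };
}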

I made the repair service a bit more dynamic in its resource usage. When catching up from far behind, it needs to scale up the number of threads it uses to sign and serialize tens of thousands of repair requests. Once it's caught up, it can work with just a single thread.
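
A minimal sketch of that kind of scaling policy (hypothetical function and constants, not the actual thread-pool code):

/// Hypothetical sketch: derive the number of signing/serializing threads from
/// the size of the pending request batch, dropping back to one thread once
/// the node is caught up and batches are small.
fn threadCountForBatch(pending_requests: usize, max_threads: usize) usize {
    const requests_per_thread = 1000; // illustrative constant
    const wanted = pending_requests / requests_per_thread + 1;
    return @min(wanted, max_threads);
}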

I made some changes to the way that shreds are registered with the tracker by moving this into the ShredInserter. This unlocked the ability to also register recovered shreds, so we don't attempt to repair shreds that have already been recovered with Reed-Solomon. Eventually we can probably remove BasicShredTracker, since its role is partly handled by the Index in the blockstore. But for now the shred tracker is working fine.
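
A hedged sketch of the registration flow (hypothetical types and signatures, not the actual ShredInserter): both freshly inserted and Reed-Solomon-recovered data shreds are reported to the tracker, so recovered shreds never show up as missing to the repair service.

/// Hypothetical tracker interface, just for the sketch.
const Tracker = struct {
    fn registerDataShred(self: *Tracker, slot: u64, index: u64) void {
        _ = self;
        _ = slot;
        _ = index;
    }
};

const Shred = struct { slot: u64, index: u64 };

/// Hypothetical sketch: the inserter registers every data shred it stores,
/// whether it arrived over turbine/repair or was recovered with Reed-Solomon.
fn insertAndTrack(tracker: ?*Tracker, received: []const Shred, recovered: []const Shred) void {
    if (tracker) |t| {
        for (received) |s| t.registerDataShred(s.slot, s.index);
        for (recovered) |s| t.registerDataShred(s.slot, s.index);
    }
}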

@dnut dnut requested a review from yewman January 15, 2025 17:14
@yewman (Contributor) left a comment

Just a couple of questions, otherwise LGTM.

@@ -189,6 +191,15 @@ pub const ShredInserter = struct {
    const shred_source: ShredSource = if (is_repair) .repaired else .turbine;
    switch (shred) {
        .data => |data_shred| {
            if (shred_tracker) |tracker| {
                tracker.registerDataShred(&shred.data, milli_timestamp) catch |err|
                    switch (err) {
yewman (Contributor):

Maybe the switch should go in a block here and on line 290; the indentation looks a bit strange imo
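
For illustration, the two styles being compared look roughly like this (generic function and error names, not the actual code):

const ExampleError = error{ Recoverable, Fatal };

fn doInsert() ExampleError!void {
    return error.Recoverable;
}

// current style: the switch hangs directly off `catch |err|`
fn handleInline() ExampleError!void {
    doInsert() catch |err|
        switch (err) {
            error.Recoverable => {}, // ignore and continue
            else => return err,
        };
}

// suggested style: wrap the switch in a block so the indentation stays regular
fn handleInBlock() ExampleError!void {
    doInsert() catch |err| {
        switch (err) {
            error.Recoverable => {}, // ignore and continue
            else => return err,
        }
    };
}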

dnut (Contributor Author) replied.


fn maskBit(index: usize) ShredSet.MaskInt {
return @as(ShredSet.MaskInt, 1) << @as(ShredSet.ShiftInt, @truncate(index));
}
yewman (Contributor):

Suggested change
}
}

dnut (Contributor Author) replied.

}

/// returns whether it makes sense to send any repair requests
pub fn identifyMissing(
self: *Self,
slot_reports: *MultiSlotReport,
milli_timestamp: i64,
yewman (Contributor):

Thoughts on using sig time types?

dnut (Contributor Author) replied.

@dnut dnut requested a review from yewman January 22, 2025 16:30
@dnut dnut force-pushed the dnut/fix/shred-network/keepup branch from c4178ea to c258bfa Compare January 22, 2025 18:06
dnut added 9 commits January 22, 2025 13:06
…e true:

- we're caught at least 1024 slots behind or turbine isn't working
- the current bottom slot is skipped
- the next slot is skipped
- the slot 25 slots ahead is also skipped

with jitter, we only run into a problem if there are also 40 consecutive slots that are skipped, which is practically impossible
@dnut dnut force-pushed the dnut/fix/shred-network/keepup branch from c258bfa to 20934fe Compare January 22, 2025 18:06