
fix(shred-network): keep up with turbine #493

Open · wants to merge 9 commits into main from dnut/fix/shred-network/keepup
Conversation

@dnut (Contributor) commented Jan 14, 2025

⚠️ Note: most of the lines added in this PR are for the Grafana dashboard

This change is primarily a bugfix for an issue where the shred collector would consistently get stuck at a certain slot and stop progressing.

Problem

What this means is that the shred tracker would think it hadn't received all the shreds for a certain slot, and it would keep trying to repair that slot. Eventually it would lag so far behind that it would stop tracking any slots further ahead, because the shred tracker only keeps track of 1024 slots at a time. So the repair service would degrade into a state with poor performance, since it was looking at a ton of slots unnecessarily, and it would lose the ability to repair slots beyond a certain point. In Grafana, our metrics would make it look like the shred collector had halted.
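
For intuition, here is a minimal hypothetical sketch of that failure mode (illustrative only, not the actual BasicShredTracker API): the tracker only has room for a fixed window of slots above its bottom slot, so a slot that never completes eventually blocks it from following anything new.

const std = @import("std");

/// Hypothetical fixed-window tracker, for illustration only.
const WindowTracker = struct {
    /// Lowest slot still being tracked; only advances once that slot completes.
    bottom: u64,
    /// Completion state for slots bottom .. bottom + 1024.
    complete: [1024]bool = [_]bool{false} ** 1024,

    /// Fails with SlotOverflow when the slot is beyond the window,
    /// i.e. the tracker can no longer follow new slots.
    fn registerSlot(self: *WindowTracker, slot: u64) error{SlotOverflow}!void {
        if (slot < self.bottom) return; // already below the window
        if (slot >= self.bottom + self.complete.len) return error.SlotOverflow;
        self.complete[slot - self.bottom] = true;
    }
};

test "a slot that never completes eventually blocks tracking of new slots" {
    var tracker = WindowTracker{ .bottom = 100 };
    try tracker.registerSlot(101); // fine, inside the window
    // Slot 100 never completes, so bottom never advances, and the newest
    // slots from turbine eventually fall outside the window:
    try std.testing.expectError(error.SlotOverflow, tracker.registerSlot(100 + 1024));
}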

The general problem was that slots were skipped, but the shred tracker would not understand that they were skipped, so it would keep trying to repair them. There are a lot of complicated scenarios where slots can be skipped, and the repair service was not doing a good job of acquiring overall context about these scenarios.

Fix

The fix was to make the repair service more persistent about requesting high-level information about each slot. It uses HighestShred repairs to get general context about slots, and there were some situations where it wouldn't send those requests. For example, if it had already sent out a lot of Shred requests asking for specific shreds, it might stop sending HighestShred repairs for slots that came after them (which may have skipped them). It also did not send HighestShred repairs if it already thought a slot was complete. I changed it to always send HighestShred repairs for every slot it is aware of, rather than making assumptions about those slots or stopping due to reaching a limit.
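
Roughly, the new behavior amounts to the following hedged sketch (hypothetical names such as RepairRequest, MissingShred, and buildRepairRequests; not the actual RepairService code):

const std = @import("std");

const MissingShred = struct { slot: u64, index: u64 };

const RepairRequest = union(enum) {
    /// Request one specific missing shred.
    shred: MissingShred,
    /// Request the highest shred of a slot, which tells us how many shreds
    /// the slot has, or that the slot was skipped.
    highest_shred: u64,
};

/// Hypothetical sketch: queue one HighestShred request for every slot the
/// tracker knows about, plus Shred requests for specific missing indexes.
fn buildRepairRequests(
    tracked_slots: []const u64,
    missing: []const MissingShred,
    out: *std.ArrayList(RepairRequest),
) !void {
    for (tracked_slots) |slot| {
        // Always sent now, even if the slot currently looks complete and even
        // if many Shred requests were already queued for earlier slots.
        try out.append(.{ .highest_shred = slot });
    }
    for (missing) |m| {
        try out.append(.{ .shred = m });
    }
}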

Test

With these changes, the behavior is fixed. I no longer see the shred tracker getting stuck. It is now able to catch up from behind and remain caught up. I've tested this dozens of times starting from 1000 slots behind, and it catches up reliably within a minute or so and stays caught up. After about an hour it runs out of memory due to a leak, which is a separate issue that I'll fix in another PR.

Other changes

I made some other changes that are mainly for performance and simplification of the code.

In a way, I actually reduced the eagerness of sending HighestShred repairs for future slots. There's a difference between sending requests for slots we know about and are behind on (what I described increasing above) and sending requests for slots that don't even appear to exist yet (this is what I decreased). It's not necessary to constantly send hundreds of requests per second for future slots that haven't happened yet. It's good enough to send a few probing requests to see if we're behind. Otherwise, turbine will let us know where we are, and we can fill in the gaps with the logic described above.
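
As an illustration of the reduced eagerness, a hypothetical helper might probe only a handful of slots past the highest slot we have seen (names and offsets here are illustrative, not the actual code):

/// Hypothetical sketch: probe a few slots past the highest known slot instead
/// of requesting hundreds of future slots that may not exist yet.
fn futureProbeSlots(highest_known_slot: u64) [3]u64 {
    return .{
        highest_known_slot + 1,
        highest_known_slot + 2,
        highest_known_slot + 25, // one farther look-ahead probe
    };
}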

I made the repair service a bit more dynamic in its resource usage. When catching up from far behind, it needs to scale up the number of threads it uses to sign and serialize tens of thousands of repair requests. Once it's caught up, it can work with just a single thread.
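
A minimal sketch of that kind of scaling policy (hypothetical function and constants, not the actual thread-pool code):

/// Hypothetical sketch: derive the number of signing/serializing threads from
/// the size of the pending request batch, dropping back to one thread once
/// the node is caught up and batches are small.
fn threadCountForBatch(pending_requests: usize, max_threads: usize) usize {
    const requests_per_thread = 1000; // illustrative constant
    const wanted = pending_requests / requests_per_thread + 1;
    return @min(wanted, max_threads);
}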

I made some changes to the way that shreds are registered with the tracker by moving this into the ShredInserter. This unlocked the ability to also register recovered shreds, so we don't attempt to repair shreds that have already been recovered with Reed-Solomon. Eventually we can probably remove BasicShredTracker, since its role is partly handled by the Index in the blockstore. But for now the shred tracker is working fine.
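
A hedged sketch of the registration flow (hypothetical types and signatures, not the actual ShredInserter): both freshly inserted and Reed-Solomon-recovered data shreds are reported to the tracker, so recovered shreds never show up as missing to the repair service.

/// Hypothetical tracker interface, just for the sketch.
const Tracker = struct {
    fn registerDataShred(self: *Tracker, slot: u64, index: u64) void {
        _ = self;
        _ = slot;
        _ = index;
    }
};

const Shred = struct { slot: u64, index: u64 };

/// Hypothetical sketch: the inserter registers every data shred it stores,
/// whether it arrived over turbine/repair or was recovered with Reed-Solomon.
fn insertAndTrack(tracker: ?*Tracker, received: []const Shred, recovered: []const Shred) void {
    if (tracker) |t| {
        for (received) |s| t.registerDataShred(s.slot, s.index);
        for (recovered) |s| t.registerDataShred(s.slot, s.index);
    }
}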

@dnut dnut requested a review from yewman January 15, 2025 17:14
@yewman (Contributor) left a comment

Just a couple of questions, otherwise LGTM.

@@ -189,6 +191,15 @@ pub const ShredInserter = struct {
    const shred_source: ShredSource = if (is_repair) .repaired else .turbine;
    switch (shred) {
        .data => |data_shred| {
            if (shred_tracker) |tracker| {
                tracker.registerDataShred(&shred.data, milli_timestamp) catch |err|
                    switch (err) {
yewman (Contributor):

Maybe the switch should go in a block here and on line 290; the indentation looks a bit strange imo
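
For illustration, the two styles being compared look roughly like this (generic function and error names, not the actual code):

const ExampleError = error{ Recoverable, Fatal };

fn doInsert() ExampleError!void {
    return error.Recoverable;
}

// current style: the switch hangs directly off `catch |err|`
fn handleInline() ExampleError!void {
    doInsert() catch |err|
        switch (err) {
            error.Recoverable => {}, // ignore and continue
            else => return err,
        };
}

// suggested style: wrap the switch in a block so the indentation stays regular
fn handleInBlock() ExampleError!void {
    doInsert() catch |err| {
        switch (err) {
            error.Recoverable => {}, // ignore and continue
            else => return err,
        }
    };
}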

dnut (Contributor Author) replied.


fn maskBit(index: usize) ShredSet.MaskInt {
return @as(ShredSet.MaskInt, 1) << @as(ShredSet.ShiftInt, @truncate(index));
}
yewman (Contributor):

Suggested change
}
}

dnut (Contributor Author) replied.

}

/// returns whether it makes sense to send any repair requests
pub fn identifyMissing(
self: *Self,
slot_reports: *MultiSlotReport,
milli_timestamp: i64,
yewman (Contributor):

Thoughts on using sig time types?

dnut (Contributor Author) replied.

@dnut dnut requested a review from yewman January 22, 2025 16:30
@dnut dnut force-pushed the dnut/fix/shred-network/keepup branch from c4178ea to c258bfa Compare January 22, 2025 18:06
dnut added 9 commits January 22, 2025 13:06
…e true:

- we're caught at least 1024 slots behind or turbine isn't working
- the current bottom slot is skipped
- the next slot is skipped
- the slot 25 slots ahead is also skipped

with jitter, we only run into a problem if there are also 40 consecutive slots that are skipped, which is practically impossible
@dnut dnut force-pushed the dnut/fix/shred-network/keepup branch from c258bfa to 20934fe Compare January 22, 2025 18:06