fix(shred-network): keep up with turbine #493
base: main
Conversation
Just a couple questions otherwise lgtm
@@ -189,6 +191,15 @@ pub const ShredInserter = struct {
        const shred_source: ShredSource = if (is_repair) .repaired else .turbine;
        switch (shred) {
            .data => |data_shred| {
                if (shred_tracker) |tracker| {
                    tracker.registerDataShred(&shred.data, milli_timestamp) catch |err|
                        switch (err) {
Maybe the switch should go in a block here and on line 290; the indentation is a bit strange imo.
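For illustration, a minimal sketch of the block form being suggested; it mirrors the quoted diff, and the error names and the handling inside the arms are assumptions, not the actual code:

if (shred_tracker) |tracker| {
    tracker.registerDataShred(&shred.data, milli_timestamp) catch |err| {
        // wrapping the switch in a block keeps the arms at one indentation level
        switch (err) {
            // hypothetical error set and handling, just to show the shape:
            // ignore out-of-range slots, propagate anything else
            error.SlotUnderflow, error.SlotOverflow => {},
            else => return err,
        }
    };
}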
    fn maskBit(index: usize) ShredSet.MaskInt {
        return @as(ShredSet.MaskInt, 1) << @as(ShredSet.ShiftInt, @truncate(index));
    }
src/shred_network/shred_tracker.zig
    }

    /// returns whether it makes sense to send any repair requests
    pub fn identifyMissing(
        self: *Self,
        slot_reports: *MultiSlotReport,
        milli_timestamp: i64,
Thoughts on using sig time types?
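For context, this is roughly what the signature could look like with a wrapper type; sig.time.Timestamp is a hypothetical name used here for illustration, not necessarily a type sig actually provides:

    /// returns whether it makes sense to send any repair requests
    pub fn identifyMissing(
        self: *Self,
        slot_reports: *MultiSlotReport,
        timestamp: sig.time.Timestamp, // hypothetical wrapper instead of a bare i64 of milliseconds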
Force-pushed from c4178ea to c258bfa
…g, try to complete slots more eagerly
…e true:
- we're caught at least 1024 slots behind or turbine isn't working
- the current bottom slot is skipped
- the next slot is skipped
- the slot 25 slots ahead is also skipped

With jitter, we only run into a problem if there are also 40 consecutive slots that are skipped, which is practically impossible.
Signed-off-by: Drew Nutter <[email protected]>
Force-pushed from c258bfa to 20934fe
This change was primarily motivated by a bug: the shred collector would consistently get stuck at a certain slot and stop progressing.
Problem
What this means is that the shred tracker would think it hadn't received all the shreds for a certain slot, and it would keep trying to repair that slot. Eventually it would lag so far behind that it would stop tracking any future slots, because the shred tracker only worries about 1024 slots at a time. So the repair service would get into a degraded state with poor performance, because it was looking at a ton of slots unnecessarily, and it would lose the ability to repair slots beyond a certain point. Our metrics in Grafana would make it look like the shred collector had halted.
The general problem was that slots were skipped, but the shred tracker would not understand that they were skipped, so it would keep trying to repair them. There are a lot of complicated scenarios where slots can be skipped, and the repair service was not doing a good job of acquiring overall context about these scenarios.
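To make the failure mode concrete, here is a minimal self-contained sketch of a fixed 1024-slot tracking window; the names and layout are illustrative, not Sig's actual BasicShredTracker:

const std = @import("std");

const num_tracked_slots = 1024;

/// Toy model of a tracker that only watches a fixed window of slots.
const WindowTracker = struct {
    /// lowest slot still being tracked; it only advances once that slot completes
    bottom_slot: u64,
    complete: [num_tracked_slots]bool = [1]bool{false} ** num_tracked_slots,

    fn registerSlot(self: *WindowTracker, slot: u64) error{ SlotUnderflow, SlotOverflow }!void {
        if (slot < self.bottom_slot) return error.SlotUnderflow;
        if (slot >= self.bottom_slot + num_tracked_slots) return error.SlotOverflow;
        self.complete[slot - self.bottom_slot] = true;
    }
};

test "a stuck bottom slot makes newer slots unregisterable" {
    // If the bottom slot was skipped by the cluster but the tracker never learns
    // that, bottom_slot never advances, and once turbine is more than 1024 slots
    // ahead every new shred lands outside the window.
    var tracker = WindowTracker{ .bottom_slot = 100 };
    try tracker.registerSlot(100);
    try std.testing.expectError(error.SlotOverflow, tracker.registerSlot(100 + num_tracked_slots));
}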
Fix
The fix was to make the repair service more persistent about requesting high-level information about each slot. It uses HighestShred repairs to get general context about slots, and there were some situations where it wouldn't send those requests. For example, if it had already sent out a lot of Shred requests asking for specific shreds, it might stop sending out HighestShred repairs for the slots that came after them (which may have been skipped). It also did not send out HighestShred repairs if it already thought a slot was complete. I changed it to always send out HighestShred repairs for every slot it is aware of, rather than making assumptions about those slots or stopping due to reaching a limit.
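As a rough sketch of that policy (the type and function names here are assumptions for illustration, not Sig's actual repair types):

const std = @import("std");

const MissingShred = struct { slot: u64, index: u64 };

/// Assumed names for illustration only.
const RepairRequest = union(enum) {
    /// ask a peer for the highest shred it has for a slot (slot-level context)
    highest_shred: u64,
    /// ask for one specific shred by slot and index
    shred: MissingShred,
};

fn planRepairs(
    allocator: std.mem.Allocator,
    lowest_tracked_slot: u64,
    highest_tracked_slot: u64,
    missing: []const MissingShred,
) !std.ArrayList(RepairRequest) {
    var requests = std.ArrayList(RepairRequest).init(allocator);
    errdefer requests.deinit();

    // specific requests for shreds we already know are missing
    for (missing) |m| try requests.append(.{ .shred = m });

    // Previously, HighestShred requests could be dropped once enough Shred requests
    // were queued, or when a slot already looked complete. Now every tracked slot
    // gets one, so skipped slots are eventually identified.
    var slot = lowest_tracked_slot;
    while (slot <= highest_tracked_slot) : (slot += 1) {
        try requests.append(.{ .highest_shred = slot });
    }

    return requests;
}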
Test
With these changes, the behavior is fixed. I no longer see the shred tracker getting stuck. It is now able to catch up from behind and remain caught up. I've tested this dozens of times starting from 1000 slots behind, and it catches up reliably within a minute or so, then stays caught up. After about an hour it runs out of memory due to a leak, which is a separate issue that I'll fix in another PR.
Other changes
I made some other changes that are mainly for performance and simplification of the code.
In a way I actually reduced the eagerness of sending HighestShred repairs for future slots. There's a difference between sending out requests for slots we know about and are caught behind (what I described increasing above) and sending out requests for slots that don't even appear to exist (this is what I decreased). It's not necessary to constantly send out hundreds of requests per second for future slots that have not happened yet. It's good enough to just send out a few probing requests to see if we're behind. Otherwise turbine will let us know where we are, and we can fill in the gaps with the logic described above.
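A small sketch of that probing in spirit; the probe count is an assumption, not the value used in the code:

const std = @import("std");

const num_future_slot_probes = 4; // assumption: a small constant, not hundreds per second

/// Write a handful of probe slots just past the highest slot we know about.
fn planFutureProbes(highest_known_slot: u64, out: []u64) []u64 {
    const n = @min(num_future_slot_probes, out.len);
    for (out[0..n], 0..) |*slot, i| slot.* = highest_known_slot + 1 + i;
    return out[0..n];
}

test "only a handful of probes past the tip" {
    var buf: [16]u64 = undefined;
    const probes = planFutureProbes(5000, &buf);
    try std.testing.expectEqual(@as(usize, num_future_slot_probes), probes.len);
}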
I made the repair service a bit more dynamic in its resource usage. When catching up from far behind, it needs to scale up the number of threads it uses to sign and serialize tens of thousands of repair requests. Once it's caught up, it can work with just a single thread.
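A sketch of the idea (the threshold is an assumption): pick the worker thread count from the size of the current repair batch instead of using a fixed pool.

const std = @import("std");

fn repairThreadCount(num_requests: usize, max_threads: usize) usize {
    // one thread is plenty once caught up; scale up when catching up from far behind
    const requests_per_thread = 4096; // illustrative threshold
    const wanted = std.math.divCeil(usize, num_requests, requests_per_thread) catch max_threads;
    return @max(1, @min(wanted, max_threads));
}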
I made some changes to the way that shreds are registered with the tracker by moving registration into the ShredInserter. This unlocked the ability to also register recovered shreds, so we don't attempt to repair shreds that have already been recovered with Reed-Solomon. Eventually we can probably remove BasicShredTracker, since its role is partly handled by the Index in the blockstore. But for now the shred tracker is working fine.