fix(ledger,shred-network): memory leak #510

Open · dnut wants to merge 7 commits into dnut/fix/shred-network/keepup from dnut/fix/shred-network/leak

Conversation

@dnut dnut commented Jan 22, 2025

This PR is stacked on top of #493 and needs to merge after that.

Leaks fixed in this PR

  • rocksdb WriteBatch was leaking all of the data stored in the batch before committing. This was fixed in rocksdb-zig by using the correct rocksdb_writebatch_destroy function to deinit the batch after committing (see the sketch after this list).
  • All shreds received over the network were leaking. This was fixed by deiniting them after calling insertShreds. This also coincided with a change that clarifies the lifetime of shreds, described below.
  • Recovered shreds were leaking. This was fixed by freeing them after inserting them. Sending recovered shreds to retransmit requires some additional thought (it is currently not fully implemented).
  • RepairPeerProvider was caching repair peers for 128 slots, which means the entire list of nodes (pubkey + ip + repair port) from the gossip table was being duplicated 128 times. This is extreme overkill, so I changed it to 8 slots. I saw memory usage go down from 1 GB to 6 MB, which I'm not sure how to explain, since that's a much bigger improvement than expected. Further testing may reveal high variance in RepairPeerProvider's memory usage.
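
For reference, the shape of the WriteBatch fix is to pair every committed batch with rocksdb_writebatch_destroy, which releases the batch's internal buffer. A minimal sketch against the raw RocksDB C API (the actual wrapper in rocksdb-zig is structured differently):

```zig
const c = @cImport(@cInclude("rocksdb/c.h"));

// Commit a write batch and always destroy it afterwards.
// Without rocksdb_writebatch_destroy, everything ever put into the
// batch stays allocated — the leak this PR fixes.
fn commitAndDestroy(
    db: *c.rocksdb_t,
    opts: *const c.rocksdb_writeoptions_t,
    batch: *c.rocksdb_writebatch_t,
) !void {
    defer c.rocksdb_writebatch_destroy(batch);
    var err: [*c]u8 = null;
    c.rocksdb_write(db, opts, batch, &err);
    if (err != null) {
        defer c.rocksdb_free(err);
        return error.RocksDBWriteFailed;
    }
}
```

The defer ensures the batch is destroyed on both the success and error paths, so the fix holds even when the write itself fails.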

Shred lifetime clarification

Previously there were a bunch of TODOs in the code about figuring out the lifetime of data used during insertShreds. This led to the memory leak and would also have led to memory errors once the ShredInserter was hooked up to other validator components that read its data. I cleaned this up a bit by establishing and implementing some basic guidelines for shred lifetimes.

  1. Shreds received over the network are owned by the caller to insertShreds. The ShredInserter must treat them as a reference that will die when insertShreds returns.
  2. Shreds returned from insertShreds are owned by the caller to insertShreds. This means their lifetime must exceed the insertShreds call.
  3. Any shreds created during insertShreds must be either returned or deinitialized before returning.

To satisfy points 2 and 3, it became necessary to clone any shreds that are returned by insertShreds. I explored some alternatives to avoid this cloning, such as reference counting. As a first step, I tried to make Shred immutable, so you'll see some changes in this PR that are steps in the direction of Shred being immutable. But this turned into a very deep rabbit hole that wasted about a day of my time, so I decided it was out of scope and went with the basic cloning approach for this PR. I split off most of the WIP immutable-Shred code into a separate branch here: dnut/refactor/ledger/immutable-shred
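
A minimal sketch of the resulting contract, using a simplified stand-in for the real Shred type (the clone/deinit helpers and the insertion predicate here are illustrative, not the actual ShredInserter API):

```zig
const std = @import("std");

// Simplified stand-in for the real Shred: just enough to show where
// ownership of the heap-allocated payload lives.
const Shred = struct {
    payload: []u8,

    fn clone(self: Shred, allocator: std.mem.Allocator) !Shred {
        return .{ .payload = try allocator.dupe(u8, self.payload) };
    }

    fn deinit(self: Shred, allocator: std.mem.Allocator) void {
        allocator.free(self.payload);
    }
};

// Rule 1: `shreds` is borrowed and dies when this call returns.
// Rule 2: the returned list is owned by the caller, so entries are
//         cloned rather than aliased into the borrowed input.
// Rule 3: anything created here and not returned is cleaned up
//         before returning (errdefer covers the failure paths).
fn insertShreds(
    allocator: std.mem.Allocator,
    shreds: []const Shred,
) !std.ArrayList(Shred) {
    var returned = std.ArrayList(Shred).init(allocator);
    errdefer {
        for (returned.items) |s| s.deinit(allocator);
        returned.deinit();
    }
    for (shreds) |shred| {
        // stand-in for the inserter's real "should this be returned" logic
        if (shred.payload.len != 0) {
            const copy = try shred.clone(allocator);
            errdefer copy.deinit(allocator);
            try returned.append(copy);
        }
    }
    return returned;
}
```

On the caller side (e.g. the shred processor), rule 1 means the inputs are freed by the caller after the call returns, independent of what insertShreds kept or returned.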

@dnut changed the title from fix(ledger): memory leak to fix(ledger,shred-network): memory leak Jan 22, 2025
@dnut requested a review from dadepo January 22, 2025 16:32
@dnut marked this pull request as ready for review January 22, 2025 16:34
@dnut self-assigned this Jan 22, 2025
@dnut force-pushed the dnut/fix/shred-network/keepup branch 2 times, most recently from c258bfa to 20934fe January 22, 2025 18:06
dnut added 7 commits January 22, 2025 13:08
this is a first step towards shreds themselves being immutable. the mutations now act on a mutable shred slice instead of the shred itself. but the shred still needs to be mutated sometimes, so payloadMut serves this purpose until mutation of Shred is eliminated
previously any shreds created in the shred processor or insertShreds leaked.

this establishes clarified lifetime rules for ShredInserter.insertShreds:
- shreds passed in are owned by the caller and must be freed by the caller (applied in the shred processor)
- shreds created in the function need to be cleaned up in the function, unless...
- shreds returned from the function need to be owned by the caller, so they need a lifetime that exceeds the insertShreds call and the shreds that were passed in. this is currently solved by duping the shreds.
…f 16

in my test I saw its memory usage go down from 1 GB to 6 MB, which is obviously a much bigger drop than 16x, so I'm not sure how easy it is to isolate the impact of this change. nonetheless I don't think this cache needs to cover 128 slots, since repair typically does not need to go back more than a few slots. there is some potential to optimize this for the startup process of catching up from behind, but it's working as is in my testing
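
To make the sizing concrete, here is a purely illustrative sketch of the cache shape (the real RepairPeerProvider internals differ; MAX_CACHED_SLOTS and the eviction policy are assumptions for illustration). Each cached slot holds its own copy of the peer list built from the gossip table, so memory scales directly with the slot capacity:

```zig
const std = @import("std");

// Illustrative only: not the real RepairPeerProvider from sig.
const RepairPeer = struct {
    pubkey: [32]u8, // node identity from the gossip table
    addr: std.net.Address, // ip + repair port
};

// 128 cached slots meant 128 duplicated copies of the full peer
// list; 8 bounds that duplication while still covering the few
// slots repair typically looks back.
const MAX_CACHED_SLOTS = 8;

const CachedPeers = struct {
    slot: u64,
    peers: []RepairPeer, // owned copy, freed on eviction
};

const RepairPeerCache = struct {
    allocator: std.mem.Allocator,
    entries: std.BoundedArray(CachedPeers, MAX_CACHED_SLOTS) = .{},

    // Evict the oldest entry once the cache is full, then take
    // ownership of the new slot's peer list.
    fn put(self: *RepairPeerCache, slot: u64, peers: []RepairPeer) void {
        if (self.entries.len == MAX_CACHED_SLOTS) {
            const evicted = self.entries.orderedRemove(0);
            self.allocator.free(evicted.peers);
        }
        self.entries.appendAssumeCapacity(.{ .slot = slot, .peers = peers });
    }
};
```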
@dnut force-pushed the dnut/fix/shred-network/leak branch from 7185a91 to f2a99d8 January 22, 2025 18:09