fix(ledger,shred-network): memory leak #510

Open · dnut wants to merge 7 commits into dnut/fix/shred-network/keepup from dnut/fix/shred-network/leak

Conversation

@dnut dnut commented Jan 22, 2025

This PR is stacked on top of #493 and needs to merge after that.

Leaks fixed in this PR

  • rocksdb WriteBatch was leaking all of the data stored in the batch before committing. This was fixed in rocksdb-zig by using the correct rocksdb_writebatch_destroy function to deinit the batch after committing (see the sketch after this list).
  • All shreds received over the network were leaking. This was fixed by deiniting them after calling insertShreds. This also coincided with a change that clarifies the lifetime of shreds, described below.
  • Recovered shreds were leaking. This was fixed by freeing them after inserting them. Sending recovered shreds to retransmit requires some additional thought (it is currently not fully implemented).
  • RepairPeerProvider was caching repair peers for 128 slots, which means the entire list of nodes (pubkey + ip + repair port) from the gossip table was being duplicated 128 times. This is extreme overkill, so I changed it to 8 slots. I saw memory usage go down from 1 GB to 6 MB, which I'm not sure how to explain, since that's a much bigger improvement than expected. Further testing may reveal high variance in RepairPeerProvider's memory usage.
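
For reference, the shape of the WriteBatch fix is to pair every committed batch with rocksdb_writebatch_destroy, which releases the batch's internal buffer. A minimal sketch against the raw RocksDB C API (the actual wrapper in rocksdb-zig is structured differently):

```zig
const c = @cImport(@cInclude("rocksdb/c.h"));

// Commit a write batch and always destroy it afterwards.
// Without rocksdb_writebatch_destroy, everything ever put into the
// batch stays allocated — the leak this PR fixes.
fn commitAndDestroy(
    db: *c.rocksdb_t,
    opts: *const c.rocksdb_writeoptions_t,
    batch: *c.rocksdb_writebatch_t,
) !void {
    defer c.rocksdb_writebatch_destroy(batch);
    var err: [*c]u8 = null;
    c.rocksdb_write(db, opts, batch, &err);
    if (err != null) {
        defer c.rocksdb_free(err);
        return error.RocksDBWriteFailed;
    }
}
```

The defer ensures the batch is destroyed on both the success and error paths, so the fix holds even when the write itself fails.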

Shred lifetime clarification

Previously there were a bunch of TODOs in the code about figuring out the lifetime of data used during insertShreds. This led to the memory leak and would also have led to memory errors once the ShredInserter was hooked up to other validator components that read its data. I cleaned this up a bit by establishing and implementing some basic guidelines for shred lifetimes.

  1. Shreds received over the network are owned by the caller to insertShreds. The ShredInserter must treat them as a reference that will die when insertShreds returns.
  2. Shreds returned from insertShreds are owned by the caller to insertShreds. This means their lifetime must exceed the insertShreds call.
  3. Any shreds created during insertShreds must be either returned or deinitialized before returning.

To satisfy points 2 and 3, it became necessary to clone any shreds that are returned by insertShreds. I explored some alternatives to avoid this cloning, such as reference counting. As a first step, I tried to make Shred immutable, so you'll see some changes in this PR that are steps in the direction of Shred being immutable. But this turned into a very deep rabbit hole that wasted about a day of my time, so I decided it was out of scope and went with the basic cloning approach for this PR. I split off most of the WIP immutable-Shred code into a separate branch here: dnut/refactor/ledger/immutable-shred
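
A minimal sketch of the resulting contract, using a simplified stand-in for the real Shred type (the clone/deinit helpers and the insertion predicate here are illustrative, not the actual ShredInserter API):

```zig
const std = @import("std");

// Simplified stand-in for the real Shred: just enough to show where
// ownership of the heap-allocated payload lives.
const Shred = struct {
    payload: []u8,

    fn clone(self: Shred, allocator: std.mem.Allocator) !Shred {
        return .{ .payload = try allocator.dupe(u8, self.payload) };
    }

    fn deinit(self: Shred, allocator: std.mem.Allocator) void {
        allocator.free(self.payload);
    }
};

// Rule 1: `shreds` is borrowed and dies when this call returns.
// Rule 2: the returned list is owned by the caller, so entries are
//         cloned rather than aliased into the borrowed input.
// Rule 3: anything created here and not returned is cleaned up
//         before returning (errdefer covers the failure paths).
fn insertShreds(
    allocator: std.mem.Allocator,
    shreds: []const Shred,
) !std.ArrayList(Shred) {
    var returned = std.ArrayList(Shred).init(allocator);
    errdefer {
        for (returned.items) |s| s.deinit(allocator);
        returned.deinit();
    }
    for (shreds) |shred| {
        // stand-in for the inserter's real "should this be returned" logic
        if (shred.payload.len != 0) {
            const copy = try shred.clone(allocator);
            errdefer copy.deinit(allocator);
            try returned.append(copy);
        }
    }
    return returned;
}
```

On the caller side (e.g. the shred processor), rule 1 means the inputs are freed by the caller after the call returns, independent of what insertShreds kept or returned.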

@dnut changed the title from fix(ledger): memory leak to fix(ledger,shred-network): memory leak Jan 22, 2025
@dnut requested a review from dadepo January 22, 2025 16:32
@dnut marked this pull request as ready for review January 22, 2025 16:34
@dnut self-assigned this Jan 22, 2025
@dnut force-pushed the dnut/fix/shred-network/keepup branch 2 times, most recently from c258bfa to 20934fe January 22, 2025 18:06
dnut added 7 commits January 22, 2025 13:08
this is a first step towards shreds themselves being immutable. the mutations now act on a mutable shred slice instead of the shred itself. but the shred still needs to be mutated sometimes, so payloadMut serves this purpose until mutation of Shred is eliminated
previously any shreds created in the shred processor or insertShreds leaked.

this establishes clarified lifetime rules for ShredInserter.insertShreds:
- shreds passed in are owned by the caller and must be freed by the caller (applied in the shred processor)
- shreds created in the function need to be cleaned up in the function, unless...
- shreds returned from the function need to be owned by the caller, so they need a lifetime that exceeds the insertShreds call and the shreds that were passed in. this is currently solved by duping the shreds.
…f 16

in my test I saw its memory usage go down from 1 GB to 6 MB, which is obviously a much bigger drop than 16x, so I'm not sure how easy it is to isolate the impact of this change. nonetheless I don't think this cache needs to cover 128 slots, since repair typically does not need to go back more than a few slots. there is some potential to optimize this for the startup process of catching up from behind, but it's working as is in my testing
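
To make the sizing concrete, here is a purely illustrative sketch of the cache shape (the real RepairPeerProvider internals differ; MAX_CACHED_SLOTS and the eviction policy are assumptions for illustration). Each cached slot holds its own copy of the peer list built from the gossip table, so memory scales directly with the slot capacity:

```zig
const std = @import("std");

// Illustrative only: not the real RepairPeerProvider from sig.
const RepairPeer = struct {
    pubkey: [32]u8, // node identity from the gossip table
    addr: std.net.Address, // ip + repair port
};

// 128 cached slots meant 128 duplicated copies of the full peer
// list; 8 bounds that duplication while still covering the few
// slots repair typically looks back.
const MAX_CACHED_SLOTS = 8;

const CachedPeers = struct {
    slot: u64,
    peers: []RepairPeer, // owned copy, freed on eviction
};

const RepairPeerCache = struct {
    allocator: std.mem.Allocator,
    entries: std.BoundedArray(CachedPeers, MAX_CACHED_SLOTS) = .{},

    // Evict the oldest entry once the cache is full, then take
    // ownership of the new slot's peer list.
    fn put(self: *RepairPeerCache, slot: u64, peers: []RepairPeer) void {
        if (self.entries.len == MAX_CACHED_SLOTS) {
            const evicted = self.entries.orderedRemove(0);
            self.allocator.free(evicted.peers);
        }
        self.entries.appendAssumeCapacity(.{ .slot = slot, .peers = peers });
    }
};
```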
@dnut force-pushed the dnut/fix/shred-network/leak branch from 7185a91 to f2a99d8 January 22, 2025 18:09