
feat(event cache): avoid storing useless prev-batch tokens #4427

Merged · 9 commits · Dec 19, 2024

Conversation

@bnjbvr (Member) commented Dec 17, 2024

We currently have a problem with the event cache storage: as soon as we receive a previous-batch token, either from sync or from a previous back-pagination, we consider that we may have more events to retrieve from the past. Later, after running back-paginations, we may realize those events are duplicates. But since each back-pagination returns yet another previous-batch token until we hit the start of the timeline, we get stuck in back-pagination mode until we've reached the start of the timeline again.

That's bad, because it makes the event cache store basically useless: every time you restore a session, you may receive a previous-batch token from sync (even more so with sliding sync, which gives one previous-batch token when timeline_limit=1, then another one when the timeline limit expands to 20). And so you back-paginate all the way back to the start of the history.

This series of commits fixes that, by introducing two new rules:

  • in back-pagination, don't assume the absence of a gap always means we're waiting for a prev-batch token from sync. This is only true if there were no events stored in the linked chunk before; if there are any, then we don't need to restart back-pagination from the end of the timeline to the beginning.
  • in back-pagination or sync, don't store a previous-batch token if all the events we've received (at least one) were already known (i.e. are duplicates).
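As a toy sketch (hypothetical names, not the SDK's actual types), the second rule boils down to a small predicate over the deduplication outcome of a batch:

```rust
// Hypothetical sketch of the second rule: track how many events a batch
// contained and how many of those were already known, and only keep the
// prev-batch token when at least one event was genuinely new.
struct DeduplicationOutcome {
    num_events: usize,
    num_duplicated: usize,
}

impl DeduplicationOutcome {
    /// True iff the batch contained at least one event and all of them
    /// were already known (duplicates).
    fn all_duplicates(&self) -> bool {
        self.num_events > 0 && self.num_events == self.num_duplicated
    }
}

/// Keep the token only when the batch taught us something new. An empty
/// batch carries no information, so the token is kept in that case too.
fn should_store_prev_batch(outcome: &DeduplicationOutcome) -> bool {
    !outcome.all_duplicates()
}

fn main() {
    // 3 events received, all already in the linked chunk: drop the token.
    assert!(!should_store_prev_batch(&DeduplicationOutcome { num_events: 3, num_duplicated: 3 }));
    // 3 events received, one of them new: store the token.
    assert!(should_store_prev_batch(&DeduplicationOutcome { num_events: 3, num_duplicated: 2 }));
    // Empty batch: no conclusion, keep the token.
    assert!(should_store_prev_batch(&DeduplicationOutcome { num_events: 0, num_duplicated: 0 }));
    println!("ok");
}
```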

With these two rules, we're not storing useless previous-batch tokens anymore, and we'll back-paginate at most once for a given room. Interestingly, this might uncover some bugs related to back-pagination ordering, so we'll desperately want #4408 <3

Part of #3280.

@bnjbvr bnjbvr requested a review from a team as a code owner December 17, 2024 17:05
@bnjbvr bnjbvr requested review from stefanceriu and removed request for a team December 17, 2024 17:05
@Hywan Hywan requested review from Hywan and removed request for stefanceriu December 18, 2024 12:54
@Hywan (Member) commented Dec 18, 2024

Stealing the review from @stefanceriu because <3.

…decryption` more stable

When starting to back-paginate, in this test, we:

- either have a previous-batch token, that points to the first event
*before* the message was sent,
- or have no previous-batch token, because we stopped sync before
receiving the first sync result.

Because of the behavior introduced in 944a922, we don't restart
back-paginating from the end if we've reached the start. Now, if we are
in the case described by the first bullet item, then we may back-paginate
until the start of the room and stop there, because we've back-paginated
all events. And so we'll never see the message sent by Alice after we
stopped syncing.

One solution to get to the desired state is to clear the internal state
of the room event cache, thus deleting the previous-batch token, thus
causing the situation described in the second bullet item. This achieves
what we want, that is, back-paginating from the end of the timeline.
@bnjbvr force-pushed the bnjbvr/no-useless-gaps branch from baa3623 to 47193f1 on December 18, 2024 13:39
codecov bot commented Dec 18, 2024

Codecov Report

Attention: Patch coverage is 98.33333% with 1 line in your changes missing coverage. Please review.

Project coverage is 85.42%. Comparing base (373709f) to head (b986f96).
Report is 26 commits behind head on main.

Files with missing lines Patch % Lines
crates/matrix-sdk/src/event_cache/pagination.rs 97.05% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #4427   +/-   ##
=======================================
  Coverage   85.42%   85.42%           
=======================================
  Files         283      283           
  Lines       31490    31518   +28     
=======================================
+ Hits        26899    26923   +24     
- Misses       4591     4595    +4     


@bnjbvr (Member, Author) commented Dec 18, 2024

Fwiw, there's actually a small issue with back-paginations: if we're not storing a previous-batch token, we should also not replace the gap with the events returned by back-pagination.

Consider the following timeline, that exists as such on the server, where the numbers represent event ids: [0, 1, 2, 3, 4, 5].

Let's take the notation that G(n) is a token for a back-pagination (gap, hence G), starting at (including) the n-th event, and going backwards. For instance, using the pagination token G(3) by requesting 2 events will return [2, 3] (in reverse order, but that's an implementation detail we don't care about here).

Conceptually, a token for back-pagination may be present in the timeline everywhere there's a timeline gap, so we can have a mixed representation of a timeline, like [G(3), 4, 5]: it starts with a pagination token that would help fetch events backwards from 3 included, and then the timeline contains events 4 and 5.

We start sliding sync:

  • first, with a timeline_limit set to 1.
    • We receive from the server [G(4), 5]
    • our local state becomes [G(4), 5]: we have the pagination token for events 4 and before; we received event 5
  • then, we subscribe to the room, and expand the timeline_limit to 3.
    • We receive from the server [G(2), 3, 4, 5].
    • After sync, we "push" (and deduplicate) the sync response, so our local state becomes [G(4), G(2), 3, 4, 5], after deduplicating 5.
  • we use /messages and consume G(2), requesting 3 events.
    • We receive from the server [1, 2]
    • We're replacing G(2) with [1, 2], so our local state becomes [G(4), 1, 2, 3, 4, 5].
    • There's no new previous-batch token because we've hit the start of the timeline.
  • we use /messages and consume G(4), requesting 3 events.
    • We receive from the server [G(1), 2, 3, 4].
    • With this PR, we're not storing G(1), because all the returned events are known.
    • We're replacing G(4) with the server response, and deduplicate as we go along: our local state becomes [2, 3, 4, 1, 5] 💥
    • Because we haven't stored G(1), we can't start another back-pagination that would fix the order by deduplicating (1) into its final correct position anymore 🤯

I think the solution is to not do anything when all the events are known. After all, they are at their right location in the linked chunk, and as such, they shouldn't move anymore. I'll need to think a bit more about this, to make sure this is the right solution.
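The corruption in the last two steps can be reproduced with a toy model of the timeline (hypothetical types and function; the real linked-chunk code is more involved): naively splicing the batch in at the gap's position, after deduplicating existing copies, scrambles the order.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Item {
    Gap(&'static str),
    Event(u32),
}
use Item::*;

// Naive gap replacement as described above: drop the gap, deduplicate the
// copies of the paginated events that already live elsewhere in the
// timeline, then splice the batch in at the gap's old position. Since
// deduplication may have removed items after that position's neighbors,
// the splice can land events before older ones.
fn replace_gap_naive(timeline: &mut Vec<Item>, gap: &'static str, new_events: &[u32]) {
    let pos = timeline.iter().position(|i| *i == Gap(gap)).unwrap();
    timeline.remove(pos);
    timeline.retain(|i| !matches!(i, Event(id) if new_events.contains(id)));
    let pos = pos.min(timeline.len());
    for (offset, id) in new_events.iter().enumerate() {
        timeline.insert(pos + offset, Event(*id));
    }
}

fn main() {
    // Local state after consuming G(2): [G(4), 1, 2, 3, 4, 5].
    let mut timeline = vec![Gap("G4"), Event(1), Event(2), Event(3), Event(4), Event(5)];
    // Consuming G(4) returns [2, 3, 4]; G(1) is not stored since all events are known.
    replace_gap_naive(&mut timeline, "G4", &[2, 3, 4]);
    // The ordering is now corrupted: [2, 3, 4, 1, 5].
    assert_eq!(
        timeline,
        vec![Event(2), Event(3), Event(4), Event(1), Event(5)]
    );
}
```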

@Hywan (Member) left a comment


I'm not sure I understand whether more work is required, but so far it looks good. I'm approving the PR :-).

Thank you for taking care of the tests.

crates/matrix-sdk/src/event_cache/room/events.rs (review thread, resolved)
Comment on lines 434 to 436
pub fn deduplicated_all_new_events(&self) -> bool {
self.num_new_unique > 0 && self.num_new_unique == self.num_duplicated
}
Member commented:
Can't you simply test self.num_duplicated > 0?

Actually, I don't understand the code here. The code, the doc, and the method name don't seem to do the same thing. Can you revisit this a bit please?

Member (Author) commented:

Can't you simply test self.num_duplicated > 0?

No, what we're trying to know is:

  • there was at least 1 new event
  • AND all the new events we saw have been deduplicated

Which means we don't have any work to do (i.e. we don't need to update the current chunks, based on the assumption they were correctly positioned), and we should stop storing the previous-batch tokens (since we already knew about the events; if we had needed a previous-batch token, we would have stored it already).
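For illustration (assuming `num_new_unique` counts the unique events in the received batch and `num_duplicated` counts how many of those were already stored), a mixed batch shows why `num_duplicated > 0` alone isn't enough:

```rust
// Hedged illustration of the condition, using the field names from the
// snippet above; the field semantics are an assumption, not the SDK's
// exact bookkeeping.
struct Report {
    num_new_unique: usize,
    num_duplicated: usize,
}

impl Report {
    fn deduplicated_all_new_events(&self) -> bool {
        self.num_new_unique > 0 && self.num_new_unique == self.num_duplicated
    }
}

fn main() {
    // Mixed batch: 3 unique events, only 1 already known. The simpler
    // check `num_duplicated > 0` holds, yet 2 events are genuinely new,
    // so we cannot drop the previous-batch token here.
    let mixed = Report { num_new_unique: 3, num_duplicated: 1 };
    assert!(mixed.num_duplicated > 0);
    assert!(!mixed.deduplicated_all_new_events());

    // Fully duplicated batch: nothing new at all, drop the token.
    let all_dup = Report { num_new_unique: 3, num_duplicated: 3 };
    assert!(all_dup.deduplicated_all_new_events());
}
```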

The code, the doc, and the method name don't seem to do the same thing. Can you revisit this a bit please?

Yep, I will expand upon this comment, and likely revisit the approach: the last commit, which fixes the ordering issue observed in #4427 (comment), makes it a bit unwieldy now, and maybe we can do away with AddEventReport entirely.

Thanks!

Member (Author) commented:

I've rewritten the last commit and put a super large comment, hopefully it helps. Let me know if it doesn't, post-merge, and we can rephrase it 🙏

Member commented:

I like that you've removed AddEventReport; it seems cleaner this way. I might have preferred an Option<T> instead of (bool, T), but that's personal taste (I believe it matches std a bit better, but that's a detail; I'll try to submit a PR so we don't forget and can discuss it).

Member commented:

Hmm, it conflicts with another API that already returns an Option. It would end up as Result<Option<Option<Position>>, Error>, for example, which is a bit ugly… Let's say we are good here.
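A hypothetical sketch of that shape clash (`Position` and both signatures are made up for illustration): the nested shape forces callers to distinguish two layers of `None`, while the flat pair keeps both facts at one level.

```rust
// Hypothetical types, for illustration only.
#[derive(Debug, PartialEq)]
struct Position(usize);

type Error = String;

// Nested shape: the outer Option says "did deduplication report anything?",
// the inner Option says "was a position found?". Callers must remember
// which layer means what.
fn nested(dedup: bool, found: bool) -> Result<Option<Option<Position>>, Error> {
    if !dedup {
        return Ok(None);
    }
    Ok(Some(found.then(|| Position(0))))
}

// Flat shape: the (bool, Option<Position>) pair keeps both facts visible
// at the same level.
fn flat(dedup: bool, found: bool) -> Result<(bool, Option<Position>), Error> {
    Ok((dedup, found.then(|| Position(0))))
}

fn main() {
    assert_eq!(nested(true, false), Ok(Some(None)));
    // Which None is this one? Only documentation can tell.
    assert_eq!(nested(false, false), Ok(None));
    assert_eq!(flat(true, false), Ok((true, None)));
}
```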

…'t cause meaningful changes

See comment on top of `deduplicated_all_new_events`.
@bnjbvr force-pushed the bnjbvr/no-useless-gaps branch from c11b4ce to b986f96 on December 19, 2024 13:03
@bnjbvr bnjbvr enabled auto-merge (rebase) December 19, 2024 13:04
@bnjbvr bnjbvr merged commit bc8c4f5 into main Dec 19, 2024
39 checks passed
@bnjbvr bnjbvr deleted the bnjbvr/no-useless-gaps branch December 19, 2024 13:19