Dendrite 0.6.2 fails to sync/federate #2150
Comments
Can you please see if this behaviour is any better as of commit a2b4777? |
Hi, I'm seeing the same issue. I can send messages within the instance, but anything from outside doesn't come through. I'm seeing a lot of the following errors in my log:
@neilalexander I tried to build the container from commit a2b4777 however the issue persists and the above stated error messages are still flowing. |
With a2b4777, federation wasn't perfect, but it was far better than 0.6.2 (where just about everything was locked up). |
I just updated to 0.6.3 and federation is still broken. When grepping for |
Broken for me as well in 0.6.3. Downgrading to 0.6.0 doesn't help. Downgrading further seems to require a database rollback, which is not an option as these backups have expired already. |
Can you please try the latest |
I tried a new Docker container with 5106cc8 and the issue is still present. In addition, the inter-instance communication is now broken as well. Also, when logging in via my Android phone, the app performed an initial sync despite me having been logged in before. When grepping for I'm attaching here the output of |
Using 5106cc8 helps a little. Now rooms all correctly sync up immediately after restarting dendrite. Over the next 10-30 minutes the problem rooms will start to drop events. |
Unfortunately 5106cc8 doesn't help at all here. I have not a single message after February 8 in any of the rooms I'm in. At that time I was running 0.6 since January 29. So for me the problem started with 0.6, but only ~10 days after upgrading to it. |
Short update: 5106cc8 seems to help. On Sunday, after deploying, my instance was still silent, but today I see that some of the noisier channels are finally filling again with messages from Sunday onwards. There is still a message hole between the date I updated to 0.6 and Sunday. I'll keep monitoring this, but I'm mildly optimistic that 5106cc8 resolves the issue or at least mitigates it. |
I am having the same issue, lots of |
FWIW Out of curiosity, are you all running the internal NATS deployment built into Dendrite or standalone NATS Server? If any of you are running a standalone NATS Server, which options are you running with? |
I'm also using the NATS built directly into Dendrite, with the same config as provided in the project's example YAML:
    # Configuration for NATS JetStream
    jetstream:
      # A list of NATS Server addresses to connect to. If none are specified, an
      # internal NATS server will be started automatically when running Dendrite
      # in monolith mode. It is required to specify the address of at least one
      # NATS Server node if running in polylith mode.
      #addresses:
      #  - jetstream:4222
      # Keep all NATS streams in memory, rather than persisting it to the storage
      # path below. This option is present primarily for integration testing and
      # should not be used on a real world Dendrite deployment.
      in_memory: false
      # Persistent directory to store JetStream streams in. This directory
      # should be preserved across Dendrite restarts.
      storage_path: ./
      # The prefix to use for stream names for this homeserver - really only
      # useful if running more than one Dendrite on the same NATS deployment.
      topic_prefix: Dendrite
    # Configuration for Prometheus metric collection. |
@neilalexander Do you want me to file another issue for the |
Thanks, that seems to have helped. Also using built-in. |
OK, so to understand what's really going on, I could use a goroutine trace and a profile from Dendrites that are experiencing these issues. To do this, you need to start Dendrite with the Then the next time you run into problems, capture the following profiles:
... and then upload all three files along with the commit ID that you are running — they don't contain configuration or anything sensitive (apart from possibly the folder names that Dendrite was built in) so should be safe to share. The two |
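The exact commands from this comment did not survive on this page, so the following is only a minimal sketch of one way to capture such files. It assumes the profiling listener serves the standard Go `net/http/pprof` handlers. The address `localhost:65432`, the output file names, and the choice of these three profiles (goroutine dump, heap, 30-second CPU profile) are assumptions for illustration, not the maintainer's instructions.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// profileTargets maps output file names to standard net/http/pprof paths.
// The file names and the selection of profiles are assumptions.
func profileTargets(base string) map[string]string {
	return map[string]string{
		"goroutines.txt": base + "/debug/pprof/goroutine?debug=2",
		"heap.pprof":     base + "/debug/pprof/heap",
		"profile.pprof":  base + "/debug/pprof/profile?seconds=30",
	}
}

// fetch downloads url and writes the response body to the file at path.
func fetch(client *http.Client, url, path string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s: unexpected status %s", url, resp.Status)
	}
	out, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := io.Copy(out, resp.Body); err != nil {
		out.Close()
		return err
	}
	return out.Close()
}

func main() {
	// Hypothetical address: use wherever your profiling listener is bound.
	base := "http://localhost:65432"
	// The CPU profile blocks for its 30s sampling window, so allow a longer timeout.
	client := &http.Client{Timeout: 2 * time.Minute}
	for name, url := range profileTargets(base) {
		if err := fetch(client, url, name); err != nil {
			fmt.Fprintln(os.Stderr, "capture failed:", err)
			os.Exit(1)
		}
		fmt.Println("wrote", name)
	}
}
```

Run it while the server is in the stuck state; a capture taken after a restart will not show the blocked goroutines.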
A lot of those issues will be genuine connection errors or bad keys so I wouldn't worry about those log lines unless you are having problems with E2EE specifically — in that case best to open a separate issue. |
I think I'm seeing this too. Deleting the jetstream folder and restarting dendrite does fix it but only temporarily and after a while my Element will stop connecting properly again. The log is surprisingly quiet for me though, other than the "context canceled" errors that occur after my client gives up, and the response.WriteHeader messages in #2123. |
@imyxh Please follow the instructions a couple posts up and if you can supply profiles from the next time it happens, that’d be amazing. Deleting the entire JetStream folder is not ideal and doing so is a very good way for downstream components to get into an out-of-sync state with the roomserver, so I can’t recommend that as a fix. A much safer approach if absolutely necessary is to delete just the |
Whoops, skipped over that. Here they are! https://upload.disroot.org/r/9QJS70Hn#Kbx45aT6u79C2hAcB8D6ReproE37SPN1s6aZxQvD90U= |
@imyxh Thanks for these, the profiles are extremely useful. Can you please just confirm for me which commit ID of Dendrite you are running? I’m seeing a pattern in the goroutine trace — there are a few roomserver workers that are all blocked on the select query in Can you please also get a few more details for me:
Thanks! |
@imyxh Actually, looking more closely, I suspect your specific issue may have been fixed already in #2178 — it’s just that it hasn’t made its way into a release yet. I can see this because your goroutine profiles claim to be stuck in One way to find out is to update to commit 5106cc8 or anything on the |
I just tested from latest git main and indeed, there is no problem :P Thanks for all your work! |
@imyxh Glad to hear that’s helped — if you run into any more problems, please capture and chuck up some new profiles and we can look again. :-) You’ve also got headroom of 20 unused database connections so you could increase the roomserver’s |
Will there be a 0.6.4 release soon with the above mentioned PRs in it so we can switch from a custom commit back to the release channel? 🙂 |
@grisu48 Yep, sometime this week. |
Using commit 002429c, I am unable to sign in with new Element sessions. I see some rooms, but the data isn't in sync, and on existing sessions federated messages aren't working. |
@alistair23 What happens if you try to send outbound messages? |
If I send a message in an E2EE room on an existing session it seems to work, but the recipient can't decrypt it. On a new session it also seems to send, but it is sent unencrypted |
There have been a number of improvements in Dendrite 0.6.4, both for the original issue and for E2EE. Anyone who still has outstanding issues, please test on the latest version and let me know how you get on. |
Hi! I've updated now to dendrite 0.6.4 using
Not sure if this is related to this issue or if this is a new one though. It might be related to #2222. In the logs I can't find anything that really points to a cause. This is what I see when I send a message to a room on my instance:
I see a lot of those "Failed to retrieve any keys" warnings, but I don't know if they are related or not. |
After updating to 002429c, I wasn't receiving any messages. Deleting just the jetstream/$G/streams/DendriteInputRoomEvent directory didn't help; deleting the complete jetstream directory did. |
Continuing to experience this issue with docker 0.6.4
Got the former point, but the latter is iffy. Intra-instance is working for me, but federation is incredibly spotty; some messages may send but most do not. And the logs are quiet for me as well.
I also deleted the jetstream directory and it removed the
|
@ElDifinitivo Built-in NATS or a standalone NATS Server? |
@neilalexander Built-in monolith |
Any update? Now I'm seeing messages not going out |
I could finally solve this issue by deleting the old @alistair23 maybe that's also worth a shot for you? I just renamed the jetstream directory, and once everything worked, dumped it completely. |
I've had a similar experience. After upgrading to 0.6.5, JetStream broke: it was unable to parse something in that directory. Since I removed the directory, messages have nearly always shown up correctly for the past few days. However, #2142 (which I had thought was fixed, and then took to be a symptom of this issue) is back. |
Please open a new issue if this is still a problem. |
Background information
Description
Dendrite fails to receive new events for any room and fails to sync existing events to some clients as of version 0.6.2. In clients, this shows up as either frozen rooms and disconnection messages (Element) or an initial sync that never finishes (FluffyChat and Hydrogen). Rolling back to 0.6.0 resolves the issue.
The following logs may be relevant:
NOTE: user data from the logs has been stripped.