Skip to content
This repository has been archived by the owner on Nov 25, 2024. It is now read-only.

Dendrite 0.6.2 fails to sync/federate #2150

Closed
Undef-a opened this issue Feb 6, 2022 · 40 comments
Closed

Dendrite 0.6.2 fails to sync/federate #2150

Undef-a opened this issue Feb 6, 2022 · 40 comments
Labels
T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. urgent

Comments

@Undef-a
Copy link

Undef-a commented Feb 6, 2022

Background information

Dendrite version or git SHA: v0.6.2 (last known good was 0.6.0)
Monolith or Polylith?: monolith
SQLite3 or Postgres?: postgresql
Running in Docker?: no
go version: 1.16.13 and 1.18beta1
Client used (if applicable): app.element.io, Hydrogen and Fluffychat

Description

Dendrite fails to receive new events for any room and fails to sync existing events to some clients as of version 0.6.2. In the clients, is shown by either frozen rooms and disconnection messages (element) or never finishing the initial sync (Fluffychat and Hydrogen). Rolling back to 0.6.0 resolves the issue.

The following logs may be relevant:

level=debug msg="Transaction: Failed to query room version for room!" error="context canceled" req.id= req.method=PUT req.path=/_matrix/federation/v1/send/<event 1>
level=debug msg="Transaction: Failed to parse event JSON of event {\"auth_events\":[\"\",\"\",\"\"],\"content\":{\"algorithm\":\"m.megolm.v1.aes-sha2\",\"ciphertext\":\"\",\"device_id\":\"\",\"sender_key\":\"\",\"session_id\":\"\"},\"depth\":20,\"hashes\":{\"sha256\":\"h0hdF/eV9J+\"},\"origin\":\"matrix.org\",\"origin_server_ts\":,\"prev_events\":[\"$\"],\"prev_state\":[],\"room_id\":\"!:.\",\"sender\":\"@\",\"signatures\":{\"matrix.org\":{\"ed25519:a_RXGa\":\"h2jmBBg\"}},\"type\":\"m.room.encrypted\",\"unsigned\":{\"age_ts\":}}" error="gomatrixserverlib: unsupported room version ''" req.id=KzqABiJjMlbU req.method=PUT req.path=/_matrix/federation/v1/send/<event 1>
http: superfluous response.WriteHeader call from github.com/prometheus/client_golang/prometheus/promhttp.(*responseWriterDelegator).WriteHeader (delegator.go:65)

NOTE: user data from the logs has been stripped.

@neilalexander
Copy link
Contributor

neilalexander commented Feb 8, 2022

Can you please see if this behaviour is any better as of commit a2b4777?

@grisu48
Copy link

grisu48 commented Feb 9, 2022

Hi, I'm seeing the same issue. I can send messages within the instance, but anything from outside doesn't come through.

I'm seeing a lot of the following errors in my log:

time="2022-02-09T07:51:35.290943700Z" level=error msg="failed to query device keys for some users" context=missing failed=1 server_name=libre.brussels total=1 wait=1h0m0s
time="2022-02-09T07:51:45.290814047Z" level=error msg="failed to query device keys for some users" context=missing failed=1 server_name=jloa.ovh total=1 wait=1h0m0s
time="2022-02-09T07:51:45.348646367Z" level=error msg="failed to query device keys for some users" context=missing failed=1 server_name=thebrown.men total=1 wait=1h0m0s
time="2022-02-09T07:51:45.382300790Z" level=error msg="failed to query device keys for some users" context=missing failed=2 server_name=chatserver.ca total=2 wait=1h0m0s
time="2022-02-09T07:51:55.292139584Z" level=error msg="failed to query device keys for some users" context=missing failed=1 server_name=bergercloud.eu total=1 wait=1h0m0s

@neilalexander I tried to build the container from commit a2b4777 however the issue persists and the above stated error messages are still flowing.

@Undef-a
Copy link
Author

Undef-a commented Feb 9, 2022

with a2b4777 federation wasn't perfect, but was far better than 0.6.2 (where just about everything was locked up).

@grisu48
Copy link

grisu48 commented Feb 10, 2022

I just updated to 0.6.3 and federation is still broken. When grepping for level=error in the docker container output, I still see a lot of "failed to query device keys for some users" entries, same as above. The instance was working just fine prior to updating to 0.6.0, now internal messaging works but federation is still broken.

@stintel
Copy link

stintel commented Feb 11, 2022

Broken for me as well in 0.6.3. Downgrading to 0.6.0 doesn't help. Downgrading further seems to require a database rollback, which is not an option as these backups have expired already.

@neilalexander
Copy link
Contributor

Can you please try the latest main commit 5106cc8 and report back on if it helps or not?

@grisu48
Copy link

grisu48 commented Feb 12, 2022

I tried a new docker container with 5106cc8 and the issue is still present.

In addition now also the inter-instance communication is broken. Also: when logging in via my Android phone the app was performing a initial sync, despite me being logged in before.

When grepping for level=error I still see a lot of the "failed to query device keys for some user context=missing" errors.

I'm attaching here the output of docker logs dendrite_monolith_1 in case you find this useful: dendrite-logs.txt

@Undef-a
Copy link
Author

Undef-a commented Feb 14, 2022

Using 5106cc8 helps a little. Now rooms all correctly sync up immediately after restarting dendrite. Over the next 10-30 minutes the problem rooms will start to drop events.

@stintel
Copy link

stintel commented Feb 14, 2022

Unfortunately 5106cc8 doesn't help at all here. I have not a single message after February 8 in any of the rooms I'm in. At that time I was running 0.6 since January 29. So for me the problem started with 0.6, but only ~10 days after upgrading to it.

@grisu48
Copy link

grisu48 commented Feb 16, 2022

Short update: 5106cc8 seems to help. On Sunday after deploying my instance was still silent, however today I see that some of the noisier channels are finally filling again with messages from Sunday onwards. I still see a message hole between the date of updating to 0.6 and Sunday.
Also intra-instance communications works again.

I keep monitoring this issue, but I'm mildly optimistic that 5106cc8 might help to resolve the issue or at least help.

@alistair23
Copy link

I am having the same issue, lots of failed to query device keys for some users in the logs. Using 5a39512 is no help

@neilalexander
Copy link
Contributor

FWIW failed to query device keys for some users is specific to E2EE and ultimately a separate issue to failing to federate in public rooms.

Out of curiosity, are you all running the internal NATS deployment built into Dendrite or standalone NATS Server? If any of you are running a standalone NATS Server, which options are you running with?

@alistair23
Copy link

I deleted the jetstream dir (#2181) and now it appears to be working after waiting awhile. I also bumped my max connections up to try and avoid #2173

I'm using the NATS build into Dendrite

@grisu48
Copy link

grisu48 commented Feb 17, 2022

I'm also using the NATS build directly into Dendrite. I am using the same config as provided in the project yaml example:

  # Configuration for NATS JetStream
  jetstream:
    # A list of NATS Server addresses to connect to. If none are specified, an
    # internal NATS server will be started automatically when running Dendrite
    # in monolith mode. It is required to specify the address of at least one
    # NATS Server node if running in polylith mode.
    #addresses:
    #  - jetstream:4222

    # Keep all NATS streams in memory, rather than persisting it to the storage
    # path below. This option is present primarily for integration testing and
    # should not be used on a real world Dendrite deployment.
    in_memory: false

    # Persistent directory to store JetStream streams in. This directory 
    # should be preserved across Dendrite restarts.
    storage_path: ./

    # The prefix to use for stream names for this homeserver - really only
    # useful if running more than one Dendrite on the same NATS deployment.
    topic_prefix: Dendrite

  # Configuration for Prometheus metric collection.

@grisu48
Copy link

grisu48 commented Feb 17, 2022

@neilalexander Do you want me to file another issue for the failed to query device keys for some users? I did a quick search in the issues list but could not find anything matching for it.

@stintel
Copy link

stintel commented Feb 17, 2022

I deleted the jetstream dir (#2181) and now it appears to be working after waiting awhile.

Thanks, that seems to have helped. Also using built-in.

@neilalexander
Copy link
Contributor

neilalexander commented Feb 17, 2022

OK, so to understand what's really going on, I could use a goroutine trace and a profile from Dendrites that are experiencing these issues.

To do this, you need to start Dendrite with the PPROFLISTEN=localhost:65432 environment variable set and then leave it running with that so that the profiler is accessible when the problem occurs. You should see a level=warning msg="Starting pprof on ..." line at startup if the profiler starts successfully.

Then the next time you run into problems, capture the following profiles:

curl -O http://localhost:65432/debug/pprof/goroutine?debug=1
curl -O http://localhost:65432/debug/pprof/goroutine?debug=2
curl -O http://localhost:65432/debug/pprof/profile

... and then upload all three files along with the commit ID that you are running — they don't contain configuration or anything sensitive (apart from possibly the folder names that Dendrite was built in) so should be safe to share. The two goroutine profiles should download pretty much instantly, the profile one will take 30 seconds to complete.

@neilalexander
Copy link
Contributor

@grisu48

Do you want me to file another issue for the failed to query device keys for some users? I did a quick search in the issues list but could not find anything matching for it.

A lot of those issues will be genuine connection errors or bad keys so I wouldn't worry about those log lines unless you are having problems with E2EE specifically — in that case best to open a separate issue.

@imyxh
Copy link

imyxh commented Feb 18, 2022

I think I'm seeing this too. Deleting the jetstream folder and restarting dendrite does fix it but only temporarily and after a while my Element will stop connecting properly again. The log is surprisingly quiet for me though, other than the "context canceled" errors that occur after my client gives up, and the response.WriteHeader messages in #2123.

@neilalexander
Copy link
Contributor

neilalexander commented Feb 18, 2022

@imyxh Please follow the instructions a couple posts up and if you can supply profiles from the next time it happens, that’d be amazing.

Deleting the entire JetStream folder is not ideal and doing so is a very good way for downstream components to get in an out-of-sync state with the roomserver, so I can’t recommend that as a fix. A much much safer approach if absolutely necessary is to delete just the jetstream/$G/streams/DendriteInputRoomEvent sub-folder only rather than the whole JetStream folder.

@imyxh
Copy link

imyxh commented Feb 18, 2022

@neilalexander
Copy link
Contributor

neilalexander commented Feb 18, 2022

@imyxh Thanks for these, the profiles are extremely useful. Can you please just confirm for me which commit ID of Dendrite you are running?

I’m seeing a pattern in the goroutine trace — there are a few roomserver workers that are all blocked on the select query in BulkSelectEventID, and then there are a few other roomserver workers waiting to obtain a database connection, likely blocked behind the others.

Can you please also get a few more details for me:

  • the result of the SHOW max_connections; query against your database
  • the value you have for max_open_conns in the room_server block of your Dendrite config file
  • the sum total value of all of the max_open_conns lines across your entire Dendrite config file in all sections
  • whether all of your components are sharing one PostgreSQL database, or if they are using their own separate databases

Thanks!

@neilalexander
Copy link
Contributor

@imyxh Actually, looking more closely, I suspect your specific issue may have been fixed already in #2178 — it’s just that it hasn’t made its way into a release yet. I can see this because your goroutine profiles claim to be stuck in QueryMissingAuthPrevEvents which is an API call that no longer exists after that PR.

One way to find out is to update to commit 5106cc8 or anything on the main branch more recent than that — latest commit would do — and then re-test and check the above details.

@imyxh
Copy link

imyxh commented Feb 18, 2022

@neilalexander

  • I'm running release 0.6.3, so I guess that's commit 9ac27ca.
  • roomserver max_open_conns: 10
  • total max_open_conns: 10 + 10 + 10 + 5 + 5 + 10 + 10 + 10 + 10 = 80
  • all of the components are sharing one PostgreSQL database
  • from psql: max_connections = 100

I just tested from latest git main and indeed, there is no problem :P

Thanks for all your work!

@neilalexander
Copy link
Contributor

@imyxh Glad to hear that’s helped — if you run into any more problems, please capture and chuck up some new profiles and we can look again. :-)

You’ve also got headroom of 20 unused database connections so you could increase the roomserver’s max_open_conns by another 10 or 15 to make better use of those resources. That should help with processing data for more rooms in parallel if you aren’t already limited by CPU or RAM.

@grisu48
Copy link

grisu48 commented Feb 20, 2022

Will there be a 0.6.4 release soon with the above mentioned PRs in it so we can switch from a custom commit back to the release channel? 🙂

@neilalexander
Copy link
Contributor

@grisu48 Yep, sometime this week.

@alistair23
Copy link

Using commit 002429c I am unable to sign in with new element sessions, I see some rooms but the data isn't in sync and on existing sessions Federated messages aren't working.

Output.zip

@neilalexander
Copy link
Contributor

@alistair23 What happens if you try to send outbound messages?

@alistair23
Copy link

If I send a message in an E2EE room on an existing session it seems to work, but the recipient can't decrypt it. On a new session it also seems to send, but it is sent unencrypted

@neilalexander
Copy link
Contributor

There have been a number of improvements in Dendrite 0.6.4 both for the original issue and for E2EE, anyone who is having outstanding issues please test on the latest version and let me know how you get on.

@grisu48
Copy link

grisu48 commented Feb 23, 2022

Hi! I've updated now to dendrite 0.6.4 using docker-compose now and encountered the following issues:

  • I got signed out from all of my devices. Needed to use the backup keys to log in again
  • I can not receive or send any messages, even intra-instance communication is not working

Not sure if this is related to this issue or if this is a new one though. It might be related to #2222.

In the logs I don't find anything really pointing at something, This is when I send a message to a room on my instance:

time="2022-02-23T20:12:28.650696285Z" level=warning msg="Failed to retrieve any keys" context=missing fetcher=DirectKeyFetcher
time="2022-02-23T20:12:29.269011299Z" level=error msg="Failed to fetch key for server" context=missing error="Post \"matrix://mpserver.pro/_matrix/key/v2/query\": lookup failed for \"mpserver.pro\"" fetcher=DirectKeyFetcher
time="2022-02-23T20:12:29.269078681Z" level=warning msg="Failed to retrieve any keys" context=missing fetcher=DirectKeyFetcher
time="2022-02-23T20:12:38.409693783Z" level=info msg="Sent event to roomserver" event_id="$BjG2vabZL7XgR51abnnncg5azDoWtjqZ5KuzH6Z7TfY" req.id=n8re6Vf0KsYa req.method=PUT req.path="/_matrix/client/r0/rooms/!ayPKSV6hofj2uZxI:matrix.feldspaten.org/send/m.room.encrypted/m1645647158230.30" room_id="!ayPKSV6hofj2uZxI:matrix.feldspaten.org" room_version=6 user_id="@phoenix:matrix.feldspaten.org"

I see a lot of those "Failed to retrieve any keys" warnings, but I don't know if they are related or not.

@stintel
Copy link

stintel commented Feb 23, 2022

After updating to 002429c, I wasn't receiving any messages. Deleting just the jetstream/$G/streams/DendriteInputRoomEvent directory didn't help, deleting the complete jetstream directory did.

@ElDifinitivo
Copy link

Continuing to experience this issue with docker 0.6.4

Hi! I've updated now to dendrite 0.6.4 using docker-compose now and encountered the following issues:

  • I got signed out from all of my devices. Needed to use the backup keys to log in again
  • I can not receive or send any messages, even intra-instance communication is not working

Got the former point, but the latter is iffy. Intra-instance is working for me, but federation is incredibly spotty; some messages may send but most do not. And the logs are quiet for me as well.

Deleting just the jetstream/$G/streams/DendriteInputRoomEvent directory didn't help, deleting the complete jetstream directory did.

I also deleted the jetstream directory and it removed the context cancelled errors I had been receiving, but it doesn't do anything to help the federation. I've also been getting big memory increases (go1.17.7) that I hadn't been before, from a fresh server boot startup at 30-40MB constantly increasing to 1GB+ (2-user instance) until the machine becomes unresponsive. Tried it with GODEBUG=madvdontneed=1 and it doesn't change anything. It takes about an hour, but this is always where I end up:

monolith_1  | [1] [INF] pipe - cid:47 - "v1.13.0:go" - Slow Consumer Detected: WriteDeadline of 10s exceeded with 1 chunks of 158 total bytes.
monolith_1  | [1] [WRN] pipe - cid:51 - "v1.13.0:go" - Readloop processing time: 23.583038563s
monolith_1  | [1] [WRN] pipe - cid:54 - "v1.13.0:go" - Readloop processing time: 2m12.323340906s
monolith_1  | [1] [WRN] pipe - cid:50 - "v1.13.0:go" - Readloop processing time: 5m20.187121708s
monolith_1  | [1] [WRN] pipe - cid:42 - "v1.13.0:go" - Readloop processing time: 2m7.013070596s
monolith_1  | [1] [WRN] pipe - cid:53 - "v1.13.0:go" - Readloop processing time: 3m2.370674975s
monolith_1  | time="2022-02-24T18:29:02.139386258Z" level=fatal msg="nats: timeout" subject=DendriteOutputRoomEvent

@neilalexander
Copy link
Contributor

@ElDifinitivo Built-in NATS or a standalone NATS Server?

@ElDifinitivo
Copy link

@neilalexander Built-in monolith

@alistair23
Copy link

Any update? Now I'm seeing messages not going out

@grisu48
Copy link

grisu48 commented Mar 7, 2022

I could finally solve this issue by deleting the old jetstream directory. Before I had still some ongoing issues with messages not going out and not being received. Now dendrite 0.6.5 is rocking again since Saturday 🙂

@alistair23 maybe that's also worth a shot for you? I just renamed the jetstream directory, and once everything worked, dumped it completely.

@Undef-a
Copy link
Author

Undef-a commented Mar 7, 2022

I've also had a similar experience. Upgrading to 0.6.5 jetstream broke, unable to parse something in that directory. After removing the directory messages are nearly always showing up correctly for the past few days.

However, #2142 (which I had thought fixed, then a symptom of this issue) is back.

@kegsay kegsay added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. and removed bug labels Dec 5, 2022
@kegsay
Copy link
Member

kegsay commented Dec 6, 2022

Please open a new issue if this is still a problem.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. urgent
Projects
None yet
Development

No branches or pull requests

8 participants