Crash - "Signal channel is terminated and empty." #1730
Comments
Yeah, it should have been fixed quite some time ago: paritytech/polkadot#6656. So maybe we are seeing some other cause here. CC @ordian |
I have the same issue running a zeitgeist node v0.4.0 and v0.3.11.
|
Probably another reason indeed since the node doesn't seem to be in major sync:
Other than surprisingly high amount of forks on relay chain, there's nothing suspicious in the logs. I'm not sure about quite a few
but it's probably unrelated. |
We are experiencing the same issue synching a Centrifuge fullnode from scratch (Centrifuge client v0.10.34 which uses Polkadot v0.9.38). We are neither using Logs
|
@wischli this Polkadot version is quite old and we already have fixed some of these issues in later releases. |
Thanks for the quick response. That's what I figured. I am aware we are lagging behind, which should be resolved soon. Can we expect the issue to be fixed with Polkadot v0.9.43 or does it require at least v1.0.0? |
0.9.43 is still affected as far as I can tell |
@crystalin how reproducible is this for you? |
It seems someone is able to reproduce it without too much difficulty @bkchr : moonbeam-foundation/moonbeam#2540 |
@ordian can you please look into it? |
The linked issue moonbeam-foundation/moonbeam#2540 appears to be different: Regarding #1730 (comment), this indeed should have been fixed, as it was caused by unnecessary processing during major sync. #1730 (comment) Zeitgeist seems to be based on polkadot-v0.9.38, so it is likely the same major-sync issue that was fixed later. Re the original post, it doesn't seem to be easily reproducible according to moonbeam-foundation/moonbeam#2502 (comment). I'll try to repro. In general, this error ( For One dirty quickfix could be patching In order to understand where this slowness is coming from, maybe we could use |
moonbeam-foundation/moonbeam#2540 (comment) this seems interesting but then why isn't it runtime api subsystem being stalled 🤔 |
Also, just a reminder, that even 0.9.43 supports running collator with minimal relay chain client: https://github.com/paritytech/cumulus/blob/9cb14fe3ceec578ccfc4e797c4d9b9466931b711/client/service/src/lib.rs#L270 which doesn't even have av-store subsystem. |
Hello, we have had the same issue for some months:
What can we do? |
@ordian is there any further progress we can make here? The team have contacted me saying this is still an issue and their RPC is resetting daily. Thanks in advance. |
@helloitsbirdo we need more logs. At least 10 min before the restart would be good. |
@ordian hello, how are you? 21 minutes ago we had a new reboot:
10 minutes before the reboot we have logs like the following:
Let me know what else I can give you to help. Thank you |
Thanks everyone for the logs. I've asked someone from the team to take a look at the issue. It seems to be happening more on collators than validators. While it's being investigated, for collators specifically, we have a workaround mentioned here. @skunert do we have a guide for collators running with minimal relay chain node? I guess that requires ideally running a separate relay chain rpc node locally and specifying the |
The flags that we are running with are:
|
The cumulus readme contains a description of the setup: https://github.com/paritytech/polkadot-sdk/tree/master/cumulus#external-relay-chain-node . At this point, I think we don't even need to recommend running the relay chain on the same machine anymore; we have seen multiple setups that work just fine connecting to some self-hosted machine in the network. Using the minimal overseer should for sure be possible for the collators, but has not been implemented because it was not a priority. I think it is also appealing that the standard embedded mode just spawns an off-the-shelf full node without much modification. |
@Ciejo please provide the full logs and not just an excerpt. So, from 10 minutes before the restart until the restart, all the logs. |
I'd say a quickfix would be to modify the overseer gen for collator here:
to be that of the minimal relay node
that doesn't contain av-store, dispute-coordinator, etc. |
rpc-01-logs.json |
This bug has now existed for 2+ years and we haven't fixed it. I'm getting a little bit fed up with this situation. Maybe instead of trying more and more band-aids, we can finally get down to it and fix it? |
@Ciejo which polkadot-sdk version is being used by this composable node you are running there? |
I don't disagree that the underlying root cause needs to be investigated. But to me it seems there are 2 different issues: one for validators, one for collators. The overseer/subsystems are designed mainly with validators in mind, so my point here is that instead of trying to make them work well on collators, we should not run them on collators in the first place. The issue for validators definitely deserves a proper fix (and repro). I think the whole system with timeouts is a band-aid by itself and IIRC was implemented there as a poor man's detection of deadlocks between subsystems, and in that case it makes sense to shut down. However, I don't think it's unreasonable that sometimes subsystems actually take a long time to process messages (esp. if run in a VM that shares CPU resources), so I would question this mechanism in the first place. If it is caused by an actual deadlock, that definitely needs to be fixed along with better prevention mechanisms. |
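To make the trade-off in the comment above concrete, here is a minimal sketch of timeout-based stall detection on a bounded signal channel. It is not the overseer's code (which is async); it uses only the Rust standard library, and `SignalError`, `send_signal_with_timeout` and `demo_stalled_receiver` are hypothetical names invented for this illustration. The sender cannot tell a deadlocked receiver from one that is merely slow: both show up as a full channel when the deadline passes.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::thread;
use std::time::{Duration, Instant};

/// Why a signal could not be delivered (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum SignalError {
    /// The receiver did not make room in the channel before the deadline.
    Stalled,
    /// The receiving side was dropped.
    Terminated,
}

/// Try to push `msg` into a bounded channel, giving up after `timeout`.
/// A full channel at the deadline is reported as `Stalled`, whether the
/// receiver is deadlocked or just slow.
pub fn send_signal_with_timeout<T>(
    tx: &SyncSender<T>,
    mut msg: T,
    timeout: Duration,
) -> Result<(), SignalError> {
    let deadline = Instant::now() + timeout;
    loop {
        match tx.try_send(msg) {
            Ok(()) => return Ok(()),
            Err(TrySendError::Disconnected(_)) => return Err(SignalError::Terminated),
            Err(TrySendError::Full(returned)) => {
                if Instant::now() >= deadline {
                    return Err(SignalError::Stalled);
                }
                // Take the message back and retry shortly.
                msg = returned;
                thread::sleep(Duration::from_millis(1));
            }
        }
    }
}

/// A receiver that never drains its channel: the first signal fits into the
/// buffer of size 1, the second one times out.
pub fn demo_stalled_receiver() -> Result<(), SignalError> {
    let (tx, _rx) = sync_channel::<u32>(1);
    send_signal_with_timeout(&tx, 1, Duration::from_millis(20))?;
    send_signal_with_timeout(&tx, 2, Duration::from_millis(20))
}
```

In this sketch, a receiver that is only slow (say, on a VM with shared CPU) would be reported as `Stalled` whenever the timeout is shorter than its pause, which is the concern raised above about shutting the node down on a timeout.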
Hello, we are running: v0.9.43 |
Okay, that is quite old. Please upgrade to a newer version. |
Currently, collators and their alongside nodes spin up a full-scale overseer running a bunch of subsystems that are not needed if the node is not a validator. That was considered to be harmless; however, we've got problems with unused subsystems getting stalled for a reason not currently known, resulting in the overseer exiting and bringing down the whole node. This PR aims to only run needed subsystems on such nodes, replacing the rest with `DummySubsystem`. It also enables collator-optimized availability recovery subsystem implementation. Partially solves #1730.
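The idea of swapping unneeded subsystems for a no-op can be sketched with a toy model. This does not use the real overseer API; apart from the name `DummySubsystem`, which the PR description mentions, the trait, the `ToyValidatorOnlySubsystem` type and `build_subsystem` are hypothetical stand-ins chosen for this example.

```rust
/// Toy stand-in for a subsystem: handles a message and reports how many
/// units of work it performed (hypothetical trait, not the overseer's).
pub trait Subsystem {
    fn handle(&mut self, msg: &str) -> usize;
}

/// A subsystem that does real work and is only useful on validators.
pub struct ToyValidatorOnlySubsystem {
    pub processed: Vec<String>,
}

impl Subsystem for ToyValidatorOnlySubsystem {
    fn handle(&mut self, msg: &str) -> usize {
        self.processed.push(msg.to_string());
        1
    }
}

/// A no-op replacement: it accepts every message and drops it immediately,
/// so it has no backlog to fall behind on.
pub struct DummySubsystem;

impl Subsystem for DummySubsystem {
    fn handle(&mut self, _msg: &str) -> usize {
        0
    }
}

/// Pick the real subsystem only where it is needed.
pub fn build_subsystem(is_validator: bool) -> Box<dyn Subsystem> {
    if is_validator {
        Box::new(ToyValidatorOnlySubsystem { processed: Vec::new() })
    } else {
        Box::new(DummySubsystem)
    }
}
```

With this shape, `build_subsystem(false).handle("msg")` performs no work, while the validator variant records the message. The slot in the overseer stays filled either way, so senders need no special-casing.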
I believe this is fixed by #3061 with tickets existing for further improvements. Feel free to re-open, if I got this wrong. |
Description of bug
Moonbeam nodes v0.33.0 (based on polkadot v0.9.43) are crashing with the following Relaychain error.
This happens on multiple relay chain networks (Kusama, Polkadot...)
Crash logs: moonbase_crash.log
Steps to reproduce
Running Moonbase alphanet (Moonbeam testnet) node: