Move candidate-validation on blocking tasks #3122
Signed-off-by: Alexandru Gheorghe <[email protected]>
I think this'll close: #599
Yes, it will.
If I understand this correctly, we move the entire subsystem to a blocking task, instead of doing it correctly and putting only the blocking operations into a blocking task? So the subsystem can block and bring down the entire node again because it thinks that the subsystem isn't answering requests?
The subsystem doesn't do much else; it basically does all the CPU-intensive work and then calls into the pvf-worker.
If by "again" you are referring to yesterday's problem on rococo, that was caused by this. Now, if you think that this subsystem could time out because it processes its messages too slowly on a single blocking thread, that's not really a concern for me. This subsystem has a volume of around 6-7 messages per block and a bounded queue of around 4096 messages. The blocking work takes at most tens of milliseconds, and it can't take longer because of the MAX_POV and MAX_CODE size limits. The work is also multi-tasked internally in a FuturesUnordered queue, so it is not as if we process each message fully serially. So I would say a timeout is very unlikely for this subsystem.
I am referring to this: #1730. I have just seen the new discussion that started around removing the timeout stuff.
Yeah, that is definitely different from what this addresses and should be fixed by removing the timeout.
Candidate validation has a lot of CPU-bound operations on its main loop, like:
When you add all that up for large POV and CODE, it takes on the order of tens of milliseconds, and because these are CPU-bound operations they hog the executor thread and negatively affect the other subsystems around it. It is therefore better to just move the subsystem onto the blocking pool to make sure such unexpected behaviour is avoided.
Note! In practice this subsystem does not have a large amount of work to do, so the impact is probably really low, but better safe than sorry.
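
To illustrate the general idea of keeping CPU-bound work off shared executor threads, here is a minimal, std-only sketch. It is not the code in this PR: `ValidationRequest`, `cpu_bound_work`, `spawn_blocking_subsystem` and `validate` are hypothetical names, the hash loop merely stands in for the real CPU-bound steps, and a plain OS thread stands in for the blocking pool. The queue capacity is passed in by the caller (the discussion above mentions a bounded queue of around 4096 messages).

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};
use std::thread::{self, JoinHandle};

/// Hypothetical request type: a payload plus a channel for the answer.
struct ValidationRequest {
    pov: Vec<u8>,
    reply: SyncSender<u64>,
}

/// Stand-in for the CPU-bound steps (assumption: an FNV-1a hash over the payload).
fn cpu_bound_work(pov: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for byte in pov {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Runs the whole message loop on a dedicated thread behind a bounded queue,
/// so slow messages never occupy a thread shared with other tasks.
/// The join handle yields the number of processed messages.
fn spawn_blocking_subsystem(
    capacity: usize,
) -> (SyncSender<ValidationRequest>, JoinHandle<usize>) {
    let (tx, rx): (SyncSender<ValidationRequest>, Receiver<ValidationRequest>) =
        sync_channel(capacity);
    let handle = thread::spawn(move || {
        let mut processed = 0;
        // The loop ends once every sender has been dropped.
        for request in rx {
            let result = cpu_bound_work(&request.pov);
            // The requester may have gone away; ignore a failed reply.
            let _ = request.reply.send(result);
            processed += 1;
        }
        processed
    });
    (tx, handle)
}

/// Sends one request and waits for its answer.
fn validate(tx: &SyncSender<ValidationRequest>, pov: Vec<u8>) -> u64 {
    let (reply, response) = sync_channel(1);
    tx.send(ValidationRequest { pov, reply })
        .expect("worker thread is alive");
    response.recv().expect("worker replies before exiting")
}
```

A caller would create the worker once with `spawn_blocking_subsystem(4096)` and then call `validate(&tx, pov)` per request. In real async code the caller would await the reply on an async oneshot channel rather than block in `recv`; the blocking call here only keeps the sketch free of external crates.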