Move candidate-validation on blocking tasks #3122

Merged


@alexggh alexggh commented Jan 30, 2024

Candidate validation performs a lot of CPU-bound operations on its main loop, such as:

validation_code.hash()
sp_maybe_compressed_blob::decompress(
		&validation_code.0,
		VALIDATION_CODE_BOMB_LIMIT,
)

sp_maybe_compressed_blob::decompress(&pov.block_data.0, POV_BOMB_LIMIT) 
let code_hash = sp_crypto_hashing::blake2_256(&code).into();

Added together, for a large PoV and large validation code these take on the order of tens of milliseconds. Because the operations are CPU-bound, they hog the executor thread and negatively affect the other subsystems sharing it, so it is better to move the whole subsystem onto the blocking pool and avoid such unexpected behaviour.

Note: in practice this subsystem does not have much work to do, so the impact is probably very low, but better safe than sorry.
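
The idea of running a whole message loop on its own thread, so that its CPU-bound steps cannot stall a shared async executor, can be sketched with the standard library alone. This is only an illustration under stated assumptions: `Request`, `cpu_bound_step` and `spawn_dedicated_loop` are hypothetical names, the FNV-1a hash is a stand-in for the real decompression and blake2 hashing, and the actual change relies on the node's own task spawner rather than `std::thread`.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical request: a payload plus a channel to send the reply on.
struct Request {
    payload: Vec<u8>,
    reply: Sender<u64>,
}

/// Stand-in for CPU-bound work (FNV-1a, not the hashing used by the node).
fn cpu_bound_step(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= u64::from(b);
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Spawns the whole message loop on a dedicated OS thread, so the CPU-bound
/// step never runs on the caller's thread.
fn spawn_dedicated_loop() -> (Sender<Request>, JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Request>();
    let handle = thread::spawn(move || {
        // The loop ends once every sender has been dropped.
        for req in rx {
            let digest = cpu_bound_step(&req.payload);
            // The requester may have gone away; ignoring the error is fine here.
            let _ = req.reply.send(digest);
        }
    });
    (tx, handle)
}
```

A caller sends a `Request` through the returned `Sender` and waits on its own reply channel; dropping the `Sender` shuts the loop down.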

@alexggh alexggh added R0-silent Changes should not be mentioned in any release notes T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Jan 30, 2024

@alindima alindima left a comment

I think this'll close: #599


alexggh commented Jan 30, 2024

> I think this'll close: #599

Yes, it will.

@sandreim sandreim added this pull request to the merge queue Jan 30, 2024
@alindima alindima linked an issue Jan 30, 2024 that may be closed by this pull request

bkchr commented Jan 30, 2024

If I understand this correctly, we move the entire subsystem to a blocking task, instead of doing it correctly and putting only the blocking operations into a blocking task? So the subsystem can block and bring down the entire node again because it thinks that the subsystem isn't answering requests?

Merged via the queue into master with commit ff2e7db Jan 30, 2024
129 of 131 checks passed
@sandreim sandreim deleted the alexaggh/feature/make_candidate_validation_blocking branch January 30, 2024 09:47

alexggh commented Jan 30, 2024

> If I understand this correctly, we move the entire subsystem to a blocking task? Instead of doing it correctly and only have the blocking operations being put into a blocking task? So, the subsystem can block and bring down the entire node again because it thinks that the subsystem isn't answering requests?

The subsystem doesn't do much else: it basically does all the CPU-intensive work and then calls into the pvf-worker.
Calling spawn_blocking for each message wouldn't bring us much value, and it has a big downside: the number of spawn_blocking threads is limited, so you need to be careful with how much you spawn before you hit https://docs.rs/tokio/latest/tokio/runtime/struct.Builder.html#method.max_blocking_threads and block (it is doable, but it is not a panacea).
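
The effect of a thread cap can be illustrated with a small standard-library sketch. This is not tokio's implementation; `peak_concurrency` is a hypothetical helper that models a pool of `cap` workers and sleeps to simulate blocking work. Once all workers are busy, further jobs simply wait in the queue.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Runs `jobs` short sleeping jobs on at most `cap` worker threads and returns
/// the highest number of jobs observed running at the same time.
fn peak_concurrency(cap: usize, jobs: usize) -> usize {
    let (tx, rx) = mpsc::channel::<Duration>();
    let rx = Arc::new(Mutex::new(rx));
    let running = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let workers: Vec<_> = (0..cap)
        .map(|_| {
            let (rx, running, peak) = (rx.clone(), running.clone(), peak.clone());
            thread::spawn(move || loop {
                // Hold the lock only while taking the next job off the queue.
                let job = rx.lock().unwrap().recv();
                let Ok(pause) = job else { break };
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(pause); // stand-in for blocking work
                running.fetch_sub(1, Ordering::SeqCst);
            })
        })
        .collect();

    for _ in 0..jobs {
        tx.send(Duration::from_millis(5)).unwrap();
    }
    drop(tx); // closing the queue lets idle workers exit
    for w in workers {
        w.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

However many jobs are submitted, the returned peak never exceeds `cap`, which is why spawning one blocking task per message can end up queuing behind other users of the same pool.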

> So, the subsystem can block and bring down the entire node again because it thinks that the subsystem isn't answering requests?

If by "again" you are referring to yesterday's problem on rococo, that was caused by https://github.com/paritytech/orchestra/pull/71 and is not related to this PR at all; this is just something I noticed while investigating it.

Now, if you think this subsystem could time out because it processes its messages too slowly on a single blocking thread, that's not really a concern for me. The subsystem sees around 6-7 messages per block and has a bounded queue of around 4096 messages. The blocking work takes at most tens of milliseconds and cannot take longer because of the MAX_POV and MAX_CODE size limits. The work is also multi-tasked internally in a FuturesUnordered queue, so we do not process each message fully serially. So I would say it is very unlikely to happen for this subsystem.
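
A rough back-of-envelope check of the numbers above can be written down as follows. The 50 ms per message and the 6 s block time are assumptions chosen for illustration ("tens of milliseconds" is not pinned down more precisely in this thread), and both function names are hypothetical.

```rust
/// Fraction of one thread's time spent on blocking work during one block.
fn thread_utilization(msgs_per_block: u32, work_ms_per_msg: u32, block_time_ms: u32) -> f64 {
    f64::from(msgs_per_block * work_ms_per_msg) / f64::from(block_time_ms)
}

/// How many blocks' worth of messages fit into the bounded queue
/// if nothing were processed at all.
fn blocks_of_backlog(queue_capacity: u32, msgs_per_block: u32) -> u32 {
    queue_capacity / msgs_per_block
}
```

Under those assumptions, `thread_utilization(7, 50, 6000)` comes out at roughly 6% of a single thread, and `blocks_of_backlog(4096, 7)` is 585 blocks, so the queue would take a long stall to fill.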


bkchr commented Jan 30, 2024

> If by again you are referring to the problem on rococo of yesterday

I refer to this: #1730

I have just seen the new discussion that started around removing the timeout stuff.


alexggh commented Jan 30, 2024

>> If by again you are referring to the problem on rococo of yesterday
>
> I refer to this: #1730
>
> I just have seen the new discussion started around removing the timeout stuff.

Yeah, that is definitely different from what this PR addresses and should be fixed by removing the timeout.

Successfully merging this pull request may close these issues.

Move ZSTD PVF decompression to a blocking task