-
Notifications
You must be signed in to change notification settings - Fork 138
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature suggestion: Minimum batch size + batch max wait time #560
Comments
Sounds like a worthy feature to have! :) |
I am working through the PR review on it - but i will wait until @astubbs has a look as well. I am wondering on a concept a little bit though - what benefit is in enforcing min batch size - is it purely slight optimization for processing function / downstream resource use? Generally batching is self balancing - if downstream processing is faster than consumption from kafka / flow rate - batching is not adding any advantage from purely processing speed, once flow rate exceeds time taken to process individual messages - batching starts to kick in. If processing explicitly depends on messages being batched - probably best bet to add small KStream / KSQL job on incoming topic to aggregate messages into batches (windowed etc). From implementation side - is minimum batch size adding any value over just Linger ? similar to Kafka producer behaviour - i.e. keep current batch size and add linger (i.e. timeout for waiting to fill the batch)? |
Hi @rkolesnev can you elaborate how can I "keep current batch size and add linger (i.e. timeout for waiting to fill the batch)" I thought linger is only in the producer side? As for best bet adding KStream / KSQL, if I understand correctly it means adding another hop to kafka which is more expensive. I totally agree that adding min batch size makes the library more complex and might harm performance. |
a) The point on KStream / KSQL is about guarantees - the batching mechanism that exists in Parallel Consumer and this proposed change - are more about best effort optimisation. KStream / KSQL pre-processing to batch data give better control and more guarantees if downstream processing depends on data being batched - things like late arriving data, re-ordering etc would be better taken care of with KStreams / KSQL and windowing - but that is more flow architecture discussion. b) For the Linger point - yes it is a concept in Kafka Producer and in Kafka Producer it controls data batching for Produce requests - which sounds similar to what this change is doing (granted its batching for processing function instead of produce). What i am asking - would adding Linger (or in other words batch timeout) without adding minBatchSize be enough? So on work acquisition does 2 checks
c) I guess batching only makes sense for Unordered and Key ordered processing - for Partition ordering it would only make sense for high number of partitions processed by single PC instance - do you agree? Should check be added to warn if this behaviour is enabled for Partition ordered processing or check how many partitions are subscribed to against min batch size? |
Hi @rkolesnev, Just to give an example, we are using this library to write both to s3 and RDS and later sending notification back to kafka. We would like to have some kind of batching when writing s3. Now you may say that I can use Kafka connect instead with 2 writes, the problem with that is the latency and pricing. So the PC becomes very helpful, but I can not have a real batching if I want to.
|
Hi @Ehud-Lev-Forter,
I have left review comment on PR - general batching flow logic looks ok to me, but i would really like to see more tests there covering different usage modes - Transactional consume + produce, different ordering guarantees, non-reactor usage of PC - the change touches logic quite deep in the code and has a potential for edge case issues - the additional tests should help to surface them and establish regression protection going forward for that new batching feature as well. |
Hi Team, Does it make sense to support
minimumBatchSize
withmaxTimeToWait
in Parallel consumer? The idea is that I want to use the PC as a micro batch consumer, but since my consumers are faster than my stream I get 1 event most of the time, I do get larger batches but only when system is in lag. Seems that I am not the only one that consider that: seeWhat do you think about this idea?
Is this something you don't want to support? or it is just not the focus right now?
The text was updated successfully, but these errors were encountered: