Truncating state #546
Comments
Hi, I have found a bug in offset encoding / decoding that can lead to those warnings under certain conditions. The warning is only shown on partition assignment, on the first poll; subsequent polls don't trigger this logic. The bug occurs when RunLength encoding is used and there are no incomplete offsets below the highest seen / polled offset at the time the partition is revoked and offsets are committed. The warning then reads Truncating state .... Bootstrap polled [OFFSET] but expected 1 from loaded commit data.... I am not sure why the other warning could be shown, since that particular bug wouldn't cause it: Truncating state ... Was expecting 1081799 but bootstrap poll was 1081798.... I will raise a PR to fix that bug shortly and will incorporate into it the PR raised for the logging change, which adds the Topic and Partition. |
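To make the edge case described above easier to picture, here is a minimal sketch of run-length encoding applied to per-offset completion flags. It is not the library's actual encoder: the class `RunLengthSketch`, its method names, and the convention that the first run always counts complete offsets are all assumptions made for illustration. The case mentioned in the comment, where no incomplete offsets remain below the highest seen offset, shows up here as a single run, so a decoder has to derive the next expected offset from the run length alone.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical names, not the library's encoder. */
class RunLengthSketch {

    /**
     * Encodes completion flags as alternating run lengths.
     * Assumption: the first run always counts complete offsets, so it may be 0.
     */
    static List<Integer> encode(boolean[] complete) {
        List<Integer> runs = new ArrayList<>();
        boolean current = true;
        int length = 0;
        for (boolean c : complete) {
            if (c == current) {
                length++;
            } else {
                runs.add(length);   // close the previous run
                current = c;
                length = 1;
            }
        }
        runs.add(length);           // close the final run
        return runs;
    }

    /** Returns the absolute offsets that the runs mark as incomplete. */
    static List<Long> decodeIncompletes(long baseOffset, List<Integer> runs) {
        List<Long> incompletes = new ArrayList<>();
        boolean current = true;
        long offset = baseOffset;
        for (int run : runs) {
            if (!current) {
                for (int i = 0; i < run; i++) {
                    incompletes.add(offset + i);
                }
            }
            offset += run;
            current = !current;
        }
        return incompletes;
    }

    /** Next offset expected after everything the runs describe. */
    static long nextExpectedOffset(long baseOffset, List<Integer> runs) {
        long offset = baseOffset;
        for (int run : runs) {
            offset += run;
        }
        return offset;
    }
}
```

With a base offset of 100 and flags `{true, true, true}`, the encoding is the single run `[3]`, there are no incompletes, and the next expected offset is 103. A decoder that mishandles this single-run case would come up with a different expected offset, which is the kind of mismatch the warning reports.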
Hi @rkolesnev , Thanks for looking into this. With regard to the warnings, I see that the following PR was included in the latest release: However, this only adds the partition, not the topic. And it only does so for one of the warnings, not for the "Bootstrap polled offset has been reset to an earlier offset" warning. This is fixed by: Did you have time to look into a fix for the bug in offset encoding / decoding? Thanks! |
Hi @lennehendrickx, Yes - I plan to raise a PR for it shortly (today - tomorrow). |
@lennehendrickx - could you try the latest snapshot build - with this PR merged to see if that fixes your issue? |
@rkolesnev Thanks. I have configured some of our services to use the |
@rkolesnev I have been monitoring the services that are using the SNAPSHOT version. We do not see the warning that starts with
We still regularly see the warning below:
Consumers are configured to reset to LATEST and these are topics that are used very frequently. Any idea what could be causing this? |
@lennehendrickx - nothing that I can point a finger at and say "ah, that might be the cause", so this needs further investigation. Can you see from the logs whether it happens on application startup or when an already running application rebalances? That warning is only possible on partition assignment, but that can be on an app instance that is being started or on an already running instance that gets rebalanced due to a new app instance joining / leaving the group. Does anything stand out in the logs? Is processing generally fast or slow? Do you have a rough idea how long it takes to process an event in the processing function? What is the approximate flow rate / throughput on the topic? How many partitions does the topic have? Is there any pattern to those warnings, i.e. do they only happen at high load / low load / when there is a slowdown in processing, or anything like that? I will see if I can set up a long-running soak test to reproduce this behaviour, but ideally I want to set it up with load / characteristics similar to your setup. |
Hi @rkolesnev , The warnings that I see are logged by consumers that are already running and are now assigned new partitions. Throughput varies. I have now deployed the SNAPSHOT version to our develop environment. Throughput is typically low here (some messages per minute). The changes are being promoted to staging and in one week to production. So I will monitor the behavior on the different environments. Some topics for which we see the warning have 10 partitions, some only 2.
I do see that there is often a difference of only 1 between the polled offset and the expected offset:
|
Hi @rkolesnev , The changes you have made are definitely already an improvement. This already eliminates the |
Hi @lennehendrickx - it is merged - #563 or do you mean a different PR? |
Hi @rkolesnev , My mistake, thanks for looking into this. Any idea when release 0.5.2.6 will be released? |
I don't think we have a concrete date in mind yet. We plan to release it once the metrics feature is finished. That will be the main driver for the release, at least per current plans. |
@rkolesnev We have seen the same error with version 0.5.2.7. Here is what we observed and the deployment info
We didn't see this issue in the previous version 0.5.2.4. Do you have any idea how to proceed? Thanks! |
More clues to the above issue: when we checked the offset status from the Kafka CLI, the values of "CURRENT-OFFSET" on some partitions were empty
|
We're experiencing this as well on 0.5.2.5. It sometimes happens on restarts where a partition is shuffled from one pod to another. Kafka client version 3.5.0. Don't know if it's the bump in Kafka version or PC version that has caused this (or both) 🤔 |
Ok, this looks weirder and weirder - I don't see how the committed offsets could get reset / removed from the offsets topic / log, as it looks like the offset marker is not present for those partitions at all. I will investigate, but it does not look like an easy issue to reproduce in a synthetic environment / test. This warning can only ever be logged on partition assignment, on the first poll after a new partition is assigned, so I am not 100% sure what you mean by "6. It happened on both new consumer group and existing consumer group". |
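As a reading aid for the two warning shapes quoted in this thread, here is a small sketch of the comparison made on the first poll after assignment: the offset expected from loaded commit data versus the offset actually returned by the bootstrap poll. All names (`BootstrapOffsetCheckSketch`, `Direction`, `classify`, `gap`) are hypothetical and are not the library's API. The sketch only classifies the direction and size of the mismatch and does not reproduce what the library does in response.

```java
/** Illustrative only: hypothetical names, not the library's API. */
class BootstrapOffsetCheckSketch {

    enum Direction { IN_SYNC, POLLED_AHEAD_OF_EXPECTED, POLLED_BEHIND_EXPECTED }

    /** Compares the offset expected from commit data with the first polled offset. */
    static Direction classify(long expectedFromCommitData, long bootstrapPolled) {
        if (bootstrapPolled == expectedFromCommitData) {
            return Direction.IN_SYNC;
        }
        return bootstrapPolled > expectedFromCommitData
                ? Direction.POLLED_AHEAD_OF_EXPECTED
                : Direction.POLLED_BEHIND_EXPECTED;
    }

    /** Absolute distance between the two offsets. */
    static long gap(long expectedFromCommitData, long bootstrapPolled) {
        return Math.abs(bootstrapPolled - expectedFromCommitData);
    }

    public static void main(String[] args) {
        // Values taken from the warning quoted earlier in the thread.
        long expected = 1081799L;
        long polled = 1081798L;
        System.out.println(classify(expected, polled) + ", gap=" + gap(expected, polled));
    }
}
```

With the numbers from the quoted warning (expected 1081799, polled 1081798), the poll is behind the expected offset by exactly one, which matches the off-by-one pattern reported in this thread.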
@rkolesnev My observation was that the offset polling from the broker sometimes was not synced to the internal Offset Map. So, it ended up truncating the state and resetting the offset. Interestingly, the offset polling from the broker was usually one less than the offset from the internal Offset Map in my case. Furthermore, I've done two experiments to see if it's relevant to the legacy consumer group. |
@LOG-INFO |
@colinkuo with the addition of metrics in the last release, the internal offset data is reported through metrics. Could you try to run it with metrics enabled and collected and see if there is anything useful? |
@rkolesnev Thanks for clarification :) |
@colinkuo @rkolesnev from my perspective, the most possible guess is the offset from See for bitSet. |
Hi @astubbs ,
We started using the parallel consumer recently for some of our services. Thanks for creating this great new consumer.
We sometimes see the following warnings in our logs and wonder what could be causing them.
Kafka Configuration
PC Configuration
Warning message
The expected value is also sometimes 0 instead of 1
We also see the following warning, which might be related
I have also created a PR to add the partition and topic to the truncate logging.
Do you know how we could explain these warnings?