Containers drop off bridge networks unexpectedly #258
Comments
[klutchell] This issue has attached support thread https://jel.ly.fish/a8eee5b3-4cdd-47fd-aaf2-90443a47f2ab |
[klutchell] This issue has attached support thread https://jel.ly.fish/d530e774-0d99-4a90-9c6a-d8646495246b |
I know this is the engine repo and not the Supervisor repo, but on devices running balenaOS the Supervisor manages the containers on the engine. It is possible that the Supervisor is removing the network from the containers, but that seems unlikely because it would only do so if the target state had changed. I just wanted to add that we can confirm whether it's the Supervisor, because the only way the Supervisor removes a network from a container is by deleting the existing container and creating a new one. So, if we can reproduce this issue: deploy your containers and note the created_at field of the container that has access to the network. Perform the steps that cause the network to be removed and verify that the container is no longer on the network. Then check whether created_at has changed. The Supervisor logs would also indicate that it is going to recreate the container. |
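The check described above could be scripted roughly as follows. This is only a sketch, and it assumes that the engine's inspect output is Docker-compatible JSON (an array of objects carrying `Id` and `Created`), with `Created` standing in for the created_at field mentioned in the comment. The helper names `load_container` and `was_recreated` are made up for illustration.

```python
import json


def load_container(inspect_output: str) -> dict:
    """Return the first object from container inspect output.

    Assumption: the output is a JSON array with one object per
    inspected container, as in Docker-compatible engines.
    """
    return json.loads(inspect_output)[0]


def was_recreated(before: dict, after: dict) -> bool:
    """Compare two inspect snapshots of what should be the same container.

    A recreated container gets a new ID and a new creation timestamp,
    so a change in either field suggests the container was deleted and
    created again rather than modified in place.
    """
    return before["Id"] != after["Id"] or before["Created"] != after["Created"]
```

Save the inspect output once after deploying and once after the container has dropped off the network, then pass both snapshots to `was_recreated`. A result of `False` would suggest the container was not recreated, which points away from the Supervisor.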
This could be happening when a privileged / A log like this then means that NM removed the container from the network, leading to this scenario:
balenaOS' own NetworkManager is configured to ignore those |
[pipex] This issue has attached support thread https://jel.ly.fish/950fec98-fb0b-440a-9c56-8e034833c7c0 |
This could be related to #261. I am looking at a device that is refusing to update with error balena-engine version
balena-engine info
This happened to them on at least 10% of their fleet of 10000 devices |
@pipex just FYI, this still seems to be happening with every update on a number of devices. Pushing a random fleet variable such as FOO=BAR to all services seems to get them past this issue, but that's very annoying to have to do manually each time. |
I'm also seeing this behavior randomly on about 10% of our devices in Balena. We also have a privileged container in host networking mode that is used for configuring our devices' networking over Bluetooth. This container also has the balena socket exposed, and when this occurs the socket is no longer accessible from that container. Another container that also has the docker/balena socket exposed (also privileged, but not in host networking mode) can still access the socket. |
[cywang117] This issue has attached support thread https://jel.ly.fish/d9da1684-f2d8-4929-934a-7f738ee4a0da |
[pdcastro] This issue has attached support thread https://jel.ly.fish/41b56e32-5fae-4a2e-b5bb-05f9f5af1f0f |
[zwhitchcox] This issue has attached support thread https://jel.ly.fish/d215c693-4477-4359-b06e-e158be58e837 |
[lmbarros] This issue has attached support thread https://jel.ly.fish/3386a82e-c9a9-4a03-8774-b0e617761a22 |
[cywang117] This issue has attached support thread https://jel.ly.fish/6e8b31bc-cd9a-4d50-8e36-19ef200ded77 |
In this support ticket above, I observed that during device startup, the engine restarts due to As a result, when inspecting the network, the container is not on it, but when inspecting the container, it shows as being on the network, which is consistent with the observations originally made in this GitHub issue. Anyone seeing this behavior in the future, please check for conflicting |
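For anyone wanting to detect this inconsistency automatically, here is a minimal sketch. It assumes Docker-compatible inspect JSON: container inspect lists its networks under `NetworkSettings.Networks` (keyed by network name), and network inspect lists attached containers under `Containers` (keyed by full container ID). The function name `networks_missing_container` is hypothetical.

```python
def networks_missing_container(container: dict, networks: dict) -> list:
    """List networks the container claims to be on but that do not list it.

    container: one object from container inspect output.
    networks:  mapping of network name -> object from network inspect
               output for that network.
    """
    container_id = container["Id"]
    claimed = (container.get("NetworkSettings") or {}).get("Networks") or {}
    missing = []
    for name in claimed:
        network = networks.get(name)
        if network is None:
            # No snapshot was collected for this network, so skip it
            # instead of reporting a false mismatch.
            continue
        members = network.get("Containers") or {}
        if container_id not in members:
            missing.append(name)
    return sorted(missing)
```

An empty list means both views agree. A non-empty list names the networks where the container's own view and the network's view disagree, which is the symptom reported in this issue.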
I saw another case of Engine timing out during startup. Increasing the startup timeout worked around the issue:
I am planning to investigate further over the next few days. |
I reviewed the 7 support tickets we have attached to this issue. We don't usually have all the data needed to check whether they were cases of Engine startup timeouts, but for 4 tickets (from two different users) there is strong evidence that this was indeed the case. Two other tickets (from two other users) were more difficult to analyze and probably involved more than one issue, but we did notice unexpected behavior after reboot (in one of them, it caught my attention that the application was using 11 containers, which could translate to a higher Engine startup time). Still, very importantly, in one ticket we have good evidence that a container got dropped off the network without a reboot or Engine restart (the Engine had a 49-day uptime in this case). So, startup timeouts seem to be a common cause of this issue, but not the only one. |
balenaOS v2.98.4 and later should help with the cases in which this issue is triggered by Engine startup timeouts (see balena-os/meta-balena#2584). We have good evidence that there are still other possible ways to trigger this error, so we'll keep investigating. |
Description
In support we have seen some recent cases where containers are removed from the bridge network unexpectedly.
Steps to reproduce the issue:
TBD
Describe the results you received:
balena inspect ${CONTAINER_ID}
will still show up on the proper network.
balena network inspect ${CONTAINER_ID}
the network does not include the container.
Describe the results you expected:
Additional information you deem important (e.g. issue happens only occasionally):
Issue happens frequently since upgrading from v2.58.4 but can be resolved by restarting the container.
Output of balena-engine version:
Output of balena-engine info:
Additional environment details (device type, OS, etc.):