Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mctpd does not track network interface changes correctly #55

Open
rmd4ctf opened this issue Nov 22, 2024 · 2 comments
Open

Mctpd does not track network interface changes correctly #55

rmd4ctf opened this issue Nov 22, 2024 · 2 comments

Comments

@rmd4ctf
Copy link

rmd4ctf commented Nov 22, 2024

The mctpd daemon was loaded. During the OS operation, mctpi2c* interfaces were repeatedly created and deleted.
After some time, a device was added via the SetupEndpoint d-bus interface. The device was added, but there was no communication with it via the mctpd daemon. Only rebooting the daemon and re-adding the device helped.
I enabled debug output (by adding "-v" to mctpd) and received the following entries (the log was cleaned up a bit):

Feb 28 18:49:46 local mctpd[30382]: Adding interface #13 mctpi2c61
Feb 28 18:49:46 local mctpd[30382]: mctpd: Warning, interface mctpi2c61 is down
...
Feb 28 18:49:54 local systemd-networkd[306]: mctpi2c61: Link UP
Feb 28 18:49:54 local systemd-networkd[306]: mctpi2c61: Gained carrier
...
Feb 28 18:49:58 local mctpd[30382]: Adding local eid 8 net 1
Feb 28 18:49:58 local mctpd[30382]: mctpd: emit_endpoint_added: /xyz/openbmc_project/mctp/1/8
Feb 28 18:49:58 local pldmd[187]: Successfully initialized platform manager
Feb 28 18:50:09 local systemd-journald[69]: Forwarding to syslog missed 163 messages.
Feb 28 18:50:09 local mctpd[30382]: mctpd: read_message got from sockaddr_mctp_ext eid 9 net 1 type 0x00 if 13 hw len 1 0x32 len 6
Feb 28 18:50:09 local mctpd[30382]: Adding neigh to peer eid 9 net 1 phys physaddr if 13 hw len 1 0x32 state 1
Feb 28 18:50:09 local mctpd[30382]: Adding route to peer eid 9 net 1 phys physaddr if 13 hw len 1 0x32 state 1
Feb 28 18:50:09 local mctpd[30382]: mctpd: BUG peer_route_update: Unknown ifindex 13
Feb 28 18:50:09 local mctpd[30382]: mctpd: Failed adding route for peer eid 9 net 1 phys physaddr if 13 hw len 1 0x32 state 1: No such device
Feb 28 18:50:09 local mctpd[30382]: mctpd: endpoint_query_addr: sendto(sockaddr_mctp_ext eid 9 net 1 type 0x00 if 0 hw len 0 0x) 2 bytes failed. No route to host
Feb 28 18:50:09 local mctpd[30382]: mctpd: Error getting endpoint types for peer eid 9 net 1 phys physaddr if 13 hw len 1 0x32 state 1. Ignoring error -113 No route to host

Then I analyzed and debugged mctpd and found the essence of the problem. Receiving information and working with the kernel (in terms of managing addresses, routes and neighbors) is done via Netlink sockets. For this purpose, the context structures [1] "mctp_nl *nl" and "mctp_nl *nl_query" are defined in the ctx context. They both contain a list of kernel interfaces [2], which are filled when creating the nl, nl_query contexts.
When mctpd is running and receiving information from the kernel about network interface changes, changes are made only to the list of interfaces in the nl context (when calling the mctp_nl_handle_monitor function). The list of interfaces in the nl_query context does not change, so it becomes obsolete. Since the nl_query context is used to execute mctp_nl_route_del/mctp_nl_route_add, peer_neigh_update, peer_route_update queries, these queries operate with an obsolete interface table. I conducted an experiment by replacing nl_query with nl in the above queries and mctpd began to work correctly. I do not know for what purposes the work with the kernel Netlink interface was done through 2 contexts. In addition, each context contains 2 sockets (sd, sd_monitor), although sd_monitor is not used for the nl_query context. It turns out that the work is carried out via 3 netlink sockets. Therefore, perhaps my method is not correct. I would like to emphasize separately that this is only about exchanging netlink messages with the kernel and has nothing to do with exchanging with other mctp devices.
I also considered other ways to solve this situation. Since there is no need to keep 2 tables of network interfaces in each context, there were other options:

  1. In the second context (nl_query), replace the array of interfaces with a pointer to the array from the first context (quick fix).
  2. When making changes to the array of interfaces of the first context, simultaneously make changes to the second context.

[1] https://github.com/CodeConstruct/mctp/blob/main/src/mctpd.c#L190
[2] https://github.com/CodeConstruct/mctp/blob/main/src/mctp-netlink.c#L44

@rmd4ctf
Copy link
Author

rmd4ctf commented Dec 17, 2024

In our solution we decided to use small fix, described in first option:
In the second context (nl_query), replace the array of interfaces with a pointer to the array from the first context (quick fix).
I am attaching a patch with my changes (Github doesn't allow me to attach a .patch file, so I renamed it).

0001-fix-mctpd-freezing-when-manipulating-interfaces.txt

@jk-ozlabs

@jk-ozlabs
Copy link
Member

Thanks for the report; for some reason this had not been visible for me until now. I'll take a look and review the proposed fix too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants