#671 (UKF cleanup) leads to infinite loops and hanging of the whole state estimation #777
Comments
Thanks, I'll take a look. I think I replaced those. The old UKF logic wasn't really functioning properly as a UKF, so there are likely some parameters that need to be tuned now.
@peci1 Looking at the logs you posted, I don't see a single IMU measurement being received by the filter. You are attempting to produce a 3D state estimate, but your roll and pitch values have no reference. See this covariance matrix:
All of your absolute orientation variables have massive covariance values, presumably because (a) you have no roll/pitch reference, and (b) the correlation between yaw, roll, and pitch means that you are going to see yaw covariance increase as well. The correlation with the rest of the state variables will eventually lead to those exploding too. Looking at the bag you sent, the wheel encoder odometry is publishing for nearly 22 seconds before the IMU data even starts arriving. I echoed the first topic from each, then started playing the bag:
So you are asking the filter to produce a 3D state estimate, but starving it of 3D orientation measurements for over 20 seconds. That's going to make the covariance explode.
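(For intuition only, here is a minimal scalar sketch with made-up noise values, not robot_localization code, of why the covariance of an unmeasured orientation variable explodes: every prediction step adds process noise, and nothing shrinks it until a measurement of that variable finally arrives.)

```cpp
#include <cstdio>

// Illustrative scalar example (not robot_localization code): one state
// variable, e.g. roll, predicted at an assumed 30 Hz but never measured.
int main() {
  double P = 1e-9;            // assumed initial roll variance
  const double Q = 0.02;      // assumed per-step process noise for roll
  const int steps = 30 * 22;  // roughly 22 s of prediction-only cycles

  for (int i = 0; i < steps; ++i) {
    // Prediction only: P = F * P * F' + Q. With no roll measurement there is
    // never a correction step, so the variance grows monotonically.
    P += Q;
  }
  std::printf("roll variance after ~22 s without a measurement: %g\n", P);
  return 0;
}
```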
The missing IMU measurements in the log are a consequence of the reported problem. If I comment out receiving odom messages, the filter processes the IMU messages. If both are used, the filter gets stuck before the IMU measurements start coming. You are right that the IMU messages do not arrive for some time. Our IMU does bias estimation on startup and does not output anything until the biases are estimated. I didn't think this could be a problem because I'm fusing the absolute roll/pitch from the IMU, so when messages start coming, the covariances should quickly get to reasonable numbers. And they did before #671. As the robot is forced to be stationary during the bias estimation, all odom messages should read zero velocity. Therefore, the filter should not be forced to project non-zero velocities along the exploding roll and pitch. I see it could be better to start the filter only when IMU measurements are available. Is there a setting supporting this (except the explicit pause service)? I.e., can I mark a sensor as required (so that predictions are not done if the sensor is timed out)?
That's because the filter won't start until it receives its first measurement.
It might still do that if the filter was able to get to that point before things started freezing. For that reason, maybe try the branch I mentioned to see if it helps?
Yeah, the filter was effectively not properly implementing a UKF. Now it is, and I also found that it's more sensitive to this kind of thing. But I'd rather have the math be right, even if it means the filter has greater sensitivity to sensor measurement holidays. To be fair, 22 seconds is a very long time to go without any reference in one of your state variables.
Prediction is done using the filter's state, not the measurements. If there's a lot of instability in the covariance values, then I can see the "almost zero" state variables (they are really on the order of 1e-15) getting corrected via massive Kalman gains into non-zero values. That's not to say there isn't another issue lurking (and I'll poke around). It's just that I can see a legitimate cause for this behavior.
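(As a hedged illustration of that last point, using the plain scalar Kalman update rather than the package's actual code, and with assumed numbers: once the predicted variance P dwarfs the measurement noise R, the gain K = P / (P + R) approaches 1, so an "almost zero" state estimate gets yanked to whatever value the correction step supplies.)

```cpp
#include <cstdio>

// Scalar Kalman correction, for illustration only.
// x = state estimate, P = state variance, z = measurement, R = measurement noise.
int main() {
  double x = 1e-15;       // "almost zero" state variable, as in the log
  double P = 1e6;         // assumed variance after many uncorrected predictions
  const double z = 0.3;   // assumed measurement (or projected) value
  const double R = 0.01;  // assumed measurement noise

  const double K = P / (P + R);  // gain is essentially 1 when P >> R
  x = x + K * (z - x);           // the tiny state jumps almost exactly to z
  P = (1.0 - K) * P;             // variance collapses after the correction

  std::printf("K = %.8f, corrected x = %.6f, corrected P = %.6f\n", K, x, P);
  return 0;
}
```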
This does not exist, but it's a great idea (I think there might be a ticket for it?). I'd welcome a PR. :)
Yeah, one thing I am not happy about is this part of the logs:
But then in the next prediction, we get
So the roll and pitch get projected into -pi and pi, respectively. Will dig into that.
OK, I think I spotted an issue with how Euler angles are projected. Fix is pushed to the same branch.
No, that won't do what I thought. OK, I'll have to give the bag a go myself. Will try to find cycles for that.
Good, thanks a lot!
@peci1 can you do me a favour and add these parameters to your config file?
And then run and let me know how it goes.
Ok, do you mean on the real robot or just running with the bag file? The robot will not be available until next week probably...
I know it'll work on the bag file you sent, but if you have any others that caused failure, that would be great. But please at least verify on the bag you sent.
Ok, with the new parameters, it does not get stuck on the bag. However, the localization is still erratic when the IMU messages start coming. Also, NaNs are lurking somewhere (although not in any fatal way) - at least rviz says so when visualizing the odometries. I'll test on the real robot next week. Unrelated: do you plan to backport all these changes to melodic, too? I've just compiled the noetic-devel branch on my Melodic PC and it seems to work ok.
Are you by chance filtering out the "bad" values from the bag before you run again?
What are bad values? You mean the time when only odom is coming? No, I was now testing on the same bag that's mentioned in the original post. What fixed frame do you use in rviz to get this result? I used 'odom'.
Ah, I see, you mean the /imu_odom topic.
Good, when filtering out the old /imu_odom topic, I see exactly the result you show (when scaling up the covariances a lot in rviz). Do you have an explanation for the UKF parameter values that fixed it?
Well, I don't think it's really a fix, as it turns out. It will solve that particular issue, but that breaks other things (like our regression tests). I know the main issue; I just need to think of a solution. In the meantime, I'd either switch to the EKF or add the feature whereby we wait for all sensor inputs before starting the filter.
Thanks, I'll have a look. I got busy in the meantime, but I think I can do this soon.
@ayrton04 this while loop is a disaster waiting to happen. My project is affected by this bug after migration from ROS melodic to noetic. The robot_localization process is stuck at clampRotation().
@daniel-zawadzki-vay that change touches far more than the rotation clamping. Please feel free to submit a separate PR with that in isolation. However, that clamping logic has been around since the very early days of the package, so I'm not clear on why it's suddenly an issue. This package is deprecated right now, so while I will definitely merge a PR to address the angle wrapping, I don't have the cycles to iterate on the UKF in general right now.
Describe the bug
Running ukf_localization_node (or the nodelet) with #671 merged causes the node to stop working a while after startup. The responsible commit was found with git bisect.
To Reproduce
Steps to reproduce the behavior:
node config (file odom_localization.yaml):
Launch the node with:
Attaching to the running filter with gdb, it is always stuck here:
Notice the gigantic rotation argument of clampRotation(). Apparently, the while loop has a lot of work to do to get this number into the -2pi..+2pi range by subtracting 2pi at a time. Of course, using a modulo function would prevent it from getting stuck, but looking into the logfile, the state is apparently exploding, so that would not fix the underlying problem.

Here is the debug log of the filter: debug_log.zip Apparently, the state explodes near the end of the log file. No more records are added to this file once the filter gets stuck.
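(For reference, a minimal sketch of the two approaches under assumed names and bounds; this is not the package's actual clampRotation implementation. The loop-based clamp needs on the order of |rotation| / 2pi iterations, which for a value like the one above is effectively an infinite loop, while an fmod-based normalization runs in constant time.)

```cpp
#include <cmath>

namespace {
const double kPi = 3.14159265358979323846;
const double kTau = 2.0 * kPi;
}

// Loop-based clamping in the spirit of the reported hot spot: the iteration
// count grows with |rotation|, so an exploded state value stalls the filter.
double clampRotationLoop(double rotation) {
  while (rotation > kPi)  { rotation -= kTau; }
  while (rotation < -kPi) { rotation += kTau; }
  return rotation;
}

// Constant-time alternative using std::fmod; it returns the same wrapped
// angle regardless of how large the input is (though, as noted above, it
// would only hide the underlying state explosion, not fix it).
double clampRotationFmod(double rotation) {
  rotation = std::fmod(rotation, kTau);
  if (rotation > kPi)  { rotation -= kTau; }
  if (rotation < -kPi) { rotation += kTau; }
  return rotation;
}
```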
Version 2.6.3 does not result in this behavior; the filter runs normally.