Replies: 6 comments 16 replies
-
Usually, if you don't see errors from a scrub but do see them at runtime, they're coming from decryption, since scrub doesn't try to decrypt anything (otherwise it couldn't scrub without the keys being loaded). Historically there were a couple of things that could spuriously cause decryption errors, but I think those should all be fixed by 2.1.11. There are counters under `/proc/spl/kstat/kcf/` for how many times decryption failed, so you can see whether that's what's going on. As for getting errors with send/recv: if you use `-e` when sending from an unencrypted dataset to an encrypted receive, that will fail, since `embedded_data` records can't be stored on natively encrypted datasets.
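For example, something like this shows whether any failure counters there move when the error happens (a rough sketch; the exact file and field names under that directory depend on the build, so it just greps for anything failure-related):

```sh
# Snapshot the crypto framework kstats, reproduce the runtime error
# (e.g. read the affected file or run the failing zfs send), then diff
# to see whether any failure counter incremented.
grep -iH fail /proc/spl/kstat/kcf/* > /tmp/kcf-before
# ... reproduce the error here ...
grep -iH fail /proc/spl/kstat/kcf/* > /tmp/kcf-after
diff /tmp/kcf-before /tmp/kcf-after
```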
-
Hello,
-
I'm trying to find where in the code the
-
Here's the script I'm using to log when zpool errors appear: https://gist.github.com/mattico/d89172579cd69a4d8b8077c2e4fe8c17 I suppose it could be useful to also log
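(For anyone curious, the general shape is just a polling loop; here's a rough sketch of the idea, not the gist itself, with a made-up log path:)

```sh
#!/bin/sh
# Sketch of a zpool error logger: poll `zpool status -x` and append the
# verbose status whenever any pool is reported unhealthy.
LOG=/var/log/zpool-errors.log    # placeholder path
while sleep 60; do
    if ! zpool status -x | grep -q 'all pools are healthy'; then
        { date -Is; zpool status -v; } >> "$LOG"
    fi
done
```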
-
Okay, this time I'm running kernel
-
Another odd symptom: my sanoid service has been stuck trying to take a snapshot for 1 day 11h:
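One thing I could try while it's wedged (just a diagnostic sketch, assuming the stuck `zfs snapshot` process is still visible) is dumping its kernel stack and checking for hung-task warnings:

```sh
pid=$(pgrep -f 'zfs snapshot' | head -n1)       # assumes the stuck zfs process is still running
sudo cat /proc/"$pid"/stack                     # kernel stack of the blocked task
sudo dmesg | grep -i 'blocked for more than'    # hung-task watchdog messages, if any
```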
-
I have a pair of identical servers running Debian 11 with kernel `6.1.0-0.deb11.6-amd64` and OpenZFS `2.1.11-1~bpo11+1`. They have an SSD mirror root pool (`rpool`) and a RAIDZ2 data pool (`tank`). I recently re-created both pools from snapshots to enable native encryption. Shortly after, I noticed that one of the tank pools started reporting permanent errors:

From what I've seen, all of the errors have been in snapshots or `<hex numbers>`, which I assume are deleted snapshots. (Deleting the snapshots turns them into hex numbers in the list.) I've never seen any errors reported in any of the pool devices, nor in any files. I've never seen a ZFS scrub report any errors. I think the errors I have seen have been in just two specific datasets: `tank/wordpress` and `tank/caddy`.

I was able to get all the errors to disappear by doing the following:
1. `zpool clear tank` (I'm not sure if this is necessary)
2. `zpool scrub tank` and wait for it to complete.
3. `zpool scrub tank` again and wait for it to complete.

(It's supposed to take two scrubs for reported errors to get cleared, from what I understand; the full command sequence is sketched below.)
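Concretely, the sequence was roughly this (a sketch; `zpool wait -t scrub` is just one way to block until each scrub finishes):

```sh
zpool clear tank            # possibly unnecessary
zpool scrub tank
zpool wait -t scrub tank    # wait for the first scrub to complete
zpool scrub tank
zpool wait -t scrub tank    # wait for the second scrub to complete
```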
I thought that had fixed the issue but after re-enabling syncoid, the errors started re-appearing. When syncoid attempts to send/recv some snapshots (in `tank/wordpress` or `tank/caddy`) it gets an I/O error.

Update:
I can confirm that attempting to `zfs send` the snapshots (with syncoid) is what causes the errors to be detected and appear in `zpool status`. At the moment the `tank/caddy` dataset is fine but `tank/wordpress` and `tank/nextcloud/data` are failing. I also get a new error: `cannot receive incremental stream: kernel modules must be upgraded to receive this stream.` which is weird because I confirmed that both kernel modules are the same version. Perhaps it's just data corruption causing that.
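(Presumably the same thing is reproducible without syncoid, e.g. with a bare send thrown away; a sketch below, where the snapshot name is a placeholder, not one of the real ones:)

```sh
# Send one of the affected snapshots straight to /dev/null; if the permanent
# errors reappear in `zpool status -v`, syncoid itself isn't the cause.
zfs send tank/wordpress@errored_snapshot > /dev/null    # placeholder snapshot name
zpool status -v tank
```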
Update:
I did a few cycles of running syncoid and deleting those snapshots which had errors, and I got syncoid to complete successfully. So for the moment every dataset is able to create and send snapshots without error.
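Roughly, each cycle looked like this (a sketch; the snapshot name and the syncoid target are placeholders, not the real ones):

```sh
zpool status -v tank                                            # note which snapshots show permanent errors
zfs destroy tank/wordpress@errored_snapshot                     # placeholder snapshot name
syncoid --recursive tank/wordpress root@backup:tank/wordpress   # placeholder destination
```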
Update:
The errors started appearing again shortly after.
Are these errors real? Why doesn't scrub find them?
What more can I do to continue diagnosing this? How can I fix this?