Lengthy Duration of xfs_repair Process

Hello everyone,

I am currently facing an issue with a server running AlmaLinux 9 that holds 54 TB of data in a RAID 5 array built from 14 disks. Recently, two of the disks became corrupted and were replaced. Following the replacement and a server reboot, the system indicated that a filesystem repair was required, so we initiated the xfs_repair command to address the issue.

However, the xfs_repair process has been running continuously for the past six days, with the same message being displayed every 15 minutes: “rebuild AG headers and trees - 55 of 55 allocation groups done.”

I think it is taking too much time and maybe something needs to be done. I would greatly appreciate any insights, suggestions, or thoughts.

Sounds way too long. I’ve just run xfs_repair on a known faulty filesystem of about 1 TiB:

# time xfs_repair /dev/mapper/Tamar-Photos
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
totally zeroed log
        - scan filesystem freespace and inode maps...
        - 08:48:00: scanning filesystem freespace - 33 of 33 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - 08:48:00: scanning agi unlinked lists - 33 of 33 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 30
...
        - agno = 14
        - 08:48:01: process known inodes and inode discovery - 23296 of 23296 inodes done
        - process newly discovered inodes...
        - 08:48:01: process newly discovered inodes - 33 of 33 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 08:48:01: setting up duplicate extent list - 33 of 33 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 1
        - agno = 3
        - agno = 2
...
        - agno = 31
        - agno = 32
        - 08:48:01: check for inodes claiming duplicate blocks - 23296 of 23296 inodes done
Phase 5 - rebuild AG headers and trees...
        - 08:48:01: rebuild AG headers and trees - 33 of 33 allocation groups done
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
        - 08:48:01: verify and correct link counts - 33 of 33 allocation groups done
Maximum metadata LSN (1:139264) is ahead of log (0:0).
Format log to cycle 4.
xfs_repair: libxfs_device_zero write failed: Input/output error

real	0m34.308s
user	0m0.121s
sys	0m0.101s

Obviously you can’t simply scale from one to the other, but 34 seconds to 6 days seems a bit much for a 50-fold increase in filesystem size.
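
If you want to check whether the repair is actually making progress rather than just sitting there, the per-process I/O counters are a quick sanity test. A sketch only, assuming a single xfs_repair process and that task I/O accounting is enabled in your kernel (it normally is on RHEL-family distributions); take two samples a minute or so apart and compare read_bytes and write_bytes:

# cat /proc/$(pidof xfs_repair)/io
# sleep 60; cat /proc/$(pidof xfs_repair)/io
# top -b -n 1 -p $(pidof xfs_repair)

If the byte counters never move and the process is using essentially no CPU over several samples, it is probably stuck rather than merely slow.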

Thank you @MartinR
Do you think it would be OK to reboot and attempt the process again?

I’m sorry, but without a great deal more information, particularly how the server is used, I can’t give a simple yes/no. As the saying goes: “if you break it, you get to keep the pieces”.

Having been unhelpful, IME it is safe to stop xfs_repair and restart it; indeed, the man page, under “EXIT STATUS”, says of some errors “In this case, xfs_repair should be restarted”. Are there other filesystems/services being used on this machine? If the machine is otherwise unusable then I would be tempted to:

  1. Ensure you are monitoring the system for any hardware problems, for instance by tailing /var/log/messages through grep (see the example commands after this list). If you are using an external hardware RAID controller, keep an eye on its logs.
  2. Keep monitoring the system for I/O activity: pcp, iotop, etc.
  3. If nothing is happening, stop the xfs_repair and restart it, then watch the various monitors.
  4. Consider running a benchmark on the replaced spindles.
  5. Finally, reboot. XFS is a pretty stable filesystem with normally good recovery.
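
To put items 1 to 4 into concrete commands, here is roughly what I would run. This is only a sketch: /dev/XXX and /dev/sdX are placeholders for your actual filesystem device and the replaced member disks, and if the disks sit behind a hardware RAID controller the individual spindles may not be visible to the OS at all, in which case use the controller's own tools instead.

Watch the kernel and system logs for I/O errors while the repair runs (item 1):

# tail -f /var/log/messages | grep -iE 'i/o error|sector|scsi|fail'
# journalctl -kf

Watch overall disk activity (item 2; iostat is in the sysstat package):

# iostat -xm 5
# iotop -o

If you do stop and restart the repair (item 3), capture its output and exit status so you have something to compare against the man page's EXIT STATUS section:

# xfs_repair -v /dev/XXX 2>&1 | tee /root/xfs_repair.log; echo "exit status: ${PIPESTATUS[0]}"

A simple read-only timing test on the replaced spindles (item 4), plus a SMART health check:

# hdparm -t /dev/sdX
# smartctl -a /dev/sdX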

Oh, and you do have good backups don’t you? :astonished: