We updated all of our Alma 8 systems from 8.8 to 8.9, to all the latest packages as of March 5. We have three systems that will “bind up” and become gradually unresponsive after 5-10min. Two of them are production servers and had to be rolled back. Another one is a test system, so we can force-reboot it and try to determine what is the problem.
What we found is that after about 5min., processes start getting stuck waiting on disk I/O.
The systems are all VMs, so failing HDD is ruled out. The are on different VMware hosts, so the host doesn’t seem to be the problem. The underlying storage for the three are on different underlying all-flash storage arrays.
All three affected servers run web applications, but two use nginx and another uses apache.
The first processes that get stuck on Disk I/O are: systemd, crond, syslog-ng, and sssd_be.
Has anyone else seen this behavior? Does anyone have advice?