Howdy all! I’ll try to keep this concise. First, thanks for the project. We’ve been using Alma9 for the last 6 months or so, and it has been mostly great.
Environment: OpenVZ 7 VPS + Alma9 template.
Issue: After a yum/dnf update, many systemd services fail. Rebooting the container fixes all issues, and everything works without issue until the next update.
Some examples:
MariaDB crashes and requires a pkill -f and a systemctl restart of the service to bring it back online.
Apache crashes, same as above.
sshd crashes, same as above, but even after restarting the service and getting it running, you cannot connect: it just hangs after accepting the password.
Other services (php-fpm, saslauthd, polkit) have also been found crashed, and similar steps seem to get them going.
The only reliable solution we’ve found is to restart the container. After that, all services come back up and everything works as expected.
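For anyone trying to reproduce this, here is a rough sketch of how the failed services could be listed and restarted in one go after an update, instead of hunting them down one by one. The function names `parse_failed_units` and `restart_failed_services` are made up for illustration, and given the behaviour described above a restart may well not be enough on its own (sshd in particular).

```shell
# Sketch only: list failed service units and try restarting each one.
# Assumes systemctl is available inside the container.

# Print the unit name (first column) from the output of
# `systemctl list-units --plain --no-legend`, read on stdin.
parse_failed_units() {
    awk 'NF > 0 { print $1 }'
}

# Try to restart every failed service and report the ones that still fail.
restart_failed_services() {
    systemctl list-units --type=service --state=failed --plain --no-legend \
        | parse_failed_units \
        | while read -r unit; do
            systemctl restart "$unit" </dev/null \
                || echo "still failing: $unit" >&2
        done
}

# Usage, as root, after a dnf/yum update:
#   restart_failed_services
```

Running `systemctl list-units --type=service --state=failed` by itself before and after an update should also show exactly which units the update knocked over.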
The only relevant error in the output of dnf/yum update is:
2023-10-10T11:11:51-0600 SUBDEBUG
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/dnf/cli/main.py", line 176, in resolving
    base.do_transaction(display=displays)
  File "/usr/lib/python3.9/site-packages/dnf/cli/cli.py", line 246, in do_transaction
    tid = super(BaseCli, self).do_transaction(display)
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 1034, in do_transaction
    tid = self._run_transaction(cb=cb)
  File "/usr/lib/python3.9/site-packages/dnf/lock.py", line 147, in __exit__
    os.unlink(self.target)
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/dnf/rpmdb_lock.pid'
2023-10-10T11:11:51-0600 CRITICAL [Errno 2] No such file or directory: '/var/lib/dnf/rpmdb_lock.pid'
I’m not sure this is related to the issue, since yum updates finish just fine after that output. This feels more systemd related.
We’ve tried quite a few things, including extensive systemd debugging and the following:
systemctl daemon-reload
systemctl daemon-reexec
We’ve also manually killed and restarted the failed services, but the containers just don’t work quite right (particularly ssh) until they’re actually rebooted. Then all is well again, indefinitely, until the next round of updates. It seems like any service that gets updated fails to start afterwards. We noticed in the system journal output that a lot of these services fail to start because of lock files.
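Since the journal points at lock files, one thing that might help narrow it down is checking whether the PID/lock files left behind refer to processes that no longer exist. Below is a minimal sketch of such a check. It assumes each file holds a single numeric PID on its first line, which is not true of every lock file, and the helper name `is_stale_pidfile` and the `/run/*.pid` path in the usage comment are just examples, not something taken from the logs.

```shell
# Sketch only: detect PID/lock files whose recorded process is gone.
# Assumes the file's first line is a numeric PID (hypothetical layout).

# Exit status 0 when the file names a PID that is not running.
# Exit status 1 when the process is alive, or the file is missing
# or does not contain a plain number.
is_stale_pidfile() {
    pidfile=$1
    [ -f "$pidfile" ] || return 1
    pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric: cannot judge
    esac
    # kill -0 sends no signal; it only tests whether the PID can be signalled.
    # Run as root, otherwise processes owned by other users look dead too.
    if kill -0 "$pid" 2>/dev/null; then
        return 1
    fi
    return 0
}

# Usage, as root (the path is an example):
#   for f in /run/*.pid; do
#       is_stale_pidfile "$f" && echo "stale: $f"
#   done
```

If the stale files line up with the services that refuse to start, that would at least confirm the lock-file theory before anything gets deleted by hand.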
We ran CentOS 7 for years prior to this, and I can’t think of a single time that we ran into this kind of issue. At first I was hoping it was a fluke, but so far every single round of yum updates has required container restarts to get everything working again.
Any input here is greatly appreciated. For now we’ll just keep rebooting containers post updates.