Alma9 and DNF/YUM update issues

natebrownrice · October 10, 2023, 5:50pm

Howdy all! I’ll try to keep this concise. First, thanks for the project. We’ve been using Alma9 for the last 6 months or so and it has been mostly great.

Environment: OpenVZ 7 VPS + Alma9 template.

Issue: After a yum/dnf update, many systemd services fail. Rebooting the container fixes everything, and it then runs without issue until the next update.

Some examples:

MariaDB crashes and needs a pkill -f plus a systemctl restart of the service to bring it back online.

Apache crashes; same fix as above.

sshd crashes; same fix as above, but even after the service is restarted and running you cannot connect: the session just hangs after accepting the password.

Other services (php-fpm, saslauthd, polkit) have also been found crashed, and the same approach seems to get them going again.

The only real solution we’ve found is to restart the container: all services come back up and everything works as expected.
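For anyone who wants to script the manual recovery described above, here is a minimal sketch that lists the units systemd reports as failed and tries to restart each one. The function names (unit_names, list_failed_units, restart_failed_units) are made up for illustration, and it assumes a shell inside the container with systemctl available. It does not replace the reboot, which per this thread is the only thing that fully restores the container.

```shell
#!/bin/sh
# Print the unit name (first column) from each non-empty line on stdin.
unit_names() {
    awk 'NF { print $1 }'
}

# Ask systemd for failed units. --plain drops the status bullet and
# --no-legend drops the header and footer, so only unit rows remain.
list_failed_units() {
    systemctl list-units --state=failed --plain --no-legend | unit_names
}

# Try to restart every failed unit and report the ones that still fail.
restart_failed_units() {
    list_failed_units | while read -r unit; do
        systemctl restart "$unit" || echo "still failing: $unit" >&2
    done
}
```

Run list_failed_units right after an update to see what broke, then restart_failed_units to attempt recovery. Leftover processes may still need to be killed by hand first, as with the pkill -f mentioned above.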

The only relevant error in the output of dnf/yum update is:

2023-10-10T11:11:51-0600 SUBDEBUG 
Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/dnf/cli/main.py", line 176, in resolving
    base.do_transaction(display=displays)
  File "/usr/lib/python3.9/site-packages/dnf/cli/cli.py", line 246, in do_transaction
    tid = super(BaseCli, self).do_transaction(display)
  File "/usr/lib/python3.9/site-packages/dnf/base.py", line 1034, in do_transaction
    tid = self._run_transaction(cb=cb)
  File "/usr/lib/python3.9/site-packages/dnf/lock.py", line 147, in __exit__
    os.unlink(self.target)
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/dnf/rpmdb_lock.pid'
2023-10-10T11:11:51-0600 CRITICAL [Errno 2] No such file or directory: '/var/lib/dnf/rpmdb_lock.pid'

I’m not sure this is related to the issue, since yum updates finish just fine after that output. This feels more systemd related.

We’ve tried quite a few things, including extensive systemd debugging and the following:

systemctl daemon-reload
systemctl daemon-reexec

We also manually killed and restarted the failed services, but a container just doesn’t work quite right (particularly ssh) until it’s actually rebooted; then all is well again, indefinitely, until the next round of updates. It seems like any service that gets updated fails to start afterwards. We noticed in the system journal output that a lot of these services fail to start because of lock files.

We ran CentOS 7 for years prior to this, and I can’t think of a single time that we ran into this kind of issue. At first I was hoping it was a fluke, but every single round of yum updates so far has required container restarts to get everything working again.

Any input here is greatly appreciated. For now we’ll just keep rebooting containers post updates.

jordan · May 29, 2024, 7:51pm

Same issue here. I’ve reported it as a bug to OpenVZ devs as well…

Are you using a control panel?

My theory is that a systemctl daemon-reload occurring prior to system updates causes services to fail to reconnect to the reloaded systemd daemon, so messages passed from the processes to systemd get lost. This also results in a buildup of transient files in /run/systemd/transients.

Clearing out the transients helps delay the problem from occurring, but it doesn’t solve it.
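To watch that buildup between updates, a minimal sketch like the one below counts the entries in a directory. The helper name count_entries is hypothetical, and the path in the usage comment is the one quoted above. Upstream systemd normally names this directory /run/systemd/transient, so check which one exists in your container.

```shell
#!/bin/sh
# Print how many entries sit directly inside the given directory.
# Prints 0 if the directory does not exist.
# Assumes entry names contain no newlines.
count_entries() {
    dir=$1
    [ -d "$dir" ] || { echo 0; return 0; }
    find "$dir" -mindepth 1 -maxdepth 1 | wc -l | tr -d ' '
}

# Example (path as quoted in this thread, adjust if yours differs):
#   count_entries /run/systemd/transients
```

Comparing the count before and after a dnf update would show whether the number of transient files really grows with each round.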

jordan · May 30, 2024, 12:11am

Looks like this is the issue and a patch is available: "systemctl daemon-reexec forgets running services and tries to restart all services" (systemd/systemd issue #28184 on GitHub).

They’ve also identified the specifics of what the Virtuozzo packages are sending to systemd that’s causing the issue, so a workaround could be provided by the Virtuozzo devs rather than waiting for Red Hat to incorporate the patch into systemd: "core: reorder systemd arguments on reexec" (systemd/systemd commit 06afda6 on GitHub).

Also: I went through all the changelogs for the Red Hat systemd builds since July 2023 (when the patch was written) and can confirm this patch does not appear to have been backported, which is consistent with our findings.
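If you want to repeat that changelog check on your own container, a rough sketch follows. The helper name changelog_mentions is made up, and the search string is simply the commit title quoted above. It assumes a backport would show up in the package changelog under that title, which may not hold.

```shell
#!/bin/sh
# Read a changelog on stdin and print the lines containing the given
# text (case-insensitive, fixed string). Exit status is non-zero when
# nothing matches, as with plain grep.
changelog_mentions() {
    grep -i -F -- "$1"
}

# Example, run inside the container:
#   rpm -q --changelog systemd | changelog_mentions 'reorder systemd arguments'
```

No output (and a non-zero exit status) suggests the installed systemd build does not list the patch under that title.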

jordan · May 30, 2024, 7:20pm

I wonder if it’s possible to get a patch like the one linked above included in an AlmaLinux-specific build of systemd, or if packages must come strictly from RHEL upstream, in which case we have to wait for the Red Hat devs to backport the patch before it hits an AlmaLinux repo.