Following the upgrade to 4.18.0-553.81.1.el8_10.x86_64 we see large differences between the used memory as reported by free and the sum of RSS as reported by ps.
Over a relatively short time (less than 48 hours), this difference can grow to become as large as 80GB, and ultimately results in OOMs on a system which otherwise (e.g. on 4.18.0-553.80.1.el8_10.x86_64) has plenty of spare RAM.
For example:
[root@better ~]# uname -r
4.18.0-553.80.1.el8_10.x86_64
[root@better ~]# ps aux | awk '{sum +=$6}END{print sum/1024/1024}'
99.8527
[root@better ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:          178Gi       101Gi        12Gi       190Mi        63Gi        75Gi
Swap:         4.0Gi        38Mi       4.0Gi
Compared to:
[root@better ~]# uname -r
4.18.0-553.81.1.el8_10.x86_64
[root@better ~]# ps aux | awk '{sum +=$6}END{print sum/1024/1024}'
86.5422
[root@better ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:          178Gi       172Gi       4.4Gi       210Mi       1.7Gi       4.4Gi
Swap:         4.0Gi       3.2Gi       810Mi
The same was observed on 4.18.0-553.89.1.el8_10.x86_64.
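For anyone wanting to reproduce the comparison or narrow down where the missing memory sits, here is a minimal sketch that puts the "used minus sum of RSS" gap next to a few kernel-side counters from /proc/meminfo. The function names are hypothetical, and the formula for "used" (MemTotal - MemFree - Buffers - Cached - SReclaimable) is an assumption about how the procps free shipped with EL8 computes it, so the result may differ slightly from what free prints. Some fields (e.g. Percpu) may be absent on a given kernel, in which case they are reported as 0.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of any existing tool.

# Print the value (in KiB) of one /proc/meminfo field, or 0 if it is absent.
# Usage: meminfo_kib FIELD [FILE]
meminfo_kib() {
    awk -v key="$1:" '$1 == key { v = $2 } END { print v + 0 }' "${2:-/proc/meminfo}"
}

# Sum RSS over all processes, in KiB. Shared pages are counted once per
# process, so this tends to overstate process memory, not understate it.
rss_sum_kib() {
    ps -eo rss= | awk '{ sum += $1 } END { print sum + 0 }'
}

# Approximate "used" the way older procps free(1) is assumed to compute it:
# MemTotal - MemFree - Buffers - Cached - SReclaimable (all in KiB).
used_kib() {
    f="${1:-/proc/meminfo}"
    total=$(meminfo_kib MemTotal "$f")
    memfree=$(meminfo_kib MemFree "$f")
    buffers=$(meminfo_kib Buffers "$f")
    cached=$(meminfo_kib Cached "$f")
    sreclaim=$(meminfo_kib SReclaimable "$f")
    echo $(( total - memfree - buffers - cached - sreclaim ))
}

# Convert KiB to GiB with two decimals.
kib_to_gib() {
    awk -v k="$1" 'BEGIN { printf "%.2f\n", k / 1024 / 1024 }'
}

# Print the gap and the kernel-side counters most likely to explain it.
report() {
    f="${1:-/proc/meminfo}"
    used=$(used_kib "$f")
    rss=$(rss_sum_kib)
    echo "used (approx): $(kib_to_gib "$used") GiB"
    echo "sum of RSS:    $(kib_to_gib "$rss") GiB"
    echo "unaccounted:   $(kib_to_gib $(( used - rss ))) GiB"
    for field in Slab SUnreclaim KernelStack PageTables VmallocUsed Percpu; do
        echo "$field: $(kib_to_gib "$(meminfo_kib "$field" "$f")") GiB"
    done
}
```

Sourcing the file and calling `report` on both kernels would show whether the gap tracks one of these counters (for instance SUnreclaim growing over time, which could then be followed up in /proc/slabinfo) or whether the memory is not visible in /proc/meminfo at all.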
Reverting to 4.18.0-553.80.1.el8_10.x86_64 is a short-term workaround. Does anyone have suggestions for how to investigate the underlying cause further, and which upstream project should hear about it if/when more details are discovered?
Looking at the changelog for 4.18.0-553.81.1.el8_10.x86_64, I suspect that one or more of the changes introduced under RHEL-104909 is responsible, but unfortunately that issue doesn't appear to be public.
Which is the appropriate upstream for AlmaLinux 8 issues now, since CentOS Stream 8 is EOL?