Poorly-coded programs with infinite loops killing AL 8.8 machines

Hi. In an academic computer science lab running AL 8.8, if a CS student writes and runs a program with an infinite loop, the Dell machine becomes unusable. We have to hard reset the power to get the machine back up.

Previous lab iteration was CentOS 7 on Lenovo, which never exhibited this problem.

Should I try to solve this with ulimit? cgroups? Or is this somehow triggering a known bug in 8.8 that would be fixed by an upgrade?

Thanks!

Have you tried either SSHing in, or switching to a text console with Ctrl+Alt+Fn (1 < n < 5)? It may be slow, but you can then kill the errant student process.
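For example, a hedged sketch of hunting down the runaway process once you have a console (the username is a placeholder):

# list the heaviest processes first
top -b -n 1 -o %MEM | head -20

# kill everything owned by the offending account (hypothetical username)
pkill -9 -u student42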

Any infinite loop, or does it depend on what it’s doing in the loop?

I just ran

while true; do echo "test"; done

from the command line on AL 8.9 with no problems (Dell Precision).

I have had issues since I made 8.x my main OS about two years ago:

If there’s significant, prolonged disk access, e.g. copying large files, the machine becomes unusable: ANY typing or mouse activity is near impossible, and there are also issues with USB, where again I/O has problems. At times I can have a near-unusable machine with 60% of its resources still available.

I’ve learned not to do things that “upset” it, so I’ll plug a phone into a particular USB port that’s on a bus that doesn’t cause the problem, etc.

I’m on 8.9 now and will stop at 8.10 until EOL (2030?). Someone else can bleed on the cutting edge! :slight_smile:

I was on Windows before, so I can live with the issues, and happily: life is much better. At least when I get around to fixing something, it stays fixed; RHEL doesn’t circumvent my changes in the next update the way Redmond does. And I’ve had less downtime and fewer issues than I had with Windows.


That, and/or nice. One thing you can do is run the scripts in Podman and use CPU limits to cap the maximum CPU time, etc. This should cause the Linux CFS scheduler to keep them honest according to the limits you configure, similar to how requests/limits work in Kubernetes.
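For example, a minimal sketch of such a run (the image, paths, and limit values are placeholders; --cpus, --memory, and --memory-swap are standard podman run options):

# cap the container at half a CPU, 256 MiB of RAM, and no extra swap
podman run --rm --cpus=0.5 --memory=256m --memory-swap=256m \
    -v "$PWD":/work:Z docker.io/library/python:3.11 \
    python3 /work/fib.py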

Otherwise, you can do it directly with cgroups v2, though that is slightly more complicated.
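A sketch of the direct route using systemd-run, which puts the command in a transient unit with cgroup limits (assumes the machine is booted with cgroups v2; the numbers are placeholders):

# run the script in a transient scope capped at 512 MiB and half a CPU
systemd-run --scope -p MemoryMax=512M -p CPUQuota=50% python3 fib.py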

Attempts to interact with the machine, locally or via ssh, both fail.

Here is the Python code that is bringing the machines down. It’s supposed to produce a list containing the first n Fibonacci numbers:

def fibonacci(n):
    fibs = [0, 1]
    z = 0
    # Bug: the loop iterates over fibs while the body appends to it,
    # so it never runs out of elements and never terminates; n is
    # never used as a stopping condition. fibs grows without bound.
    for z in fibs:
        z = fibs[-1] + fibs[-2]
        fibs.append(z)
    return fibs

These are beginning computer science students, so I can’t require them to do anything extra before they run their code, like use nice or something. I need to solve it at the system level without them knowing they are being limited.

Bit old-fashioned here, but it may not be a CPU issue so much as a memory issue. Sure, these days everyone pretends that memory is infinite, but you are appending an element on each pass, so the fibs array grows without limit. OOM situations can be impossible to break into, even with full root privileges. I used to see this kind of lockup on diskless compute nodes whenever there was a memory leak, which is effectively what you have.


You are right, Martin! Not CPU, but memory. Does that open up other avenues to consider? I guess what I would have expected in such a case is for the OOM killer to jump in and at least kill something, but that didn’t happen on any of the machines.

ChatGPT is recommending the following sysctl changes, given that I have 32GB of RAM and 32GB of swap (applying them is sketched after the list):

  • vm.overcommit_memory: Changing this to 2 (from the default, 0) will prevent overcommitting of memory, making the kernel stricter about memory allocation requests.
  • vm.oom_kill_allocating_task: Changing this to 1 (from the default, 0) will make the OOM killer target the task that triggered the out-of-memory condition, potentially making it more responsive in situations where a single task rapidly consumes all available memory.
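If I go that route, applying them would look something like this (a sketch; the file name under /etc/sysctl.d is arbitrary):

# test at runtime first
sysctl -w vm.overcommit_memory=2
sysctl -w vm.oom_kill_allocating_task=1

# then persist across reboots
printf 'vm.overcommit_memory = 2\nvm.oom_kill_allocating_task = 1\n' > /etc/sysctl.d/90-oom.conf
sysctl --system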

CGroups should be able to limit the amount of memory (and memory+swap) that a session can allocate. See https://www.redhat.com/en/blog/world-domination-cgroups-rhel-8-welcome-cgroups-v2

Since you will do that for all users, they don’t have to do a thing.
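For illustration, a sketch of a blanket per-user cap (assumes cgroups v2 is enabled as described in that post, and a systemd new enough to support user-.slice.d drop-ins; the limits are placeholders):

# /etc/systemd/system/user-.slice.d/50-limits.conf
[Slice]
MemoryMax=6G
MemorySwapMax=2G
CPUQuota=200%

Then systemctl daemon-reload. On an older systemd you can set the same properties at runtime instead, e.g. systemctl set-property user-1000.slice MemoryMax=6G.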

Everything grinds to a halt when the system starts allocating memory from swap. It then takes a long time to reach the “all RAM and swap in use” state, at which point the OOM killer finally activates.
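You can watch this happening (a sketch): run vmstat and keep an eye on the swap columns.

# sustained nonzero si/so (swap-in/swap-out) columns mean the machine
# is thrashing in swap, often long before the OOM killer fires
vmstat 1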


As a Python programmer, I see a few things they could do to make this more efficient. If they just need the output, they can use yield and only update the variables needed to compute the next value, rather than constantly iterating over the same ever-growing array (see the sketch below).
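For illustration, a minimal generator version (a sketch; the function name just mirrors the student code):

def fibonacci(n):
    # yield the first n Fibonacci numbers, keeping only two variables
    a, b = 0, 1
    for _ in range(n):
        yield a
        a, b = b, a + b

print(list(fibonacci(10)))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]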

Since it needs to be solved at the system level, cgroups is the way to go.

Are you going to tell them what you had to do afterwards, or release them into the wild like that?

(just curious)

lol, I am not the prof, just the sysadmin. I think the prof will educate them about the matter.


Hah, understood! This kind of stuff is part of why, at my day job, we moved Python workloads to JupyterHub with CPU and memory requests/limits in the Helm chart. If you want to go that route eventually, k3s runs on AlmaLinux and is a simple way to get to Kubernetes. It can even run natively on a single node.
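For reference, a hedged sketch of what those limits look like with the Zero to JupyterHub chart (chart and value names assumed from that project; the numbers are placeholders):

helm upgrade --install jhub jupyterhub/jupyterhub \
  --set singleuser.cpu.limit=1 \
  --set singleuser.memory.limit=1G \
  --set singleuser.memory.guarantee=256M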