NFS Server having "D"-State processes

Hello,
we have a few AlmaLinux 9.5 NFS servers providing a share for other VMs and OpenShift worker nodes. We never experienced any issues when providing the share to other VMs only.
We already have a case open with Red Hat.

We noticed an issue on the NFS server side which leaves a number of nfsd processes stuck in “D” state.

Below you can see how the nfsd processes looked when we first experienced the issue.

root          12  0.0  0.0      0     0 ?        I    Feb13   0:31 [kworker/u16:1-nfsd4_callbacks]
root        1391  0.0  0.0   5456  1536 ?        Ss   Feb13   0:00 /usr/sbin/nfsdcld
root        1418  0.0  0.0      0     0 ?        I    Feb13   0:40 [nfsd]
root        1419  0.0  0.0      0     0 ?        I    Feb13   1:56 [nfsd]
root        1420  0.0  0.0      0     0 ?        I    Feb13  13:47 [nfsd]
root        1421  0.0  0.0      0     0 ?        I    Feb13  74:22 [nfsd]
root        1422  0.0  0.0      0     0 ?        D    Feb13   9:17 [nfsd]
root        1423  0.0  0.0      0     0 ?        D    Feb13  39:39 [nfsd]
root        1424  0.1  0.0      0     0 ?        D    Feb13 348:50 [nfsd]
root        1425  0.5  0.0      0     0 ?        D    Feb13 1183:44 [nfsd]
root     2363338  0.0  0.0   6408  2432 pts/0    S+   14:27   0:00 grep --color=auto nfsd
root     4062210  0.0  0.0      0     0 ?        I    Jun01   0:00 [kworker/u16:2-nfsd4_callbacks]

When the hang occurs, we see the following on the client side. Since the hang started, the NFS state manager thread has been occupying a full CPU; its kernel stack, sampled at two different moments, follows the ps output:

  root      735317 98.8  0.0      0     0 ?        R    Jul08 1300:43 [10.4.16.22-mana]


[<0>] kmem_cache_alloc+0x24e/0x300
[<0>] rpc_new_task+0x62/0xb0 [sunrpc]
[<0>] rpc_run_task+0x48/0x1c0 [sunrpc] // also: rpc_run_task+0x1b/0x1c0 [sunrpc], rpc_run_task+0x150/0x1c0 [sunrpc]
[<0>] nfs4_call_sync_sequence+0x76/0xb0 [nfsv4]
[<0>] _nfs41_test_stateid+0xb7/0x140 [nfsv4]
[<0>] nfs41_test_and_free_expired_stateid+0xe2/0x170 [nfsv4]
[<0>] nfs_server_reap_expired_delegations+0x10e/0x1e0 [nfsv4]
[<0>] nfs_client_for_each_server+0x48/0x120 [nfs]
[<0>] nfs4_state_manager+0x731/0x840 [nfsv4]
[<0>] nfs4_run_state_manager+0xa1/0x150 [nfsv4]
[<0>] kthread+0xd6/0x100
[<0>] ret_from_fork+0x1f/0x30

[<0>] rpc_wait_bit_killable+0x1e/0xb0 [sunrpc]
[<0>] __rpc_execute+0xab/0x350 [sunrpc]
[<0>] rpc_execute+0xc5/0xf0 [sunrpc]
[<0>] rpc_run_task+0x150/0x1c0 [sunrpc]
[<0>] nfs4_call_sync_sequence+0x76/0xb0 [nfsv4]
[<0>] _nfs41_test_stateid+0xb7/0x140 [nfsv4]
[<0>] nfs41_test_and_free_expired_stateid+0xe2/0x170 [nfsv4]
[<0>] nfs_server_reap_expired_delegations+0x10e/0x1e0 [nfsv4]
[<0>] nfs_client_for_each_server+0x48/0x120 [nfs]
[<0>] nfs4_state_manager+0x731/0x840 [nfsv4]
[<0>] nfs4_run_state_manager+0xa1/0x150 [nfsv4]
[<0>] kthread+0xd6/0x100
[<0>] ret_from_fork+0x1f/0x30

We updated the service. In that update we reduced the replica count of the deployments from 3 to 1, because we suspected that running more than one replica could negatively impact the NFS server. We also increased the memory limit from 4G to 10G.

One pod was not able to terminate successfully, because the “umount” is hanging on the worker node. We also have a hanging “D” state nfsd process on the server side. Below you will find the /proc/&lt;pid&gt;/stack of the nfsd process on the server and of the “umount” process on the worker node.

root     3250961  0.0  0.0  10168  2832 ?        D    Jul09   0:02 /sbin/umount.nfs4 /var/lib/kubelet/pods/4cb2f6d0-d69a-4ce8-8085-ffb683941890/volumes/kubernetes.io~nfs/data-vol-storageservice-fs-private-2
[root@pat2-computegpu-201 ~]# cat /proc/3250961/stack 
[<0>] rpc_wait_bit_killable+0x1e/0xb0 [sunrpc]
[<0>] __rpc_execute+0x117/0x350 [sunrpc]
[<0>] rpc_execute+0xc5/0xf0 [sunrpc]
[<0>] rpc_run_task+0x150/0x1c0 [sunrpc]
[<0>] rpc_call_sync+0x51/0xb0 [sunrpc]
[<0>] nfs4_destroy_clientid+0x7f/0x1a0 [nfsv4]
[<0>] nfs4_free_client+0x21/0xb0 [nfsv4]
[<0>] nfs_free_server+0x54/0xb0 [nfs]
[<0>] nfs_kill_super+0x2d/0x40 [nfs]
[<0>] deactivate_locked_super+0x2e/0xa0
[<0>] cleanup_mnt+0x131/0x190
[<0>] task_work_run+0x59/0x90
[<0>] exit_to_user_mode_loop+0x122/0x130
[<0>] exit_to_user_mode_prepare+0xb6/0x100
[<0>] syscall_exit_to_user_mode+0x12/0x40
[<0>] do_syscall_64+0x69/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x69/0xd3


[root@pat2-cloudfs-102]# ps aux | grep nfsd
root         960  0.0  0.0   5456  2176 ?        Ss   Jul04   0:00 /usr/sbin/nfsdcld
root        1187  0.0  0.0      0     0 ?        I    Jul04   0:40 [nfsd]
root        1188  0.0  0.0      0     0 ?        I    Jul04   0:56 [nfsd]
root        1189  0.0  0.0      0     0 ?        I    Jul04   1:37 [nfsd]
root        1190  0.0  0.0      0     0 ?        I    Jul04   3:15 [nfsd]
root        1191  0.1  0.0      0     0 ?        I    Jul04   8:51 [nfsd]
root        1192  0.1  0.0      0     0 ?        D    Jul04  11:44 [nfsd]
root        1193  0.8  0.0      0     0 ?        I    Jul04  76:02 [nfsd]
root        1194  2.9  0.0      0     0 ?        I    Jul04 256:54 [nfsd]
root     1714913  0.0  0.0      0     0 ?        I    Jul08   0:01 [kworker/u16:0-nfsd4_callbacks]
root     1861963  0.0  0.0      0     0 ?        I    Jul09   0:00 [kworker/u16:1-nfsd4_callbacks]
root     2239081  0.0  0.0   6408  2432 pts/4    S+   09:29   0:00 grep --color=auto nfsd

[root@pat2-cloudfs-102]# cat /proc/1192/stack 
[<0>] nfsd4_shutdown_callback+0xa3/0x120 [nfsd]
[<0>] __destroy_client+0x1f3/0x290 [nfsd]
[<0>] nfsd4_destroy_clientid+0xe2/0x1c0 [nfsd]
[<0>] nfsd4_proc_compound+0x44b/0x700 [nfsd]
[<0>] nfsd_dispatch+0xe6/0x220 [nfsd]
[<0>] svc_process_common+0x2e4/0x650 [sunrpc]
[<0>] svc_process+0x12d/0x170 [sunrpc]
[<0>] svc_handle_xprt+0x448/0x580 [sunrpc]
[<0>] svc_recv+0x17a/0x2c0 [sunrpc]
[<0>] nfsd+0x84/0xb0 [nfsd]
[<0>] kthread+0xdd/0x100
[<0>] ret_from_fork+0x29/0x50

Any help would be much appreciated!
If you need more information, feel free to ask.

Kind regards,
Marc

Network file systems are some of the fussiest things to identify problems in. The inherent nature of a shared network backbone, with processes on either end trying to have completely flawless communication, leads to some pathological failure points, of which the “D” state is one.

The “D” state of a process indicates that it is in an “uninterruptible sleep”, which is usually caused by an I/O deadlock or another issue that makes the process wait “forever” for another process to finish its I/O task. On network file systems I have seen this most often when a client system hangs or dies in the middle of a write to storage (it starts the write but doesn’t complete it), or when network problems on a client cause TCP segments to be retransmitted. Since the total time TCP will keep retransmitting before declaring a connection dead can be in the 15 to 30 minute range, the process never manages to complete and ends up in this uninterruptible sleep state. That hangs umount on the client side, and then hangs the server process as well.
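To see at a glance which tasks are in that state and where they are stuck, you can walk /proc; this is a generic sketch using only standard procfs files (nothing here is NFS-specific), and reading /proc/&lt;pid&gt;/stack requires root:

```shell
#!/bin/sh
# List every task in uninterruptible sleep ("D") together with its
# kernel stack. The State: line in /proc/<pid>/status reads e.g.
# "State:  D (disk sleep)" for the processes we are after.
for pid in /proc/[0-9]*; do
    if grep -q '^State:[[:space:]]*D' "$pid/status" 2>/dev/null; then
        printf '=== PID %s (%s)\n' "${pid#/proc/}" "$(cat "$pid/comm")"
        cat "$pid/stack" 2>/dev/null
    fi
done
```

Running this periodically (e.g. from cron) while the problem builds up shows whether the D-state set is stable or growing, and whether all stuck tasks share the same stack.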

In my experience, after more than 25 years with networked file systems, about the only thing you can do is try to isolate where the problem originates. Most of the time I did not have the time for an exhaustive diagnosis; I would swap out the entire network hardware on the client side if only one client system was affected, or on the server side if it happened with several clients accessing the same server. One time the cause was another system misconfigured with an incorrect IP address, and another time it was an unrelated system jabbering over any and all traffic on the network.
Looking at the stack traces from your servers, you can see they are inside an nfs4_call_sync procedure, which guards an operation that needs to be as atomic as possible. Since RPC in general is anything but atomic for many operations, anything that interrupts communication can have a major impact, often leaving processes hung waiting for the other side to finish something they think is already done.

The first place I would look is the network hardware. A failing network interface will often cause I/O issues with NFS long before it actually quits for good, and it could be on the server, on the client, or even on any other system on the same network segment. Look at the traffic with Wireshark and monitor the NFS protocol between the systems that seem to have the most problems. If you start seeing TCP errors (retransmissions, duplicate ACKs, resets), you may be able to pinpoint which system is causing them. Otherwise, start swapping out the pieces that make up the network path. I have seen this general type of problem caused by anything from a bad cable termination to network cards to a failing switch, but always in the hardware someplace.
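As a concrete starting point for that kind of check, something like the following works. nfsstat ships with nfs-utils; the interface name and server address in the capture are placeholders for your environment:

```shell
# RPC retransmission counters on an NFS client: a steadily climbing
# "retrans" value points at packet loss or timeouts on the path.
# nfsstat -rc prints a header line containing "retrans", then the values.
nfsstat -rc | awk '/retrans/ { getline; print "retransmissions:", $2 }'

# Capture the NFS traffic itself for inspection in Wireshark
# (interface ens192 and server 10.4.16.22 are placeholders):
tcpdump -i ens192 -s 0 -w /tmp/nfs.pcap host 10.4.16.22 and port 2049
```

Comparing the retrans counter before and after a hang, and on affected vs. unaffected clients, narrows down whether the network path is actually losing traffic.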

Hi :slight_smile:

When I see those two sentences, after six years on Linux, the first thing that comes to mind is disk I/O thrashing, which I dug into a few years back; it also happens on Linux desktops that are low on free memory.
I don’t remember whether it was Linus Torvalds or one of his collaborators back in the late ’90s who talked about it; they called it “swap death” in an e-mail.

So I’m curious whether @mcm is low on free memory when it happens?

The pods sometimes use up to 9G, but that’s to be expected, especially under load. The worker nodes are not at all exhausted in resources, and the NFS server also has half of its memory free.

90% memory usage sounds fine… as they say, free memory is wasted memory.

I haven’t gotten my hands dirty on Linux servers yet, just on desktop/laptop clients, so take my post as only what I have picked up so far; I might be far off. :innocent:

But my take so far with Linux is…

With under 16 GB of RAM and 98% memory usage, you risk disk I/O thrashing when an event creates a memory spike into those last 2% free, and the system starts to swap to prevent an OOM kill.
With an SSD this can show up as a freeze or hiccup of 1–60 seconds; on an HDD it can become a freeze of anywhere from a minute to an hour or more.
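One way to check for that kind of memory pressure on the server or worker node while the hang is happening, sketched with standard tools (/proc/pressure needs a kernel with PSI, 4.20+, which EL9 has):

```shell
# MemAvailable is the kernel's estimate of memory usable without swapping;
# if it collapses toward zero during the hang, memory pressure is a suspect.
awk '/^MemAvailable/ { printf "available: %.1f GiB\n", $2 / 1048576 }' /proc/meminfo

# Pressure-stall information: a nonzero "full" average means tasks were
# completely stalled waiting for memory.
cat /proc/pressure/memory 2>/dev/null

# The si/so columns are swap-in/swap-out per second; sustained nonzero
# values mean the box is actively thrashing.
vmstat 5 3
```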

Also, when using ZFS, my observation (from others) is that on an HDD pool SMB is sometimes faster than NFS, but with a lot of tiny writes NFS has the lead.
An HDD pool might also become faster if you set sync=disabled, if the pool doesn’t have an SLOG device.

Another thing that can slow a pool down is SMR drives in the ZFS pool.
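For completeness, the property change mentioned above looks like this (pool/dataset name is a placeholder). Be aware that sync=disabled trades away synchronous write guarantees, which is a real risk for an NFS-backed dataset:

```shell
# Check and change the ZFS sync property (tank/nfsshare is a placeholder).
# WARNING: sync=disabled acknowledges synchronous writes before they reach
# stable storage, so a crash or power loss can silently drop recent writes.
zfs get sync tank/nfsshare
zfs set sync=disabled tank/nfsshare
```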

If I’m wrong in my take on this, I would love it if someone corrected me, as I’m here to learn more than I know so far… :slightly_smiling_face: :slightly_smiling_face: :slightly_smiling_face: