Hello,
Recently we noticed that using cp on a directory with many files takes much longer on AlmaLinux 9.3 than it did on Oracle Linux 7.9. The machines are the same; only the OS was reinstalled.
After trying many things, I copied the cp binary from Oracle Linux 7 onto the AlmaLinux 9.3 machine, and the copy time went from 830 minutes (with Alma's cp) to 14 minutes.
What could possibly have changed in cp that would have such an impact?
Oracle Linux cp:
cp (GNU coreutils) 8.22
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Torbjörn Granlund, David MacKenzie, and Jim Meyering.
AlmaLinux cp:
cp (GNU coreutils) 8.32
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later https://gnu.org/licenses/gpl.html.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Torbjorn Granlund, David MacKenzie, and Jim Meyering.
Thanks for any hints on how to make it as fast as it used to be.
It looks like nobody has picked up on this problem, but as a user I'm concerned. If this is a general problem with cp, everyone should be worried!
But I’m trying to understand, and I don’t have enough data.
How many files is “many files”? And how small are they?
What filesystem? And on what kind of devices?
Does rsync suffer from the same problem, or is it much faster?
Have you tried timing with fewer files, while still getting significantly different results between the cp versions?
Are there any relevant warnings or errors, in dmesg or elsewhere?
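To time a smaller case, a quick harness could help. This is only a sketch under assumptions: it builds a synthetic tree of small files (a stand-in for the gcc-13.2 tree, not a copy of it), runs each cp binary with -r into a fresh destination, and reports wall-clock seconds. The helper names, file size and layout are made up for illustration.

```python
import os
import shutil
import subprocess
import tempfile
import time


def make_small_tree(root, n_files, files_per_dir=100, size=512):
    """Create n_files small files under root, spread over subdirectories.

    Hypothetical stand-in for a source tree: every file has the same size.
    """
    payload = b"x" * size
    for i in range(n_files):
        sub = os.path.join(root, f"d{i // files_per_dir:04d}")
        os.makedirs(sub, exist_ok=True)
        with open(os.path.join(sub, f"f{i:06d}.c"), "wb") as fh:
            fh.write(payload)


def time_cp(cp_binary, src, dst, extra_args=()):
    """Run `<cp_binary> -r [extra_args] src dst` and return elapsed seconds."""
    start = time.perf_counter()
    subprocess.run([cp_binary, "-r", *extra_args, src, dst], check=True)
    return time.perf_counter() - start


def compare_cp(binaries, n_files, work_dir=None, extra_args=()):
    """Build one test tree in work_dir, copy it with each binary, return {binary: seconds}."""
    base = tempfile.mkdtemp(dir=work_dir)  # work_dir should be on the mount under test
    try:
        src = os.path.join(base, "src")
        make_small_tree(src, n_files)
        results = {}
        for i, binary in enumerate(binaries):
            dst = os.path.join(base, f"dst{i}")  # fresh destination per binary
            results[binary] = time_cp(binary, src, dst, extra_args)
        return results
    finally:
        shutil.rmtree(base)  # clean up the test tree and all copies
```

For example, compare_cp(["/usr/bin/cp", "/path/to/ol7/cp"], 5000, work_dir="/mnt/nfs/scratch") would time both binaries on the same tree (both the OL7 path and the scratch path are hypothetical). The source tree is read more than once, so client-side caching may favor later runs; swapping the order is a cheap sanity check.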
Hello,
For the tests, I'm using the gcc-13.2 tarball, which has around 120,000 files. I don't know their sizes, but they are source code, so most of them are probably not very big.
The filesystem that shows this problem is an NFS share served by TrueNAS. I tried a few NFS clients in case it was specific to one machine, and now I'm running all the tests from the same machine so that I have numbers I can compare.
rsync works as expected. scp is as slow as cp (I used scp without a hostname, so maybe it calls cp underneath).
I haven't tried with fewer files.
I haven't seen anything that seems relevant in the logs, but I could have missed it.
I downloaded the source code of coreutils 8.32 (the same version as Alma's), compiled it, and this build seems to be fine. I also compiled coreutils 9.0, and that version is slow (15m25.343s vs 954m49.535s). I don't know what that means, but it might be related. I also tried the CentOS Stream cp, and it behaves the same as the AlmaLinux one.
I also asked someone in a different department who is also using AlmaLinux 9 with NFS, and it seems to run at normal speed for them. Maybe we can learn something by comparing our systems.
So the problem only occurs on non-local filesystems?
I’m pretty sure that scp does not call cp.
It's very strange that 8.32 recompiled works fine but 9.0 does not. CentOS Stream 9 still has 8.32, so it makes sense that it behaves like Alma's. But why would compiling it by hand give different results?
You might be right about the acl patch; I was thinking of something ACL-related, and of a backport of some code from a more recent version.
With a manual mount using larger wsize and rsize options, it looks like the server rejects the higher values and substitutes lower ones.
We'll have to see whether it's possible to change that, but I don't want to do it on production servers, so we'll have to figure out how to test this.
I think the acl patch is worth looking into in more detail as well.
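To check which rsize/wsize the client actually ended up with after a manual mount, the Linux NFS client lists the values in effect in /proc/mounts (nfsstat -m shows them too). The sketch below parses that table; the function names are hypothetical, and mount points containing spaces stay escaped (e.g. \040) because it does no unescaping.

```python
def nfs_io_sizes(mounts_text):
    """Return {mount_point: (rsize, wsize)} for nfs/nfs4 entries in a mounts table."""
    sizes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue  # skips non-NFS mounts, including the "nfsd" pseudo-filesystem
        opts = dict(
            opt.split("=", 1) if "=" in opt else (opt, None)
            for opt in fields[3].split(",")
        )
        rsize = int(opts["rsize"]) if opts.get("rsize") else None
        wsize = int(opts["wsize"]) if opts.get("wsize") else None
        sizes[fields[1]] = (rsize, wsize)
    return sizes


def read_nfs_io_sizes(path="/proc/mounts"):
    """Read the live mount table and return the negotiated NFS I/O sizes."""
    with open(path) as fh:
        return nfs_io_sizes(fh.read())
```

Calling read_nfs_io_sizes() on the client right after the mount gives the values actually in use, which can then be compared with what was requested via -o rsize=...,wsize=....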
I forget how I stumbled across this thread, but I have successfully reproduced the observation against a TrueNAS-13.0-U6.2 NFS server (vers=4.2). When I tried to reproduce it against an Ubuntu 24.04 NFS server, it seemed unaffected (vers=4.2).
[root@alma9 export]# time timeout 10m cp --reflink=never -r gcc-13.2.0 gcc-13.2.0c && time timeout 10m cp -r gcc-13.2.0 gcc-13.2.0d
real 4m41.712s
user 0m1.410s
sys 0m28.554s
real 3m15.751s
user 0m1.020s
sys 0m21.993s
[root@alma9 export]# uname -a
Linux localhost 5.14.0-427.33.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Aug 30 09:45:56 EDT 2024 x86_64 x86_64 x86_64 GNU/Linux
Another observation: when cp is not given --reflink=never, it calls copy_file_range() instead of just opening the files and doing reads and writes. According to the man page, this uses a server-side copy when supported (and it looks like the Ubuntu server supports it, since the copy was faster without --reflink=never).
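The two code paths described here can be sketched in Python (3.8+ on Linux, where os.copy_file_range wraps the syscall). This is not coreutils' logic: copy_range() hands the work to copy_file_range(), which the kernel may turn into a server-side copy on NFS, while copy_read_write() pulls every byte to the client and writes it back, roughly what the thread describes for --reflink=never. The fallback errno set, the chunk sizes and all function names are assumptions.

```python
import errno
import os

# Assumed set of errors after which we fall back to plain read/write;
# this mirrors the idea of a fallback, not coreutils' exact rules.
_FALLBACK_ERRNOS = {errno.ENOSYS, errno.EXDEV, errno.EOPNOTSUPP, errno.EINVAL}


def copy_read_write(src_fd, dst_fd, bufsize=128 * 1024):
    """Copy by reading into user space and writing back out."""
    total = 0
    while True:
        chunk = os.read(src_fd, bufsize)
        if not chunk:
            return total
        view = memoryview(chunk)
        while view:  # os.write may write fewer bytes than requested
            n = os.write(dst_fd, view)
            view = view[n:]
        total += len(chunk)


def copy_range(src_fd, dst_fd, chunk=1 << 30):
    """Copy with copy_file_range(); the kernel may offload this (e.g. NFS server-side copy)."""
    total = 0
    while True:
        try:
            n = os.copy_file_range(src_fd, dst_fd, chunk)
        except OSError as exc:
            # Only fall back if nothing was copied yet, so file offsets are untouched.
            if total == 0 and exc.errno in _FALLBACK_ERRNOS:
                return copy_read_write(src_fd, dst_fd)
            raise
        if n == 0:  # end of source file
            return total
        total += n


def copy_file(src, dst, use_copy_file_range=True):
    """Copy src to dst and return the number of bytes copied."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        s, d = fin.fileno(), fout.fileno()
        if use_copy_file_range and hasattr(os, "copy_file_range"):
            return copy_range(s, d)
        return copy_read_write(s, d)
```

Timing copy_file(src, dst, True) against copy_file(src, dst, False) on the same file on each server could show whether the difference comes from the copy_file_range() path itself rather than from cp.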