I’m stuck integrating the NVIDIA drivers and CUDA on AlmaLinux 9.5 for a small local LLM project, on a Dell T340 server with an RTX 3080 GPU.
I’ve tried many installation methods and settings (the AlmaLinux wiki, various web comments, NVIDIA’s drivers directly, …).
Secure Boot is disabled!
Any ideas, please?
$uname -a
Linux sirius 5.14.0-503.26.1.el9_5.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Mar 3 05:56:39 EST 2025 x86_64 x86_64 x86_64 GNU/Linux
$dmesg | grep nvidia
$
$lsmod | grep nvidia
$
$sudo nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
You have an application that needs the CUDA runtime libraries to run (and possibly the devel packages to build).
The CUDA runtime requires NVidia’s GPU driver (the Nouveau driver that is included with Alma will not do).
In principle any packaging of NVidia’s driver should do (NVidia’s, ELRepo’s, RPM Fusion’s), but it is easiest to use NVidia’s, as they are in the same repo as the CUDA packages.
One needs EPEL, so dnf install epel-release
EPEL may need CRB, so crb enable
Then the CUDA repo. I have:
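As a sketch (not my exact repo file; this follows NVIDIA’s documented setup for EL9, and the module stream name may differ for newer driver generations):
$ sudo dnf config-manager --add-repo https://developer.download.nvidia.com/compute/cuda/repos/rhel9/x86_64/cuda-rhel9.repo
$ sudo dnf clean all
$ sudo dnf module install nvidia-driver:latest-dkms
$ sudo dnf install cuda-toolkit
$ sudo reboot
After the reboot, lsmod | grep nvidia and nvidia-smi should show the driver loaded.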
Hi,
Thank you for your response and your much-appreciated help.
The recommendations you sent have already been tried several times and in different ways, notably a manual install of the latest NVIDIA driver (NVIDIA-Linux-x86_64-570.124.04.run).
I always end up at the same point: the driver is present but inactive!
Some extracts from the system:
$ uname -a
Linux sirius 6.13.5-1.el9.elrepo.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Feb 27 12:59:45 EST 2025 x86_64 x86_64 x86_64 GNU/Linux
(Currently the display runs on the Matrox embedded video of the motherboard.)
$ nvidia-smi
NVIDIA-SMI has failed because it couldn’t communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
mars 06 11:32:44 sirius systemd-modules-load[332]: Failed to find module ‘nvidia-fs’
mars 06 11:32:48 sirius systemd-modules-load[760]: Failed to find module ‘nvidia-fs’
mars 06 11:32:50 sirius systemd[1]: Started nvidia-powerd service.
mars 06 11:32:50 sirius /usr/bin/nvidia-powerd[1253]: nvidia-powerd version:1.0(build 1)
mars 06 11:32:50 sirius /usr/bin/nvidia-powerd[1253]: Allocate client failed 89
mars 06 11:32:50 sirius /usr/bin/nvidia-powerd[1253]: Failed to initialize RM Client
mars 06 11:32:50 sirius systemd[1]: nvidia-powerd.service: Deactivated successfully.
mars 06 11:32:51 sirius akmods[1219]: Checking kmods exist for 6.13.5-1.el9.elrepo.x86_64Warning: Could not determine what package owns /lib/modules/6.13.5-1.el9.elrepo.x86_64/extra/nvidia/ [ OK ]
mars 06 11:32:51 sirius akmods[1219]: Checking kmods exist for 5.14.0-503.26.1.el9_5.x86_64Warning: Could not determine what package owns /lib/modules/5.14.0-503.26.1.el9_5.x86_64/extra/nvidia/ [ OK ]
mars 06 11:32:51 sirius dkms[1939]: Deprecated feature: REMAKE_INITRD (/var/lib/dkms/nvidia-fs/2.24.2/source/dkms.conf)
mars 06 11:32:51 sirius dkms[2005]: Deprecated feature: REMAKE_INITRD (/var/lib/dkms/nvidia-fs/2.24.2/source/dkms.conf)
mars 06 11:32:51 sirius dkms[1229]: Autoinstall of module nvidia-fs/2.24.2 for kernel 6.13.5-1.el9.elrepo.x86_64 (x86_64)
$ dkms status
/usr/bin/which: no ofed_info in (/home/henri/.local/bin:/home/henri/bin:/usr/share/Modules/bin:/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin)
Deprecated feature: REMAKE_INITRD (/var/lib/dkms/nvidia-fs/2.24.2/source/dkms.conf)
nvidia/570.124.06, 5.14.0-503.23.2.el9_5.x86_64, x86_64: installed
nvidia/570.124.06, 5.14.0-503.26.1.el9_5.x86_64, x86_64: installed
nvidia/570.124.06, 6.13.5-1.el9.elrepo.x86_64, x86_64: installed (Differences between built and installed modules)
nvidia-fs/2.24.2, 5.14.0-503.26.1.el9_5.x86_64, x86_64: installed
Any idea?
Best regards
Henri
I have not touched that in years. It is a “source install”. The package manager (dnf) does not know anything about a source install, and thus one can trample the files of the other. It is more consistent, “managed”, to have all installs from RPM packages.
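If the .run installer has been used at some point, it can usually be backed out with its own uninstaller before moving to RPM-only packages. A sketch, assuming the 570.124.04 .run file you mentioned (untested on my side):
$ sudo nvidia-uninstall
or, equivalently, re-run the installer with its uninstall switch:
$ sudo sh NVIDIA-Linux-x86_64-570.124.04.run --uninstall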
Now you reveal that you also have a non-AlmaLinux kernel from ELRepo. That does not make things simpler. I don’t use those kernels anywhere, so I cannot advise about them.
Yes. The LLM needs CUDA libraries and the CUDA libs need the GPU drivers. That is a fact.
If you don’t want to use that GPU for displaying anything, then you have to configure the Window Manager / Display Manager / something to use the other GPU even when the NVidia GPU is available. I have never done such a config.
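As a guess (again, I have never done this myself), an Xorg snippet that pins the display to the Matrox by its PCI BusID might look like this; the BusID must match whatever lspci reports for the Matrox on your box, PCI:4:0:0 below is only an example:
$ lspci | grep -i vga
$ sudo tee /etc/X11/xorg.conf.d/10-matrox.conf <<'EOF'
Section "Device"
    Identifier "MatroxG200"
    Driver     "modesetting"
    BusID      "PCI:4:0:0"
EndSection
Section "Screen"
    Identifier "Screen0"
    Device     "MatroxG200"
EndSection
EOF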
PS. You can post your outputs with code tags (the </> button). It makes reading easier.
This is already the case: my display runs on the Dell server’s onboard Matrox VGA. But unfortunately, without functional drivers the NVIDIA card cannot be used even for pure CUDA computation! The card does not respond; no active cooling!
I will continue my investigations to try to understand this problem a little better!
Thanks again for your help
PS: if you know a Unix driver/kernel specialist, let me know.
Best Regards
Henri
I think I have narrowed down the scope of the problem: it seems to be at the level of integrating the module into the kernel. The dkms command is in error (consistent with the results of modprobe, modinfo and lsmod for nvidia):
journalctl -xeu dkms.service
Support: Help and Support | AlmaLinux Wiki
An ExecStart= process belonging to unit dkms.service has exited.
The process’ exit code is ‘exited’ and its exit status is 11.
mars 06 18:50:09 sirius systemd[1]: dkms.service: Failed with result ‘exit-code’.
Subject: Unit failed
Defined-By: systemd
Support: Help and Support | AlmaLinux Wiki
The unit dkms.service has entered the ‘failed’ state with result ‘exit-code’.
mars 06 18:50:09 sirius systemd[1]: Failed to start Builds and install new kernel modules through DKMS.
Subject: The unit dkms.service has failed
Defined-By: systemd
Support: Help and Support | AlmaLinux Wiki
When I tried to restart dkms, here is the result:
$ sudo systemctl status dkms
× dkms.service - Builds and install new kernel modules through DKMS
Loaded: loaded (/usr/lib/systemd/system/dkms.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Thu 2025-03-06 18:50:09 CET; 55s ago
Docs: man:dkms(8)
Process: 42747 ExecStart=/usr/sbin/dkms autoinstall --verbose --kernelver 6.13.5-1.el9.elrepo.x86_64 (code=exited, status=11)
Main PID: 42747 (code=exited, status=11)
CPU: 37.108s
mars 06 18:50:09 sirius dkms[50245]: Error! Bad return status for module build on kernel: 6.13.5-1.el9.elrepo.x86_64 (x86_64)
mars 06 18:50:09 sirius dkms[50245]: Consult /var/lib/dkms/nvidia-fs/2.24.2/build/make.log for more information.
mars 06 18:50:09 sirius dkms[42747]: Autoinstall on 6.13.5-1.el9.elrepo.x86_64 succeeded for module(s) nvidia.
mars 06 18:50:09 sirius dkms[42747]: Autoinstall on 6.13.5-1.el9.elrepo.x86_64 failed for module(s) nvidia-fs(10).
mars 06 18:50:09 sirius dkms[50246]: Error! One or more modules failed to install during autoinstall.
mars 06 18:50:09 sirius dkms[50246]: Refer to previous errors for more information.
mars 06 18:50:09 sirius systemd[1]: dkms.service: Main process exited, code=exited, status=11/n/a
mars 06 18:50:09 sirius systemd[1]: dkms.service: Failed with result ‘exit-code’.
mars 06 18:50:09 sirius systemd[1]: Failed to start Builds and install new kernel modules through DKMS.
mars 06 18:50:09 sirius systemd[1]: dkms.service: Consumed 37.108s CPU time.
Checking the make error file “/var/lib/dkms/nvidia-fs/2.24.2/build/make.log” shows:
Picking NVIDIA driver sources from NVIDIA_SRC_DIR=/usr/src/nvidia-570.124.06/nvidia. If that does not meet your expectation, you might have a stale driver still around and that might cause problems.
‘/lib/modules/6.13.5-1.el9.elrepo.x86_64/extra/nvidia.ko.xz’ → ‘./nvidia.ko.xz’
Now the question is: why does this make build fail?
Regards
Henri
That error is for kernel 6.13.5-1.el9.elrepo.x86_64. As said, it is very different from the AlmaLinux kernels. I would remove that kernel completely to see whether the normal kernels “behave”.
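A sketch of that, assuming the 6.13.5 kernel was installed as ELRepo’s kernel-ml packages (verify the exact package names first) and that you don’t actually need nvidia-fs, which is the GPUDirect Storage module and not required for plain CUDA:
$ rpm -qa | grep -i elrepo
$ sudo dnf remove kernel-ml\*
$ sudo dkms remove nvidia-fs/2.24.2 --all
$ sudo reboot
After rebooting into the stock 5.14 AlmaLinux kernel, lsmod | grep nvidia and nvidia-smi will show whether the driver loads there.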
Hi,
After laborious and fruitless tests and extended research, I conclude that the Dell T340 is inoperable with discrete GPU cards (tested with my RTX 3060 and RTX 3080!). Disabling the Matrox in the BIOS (BIOS_5HWMC_LN64_2.18.0) does not solve the problem; the disabling is only virtual, and the display is still there even with the GPU in PCIe slot 2.
Curious that the integrated Matrox graphics card cannot actually be disabled from the BIOS! [Probably Dell’s intention on the T340?]
The Linux lspci command confirms that the card is not recognized! Big disappointment. I am continuing the project, and this pushes me towards an assembled (DIY) hardware solution, which will probably also allow optimizing the hardware cost!
Thanks for your help
Best regards
It should be possible to have both Matrox and NVidia drivers (loaded), and force output to use the Matrox. I just don’t know what kind of config is required for that.
More importantly, the T340 is listed to have:
“Single or Dual Redundant 495W power supply or single 350W cabled power supply”
The recommended PSU for an RTX 3080 is 750 W, and the card alone consumes some 350 W. It does not look like the T340 was built to host powerful discrete GPUs.
I have tried a lot of BIOS modifications without success. I have two 450 W power supplies active and online. The BIOS/UEFI does not enumerate the RTX 3080 in the PCIe x16 slot 2. The card itself works and has been tested in a rather basic ASUS board!
On the first enumeration pass:
PCIe Slot 1: empty (yes, correctly empty)
PCIe Slot 2: card detected (RTX 3080)
PCIe Slot 3: card detected (Sound Blaster)
PCIe Slot 4: card detected (Dell RAID)
Later in the test, the PCIe bus test indicates: “Bus 02: link not reported, PCIE SLOT 02 empty”
I will try to contact DELL and ask them about the problem, and see if they can offer me a solution!
Regards
Henri
For the community of Dell users, this can perhaps be useful!
Yay, good news!!!
I added a dedicated power supply for the GPU, connected through the card’s 2x8-pin Y cable.
The BIOS/UEFI shell now sees the PCIe slot as occupied, as well as the card by name.
I can now say that, with a few hardware modifications, the Dell T340 supports GPU cards, in this case the NVIDIA RTX 3080.
Now the LLM/CUDA project on AlmaLinux that I had envisioned can begin…
Many thanks for the help with the specifics of dual power supplies for GPU cards.
Regards,
Sincerely,
Henri
Now:
Basic commands in the pre-boot UEFI shell indicate:
drivers : 21D 00060009 ? N N 0 0 NVIDIA GPU UEFI Driver PciRoot(0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)/nvgop-ga1xx
pci: 00 06 00 00 ==> Multimedia Device - Mixed mode device
Vendor 1102 Device 0012 Prog Interface 0
Type=9, Handle=0x901
Dump Structure as:
Index=23,Length=0x1E,Addr=0x6A806322
00000000: 09 11 01 09 01 B6 0B 04-04 02 00 04 01 00 00 01 …
00000010: 00 50 43 49 65 20 53 6C-6F 74 20 32 00 00 .PCIe Slot 2…
Structure Type: System Slots
Format part Len : 17
Structure Handle: 2305
SlotDesignation: PCIe Slot 2
System Slot Type: PCI Express Gen 3 X16
System Slot Data Bus Width: 8x or x8
System Slot Current Usage: In use
System Slot Length: Long Length
System Slot Type: PCI Express Gen 3 X16
Slot Id: the value present in the Slot Number field of the PCI Interrupt Routing table entry that is associated with this slot is: 2
Slot characteristics 1: Provides 3.3 Volts
Slot characteristics 2: PCI slot supports Power Management Enable (PME#) signal
SegmentGroupNum: 0x0
BusNum: 0x1
DevFuncNum: 0x0
AlmaLinux: $ lspci
00:00.0 Host bridge: Intel Corporation 8th Gen Core Processor Host Bridge/DRAM Registers (rev 07)
00:01.0 PCI bridge: Intel Corporation 6th-10th Gen Core Processor PCIe Controller (x16) (rev 07)
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor PCIe Controller (x8) (rev 07)
00:08.0 System peripheral: Intel Corporation Xeon E3-1200 v5/v6 / E3-1500 v5 / 6th/7th/8th Gen Core Processor Gaussian Mixture Model
00:12.0 Signal processing controller: Intel Corporation Cannon Lake PCH Thermal Controller (rev 10)
00:14.0 USB controller: Intel Corporation Cannon Lake PCH USB 3.1 xHCI Host Controller (rev 10)
00:14.2 RAM memory: Intel Corporation Cannon Lake PCH Shared SRAM (rev 10)
00:16.0 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller (rev 10)
00:16.4 Communication controller: Intel Corporation Cannon Lake PCH HECI Controller #2 (rev 10)
00:17.0 SATA controller: Intel Corporation Cannon Lake PCH SATA AHCI Controller (rev 10)
00:1c.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #1 (rev f0)
00:1c.1 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #2 (rev f0)
00:1c.7 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #8 (rev f0)
00:1d.0 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #9 (rev f0)
00:1e.0 Communication controller: Intel Corporation Cannon Lake PCH Serial IO UART Host Controller (rev 10)
00:1f.0 ISA bridge: Intel Corporation Cannon Point-LP LPC Controller (rev 10)
00:1f.4 SMBus: Intel Corporation Cannon Lake PCH SMBus Controller (rev 10)
00:1f.5 Serial bus controller: Intel Corporation Cannon Lake PCH SPI Controller (rev 10)
01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3080] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
03:00.0 PCI bridge: PLDA PCI Express Bridge (rev 02)
04:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
05:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
05:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
06:00.0 Audio device: Creative Labs CA0132 Sound Core3D [Sound Blaster Recon3D / Z-Series / Sound BlasterX AE-5 Plus] (rev 01)
07:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)
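With the card finally enumerated at 01:00.0, the next sanity checks on my side will simply be (for anyone following along):
$ sudo dmesg | grep -i nvidia
$ lsmod | grep nvidia
$ nvidia-smi
If nvidia-smi lists the RTX 3080, the CUDA side can be installed and checked from the same NVIDIA repo discussed earlier in the thread.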