Random server reboots with coredump

Hello,

We have been seeing multiple random reboots on a KVM hypervisor running AlmaLinux 8.10 with kernel 4.18.0-553.109.1.el8_10.x86_64. Any ideas?

[   25.386651] SVM: TSC scaling supported
[   25.386655] kvm: Nested Virtualization enabled
[   25.386656] SVM: kvm: Nested Paging enabled
[   25.386764] SVM: Virtual VMLOAD VMSAVE supported
[   25.386764] SVM: Virtual GIF supported
[   25.386765] SVM: LBR virtualization supported

[93389.801393] Call Trace:
[93389.804145] ? __die_body+0x1a/0x60
[93389.808069] ? no_context+0x1ba/0x3f0
[93389.812188] ? __update_load_avg_cfs_rq+0x27a/0x300
[93389.817673] ? __bad_area_nosemaphore+0x157/0x180
[93389.822959] ? do_page_fault+0x37/0x12d
[93389.827268] ? page_fault+0x1e/0x30
[93389.831190] ? kvm_apic_accept_pic_intr+0x13/0x60 [kvm]
[93389.837078] kvm_cpu_has_extint+0x13/0x70 [kvm]
[93389.842181] kvm_cpu_has_interrupt+0xe/0x30 [kvm]
[93389.847479] kvm_arch_vcpu_runnable+0x173/0x1d0 [kvm]
[93389.853170] kvm_vcpu_check_block+0x26/0x90 [kvm]
[93389.858468] kvm_vcpu_halt+0xb7/0x390 [kvm]
[93389.863181] kvm_arch_vcpu_ioctl_run+0x5a7/0x600 [kvm]
[93389.868966] kvm_vcpu_ioctl+0x2c9/0x640 [kvm]
[93389.873870] ? restore_sigcontext+0x15e/0x1c0
[93389.878766] do_vfs_ioctl+0xa4/0x690
[93389.882787] ? syscall_trace_enter+0x1ff/0x2d0
[93389.887782] ksys_ioctl+0x64/0xa0
[93389.891509] __x64_sys_ioctl+0x16/0x20
[93389.895922] do_syscall_64+0x5b/0x1d0
[93389.900191] entry_SYSCALL_64_after_hwframe+0x66/0xcb

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 320
On-line CPU(s) list: 0-319
Thread(s) per core: 2
Core(s) per socket: 160
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
BIOS Vendor ID: Advanced Micro Devices, Inc.
CPU family: 26
Model: 17
Model name: AMD EPYC 9845 160-Core Processor
BIOS Model name: AMD EPYC 9845 160-Core Processor
Stepping: 0
CPU MHz: 2100.000
CPU max MHz: 3718.0659
CPU min MHz: 1500.0000
BogoMIPS: 4193.72
Virtualization: AMD-V
L1d cache: 48K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 32768K
NUMA node0 CPU(s): 0-319
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx_vnni avx512_bf16 clzero irperf xsaveerptr wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect movdiri movdir64b overflow_recov succor smca fsrm avx512_vp2intersect flush_l1d amd_lbr_pmc_freeze

[319357.182229] Call Trace:
[319357.185164]  ? __die_body+0x1a/0x60
[319357.189271]  ? no_context+0x1ba/0x3f0
[319357.193572]  ? __bad_area_nosemaphore+0x157/0x180
[319357.199043]  ? do_page_fault+0x37/0x12d
[319357.203538]  ? page_fault+0x1e/0x30
[319357.207648]  ? svm_interrupt_blocked+0xb0/0xb0 [kvm_amd]
[319357.213805]  ? svm_interrupt_blocked+0xe/0xb0 [kvm_amd]
[319357.219861]  ? kvm_apic_has_interrupt+0x44/0x90 [kvm]
[319357.225741]  svm_interrupt_allowed+0x1a/0x60 [kvm_amd]
[319357.231694]  kvm_arch_vcpu_runnable+0xee/0x1d0 [kvm]
[319357.237470]  kvm_vcpu_check_block+0x26/0x90 [kvm]
[319357.242948]  kvm_vcpu_halt+0xb7/0x390 [kvm]
[319357.247841]  kvm_arch_vcpu_ioctl_run+0x5a7/0x600 [kvm]
[319357.253807]  kvm_vcpu_ioctl+0x2c9/0x640 [kvm]
[319357.258898]  ? pollwake+0x74/0xa0
[319357.262810]  ? wake_up_q+0x60/0x60
[319357.266815]  ? __wake_up_common+0x7a/0x190
[319357.271604]  do_vfs_ioctl+0xa4/0x690
[319357.275809]  ? syscall_trace_enter+0x1ff/0x2d0
[319357.280986]  ksys_ioctl+0x64/0xa0
[319357.284895]  __x64_sys_ioctl+0x16/0x20
[319357.289292]  do_syscall_64+0x5b/0x1d0
[319357.293588]  entry_SYSCALL_64_after_hwframe+0x66/0xcb

Looking at this stack trace:

do_syscall_64
__x64_sys_ioctl
ksys_ioctl
do_vfs_ioctl
kvm_vcpu_ioctl
kvm_arch_vcpu_ioctl_run
kvm_vcpu_halt
kvm_vcpu_check_block
kvm_arch_vcpu_runnable
svm_interrupt_allowed [kvm_amd]

it looks like the problem is happening on the KVM host side, in the host kernel’s KVM/AMD SVM path.
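
For anyone reading the trace themselves: frames prefixed with `?` are addresses the unwinder found on the stack but could not confirm as part of the call chain, which is why I only listed the unprefixed ones. Here is a rough Python sketch of that filtering. The function name and the regex are my own, written against the dmesg format pasted above, not taken from any existing tool.

```python
import re

# Matches one dmesg call-trace line, e.g.
#   [319357.225741]  svm_interrupt_allowed+0x1a/0x60 [kvm_amd]
# The optional "? " marks a frame the unwinder could not confirm.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+"
    r"(?P<guess>\?\s+)?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?\s*$"
)


def reliable_frames(trace_text):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m is None or m.group("guess"):
            continue  # not a frame line, or an unconfirmed frame
        frames.append((m.group("sym"), m.group("mod")))
    return frames


# A few lines copied from the second trace in this thread.
SAMPLE = """\
[319357.213805]  ? svm_interrupt_blocked+0xe/0xb0 [kvm_amd]
[319357.219861]  ? kvm_apic_has_interrupt+0x44/0x90 [kvm]
[319357.225741]  svm_interrupt_allowed+0x1a/0x60 [kvm_amd]
[319357.231694]  kvm_arch_vcpu_runnable+0xee/0x1d0 [kvm]
"""

for sym, mod in reliable_frames(SAMPLE):
    print(sym, f"[{mod}]" if mod else "")
```

On the sample lines this keeps only svm_interrupt_allowed and kvm_arch_vcpu_runnable, which are the two innermost confirmed frames of the second trace.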

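I would first update to the latest available AlmaLinux 8.10 kernel and disable nested virtualization (and AVIC) if they are not required, by adding this line to a file under /etc/modprobe.d/ (for example /etc/modprobe.d/kvm_amd.conf):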

options kvm_amd nested=0 avic=0

Then reboot and check:

cat /sys/module/kvm_amd/parameters/nested
cat /sys/module/kvm_amd/parameters/avic
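
If you have several hypervisors to check, the same verification can be scripted. This is only a sketch: the helper names are mine, and I am assuming a disabled parameter reads back as either 0 or N, since the output depends on whether the kernel declares the parameter as an integer or a boolean.

```python
from pathlib import Path

# Assumption: a disabled parameter shows up as "0" (int) or "N" (bool).
DISABLED_VALUES = {"0", "N"}


def read_param(name, base="/sys/module/kvm_amd/parameters"):
    """Return the parameter value as a string, or None if it cannot be read."""
    try:
        return (Path(base) / name).read_text().strip()
    except OSError:
        return None  # module not loaded, or parameter not exposed


def is_disabled(value):
    """True only when the value is one of the known 'off' spellings."""
    return value in DISABLED_VALUES


if __name__ == "__main__":
    for name in ("nested", "avic"):
        value = read_param(name)
        if value is None:
            print(f"{name}: not readable (is kvm_amd loaded?)")
        else:
            state = "disabled" if is_disabled(value) else "still enabled"
            print(f"{name}={value} ({state})")
```

The base directory is a parameter only so the helper can be pointed at a test directory; on a real host the default path is the one used in the cat commands above.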

Nested virtualization on RHEL 8 is Technology Preview, so I would avoid using it on a production KVM host unless it is strictly required.

I suspected the same, so I disabled nested virtualization. Based on my online research, I had concluded that nested virtualization isn’t ready for production use, but for some reason I didn’t come across any mention of the Technology Preview status. Thank you for pointing it out. Now I know 🙂