GPU not working due to amdgpu: failed to read ip discovery binary from file

A terminal application running on the notebook's dedicated GPU crashed, and now the GPU cannot be started during the boot process.

This is the relevant boot information. How can this be fixed?

Jan 16 23:33:32 localhost kernel: amdgpu 0000:03:00.0: amdgpu: get invalid ip discovery binary signature from vram
Jan 16 23:33:32 localhost kernel: amdgpu 0000:03:00.0: amdgpu: amdgpu_discovery is not set properly
Jan 16 23:33:32 localhost kernel: amdgpu 0000:03:00.0: amdgpu: failed to read ip discovery binary from file
Jan 16 23:33:32 localhost kernel: [drm:amdgpu_discovery_set_ip_blocks [amdgpu]] ERROR amdgpu_discovery_init failed
Jan 16 23:33:32 localhost kernel: amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init
Jan 16 23:33:32 localhost kernel: amdgpu 0000:03:00.0: amdgpu: amdgpu: finishing device.
Jan 16 23:33:32 localhost kernel: amdgpu: probe of 0000:03:00.0 failed with error -22
Jan 16 23:33:32 localhost kernel: checking generic (fc20000000 300000) vs hw (fc20000000 10000000)
Jan 16 23:33:32 localhost kernel: checking generic (fc20000000 300000) vs hw (fc20000000 10000000)
Jan 16 23:33:32 localhost kernel: fb0: switching to amdgpu from EFI VGA
Jan 16 23:33:32 localhost kernel: Console: switching to colour dummy device 80x25
Jan 16 23:33:32 localhost kernel: amdgpu 0000:07:00.0: vgaarb: deactivate vga console
Jan 16 23:33:32 localhost kernel: amdgpu 0000:07:00.0: enabling device (0006 -> 0007)
Jan 16 23:33:32 localhost kernel: [drm] initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1462:0x1316 0xC4).
Jan 16 23:33:32 localhost kernel: amdgpu 0000:07:00.0: amdgpu: Trusted Memory Zone (TMZ) feature enabled
Jan 16 23:33:32 localhost kernel: [drm] register mmio base: 0xFC900000
Jan 16 23:33:32 localhost kernel: [drm] register mmio size: 524288
Jan 16 23:33:32 localhost kernel: [drm] add ip block number 0 <soc15_common>
Jan 16 23:33:32 localhost kernel: [drm] add ip block number 1 <gmc_v9_0>
Jan 16 23:33:32 localhost kernel: [drm] add ip block number 2 <vega10_ih>
Jan 16 23:33:32 localhost kernel: [drm] add ip block number 3
Jan 16 23:33:32 localhost kernel: [drm] add ip block number 4
Jan 16 23:33:32 localhost kernel: [drm] add ip block number 5
Jan 16 23:33:32 localhost kernel: [drm] add ip block number 6 <gfx_v9_0>
Jan 16 23:33:32 localhost kernel: [drm] add ip block number 7 <sdma_v4_0>
Jan 16 23:33:32 localhost kernel: [drm] add ip block number 8 <vcn_v2_0>
Jan 16 23:33:32 localhost kernel: [drm] add ip block number 9 <jpeg_v2_0>
Jan 16 23:33:32 localhost kernel: amdgpu 0000:07:00.0: amdgpu: Fetched VBIOS from VFCT
Jan 16 23:33:32 localhost kernel: amdgpu: ATOM BIOS: 113-CEZANNE-018
Jan 16 23:33:32 localhost kernel: [drm] VCN decode is enabled in VM mode
Jan 16 23:33:32 localhost kernel: [drm] VCN encode is enabled in VM mode
Jan 16 23:33:32 localhost kernel: [drm] JPEG decode is enabled in VM mode
Jan 16 23:33:32 localhost kernel: amdgpu 0000:07:00.0: amdgpu: PCIE atomic ops is not supported
Jan 16 23:33:32 localhost kernel: amdgpu 0000:07:00.0: amdgpu: MODE2 reset
Jan 16 23:33:32 localhost kernel: [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
Jan 16 23:33:32 localhost kernel: amdgpu 0000:07:00.0: amdgpu: VRAM: 512M 0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
Jan 16 23:33:32 localhost kernel: amdgpu 0000:07:00.0: amdgpu: GART: 1024M 0x0000000000000000 - 0x000000003FFFFFFF
Jan 16 23:33:32 localhost kernel: amdgpu 0000:07:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
Jan 16 23:33:32 localhost kernel: [drm] Detected VRAM RAM=512M, BAR=512M
Jan 16 23:33:32 localhost kernel: [drm] RAM width 128bits DDR4
Jan 16 23:33:32 localhost kernel: [drm] amdgpu: 512M of VRAM memory ready
Jan 16 23:33:32 localhost kernel: [drm] amdgpu: 3072M of GTT memory ready.
Jan 16 23:33:32 localhost kernel: [drm] GART: num cpu pages 262144, num gpu pages 262144
Jan 16 23:33:32 localhost kernel: [drm] PCIE GART of 1024M enabled.
Jan 16 23:33:32 localhost kernel: [drm] PTB located at 0x000000F400900000
Jan 16 23:33:32 localhost kernel: amdgpu 0000:07:00.0: amdgpu: PSP runtime database doesn't exist

This is the dedicated GPU:

$ inxi -G
Graphics:
Device-1: AMD Navi 22 [Radeon RX 6700/6700 XT/6750 XT / 6800M] driver: N/A
Device-2: AMD Cezanne driver: amdgpu v: kernel
Device-3: Acer HD Webcam type: USB driver: uvcvideo
Display: wayland server: X.Org v: 1.21.1.3 with: Xwayland v: 21.1.3
compositor: gnome-shell v: 40.10 driver: X: loaded: modesetting
unloaded: fbdev dri: radeonsi gpu: amdgpu resolution: 1920x1080~240Hz
API: OpenGL v: 4.6 Mesa 22.1.5 renderer: AMD RENOIR (LLVM 14.0.6 DRM 3.42
5.14.0-70.30.1.el9_0.x86_64)
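
With two amdgpu devices in the log (0000:03:00.0 and 0000:07:00.0), it helps to see which one actually fails. Below is a minimal sketch, not a definitive tool, that groups amdgpu failure messages by PCI address. The function name `amdgpu_failures` and the keyword list are my own assumptions for illustration; adjust them to whatever your log shows.

```python
import re
from collections import defaultdict

# PCI address as printed by the kernel, e.g. "0000:03:00.0".
PCI_ADDR = re.compile(r"\b([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\b")

# Keywords picked for illustration only (an assumption, not an official list).
FAILURE_WORDS = ("failed", "fatal", "invalid", "error")


def amdgpu_failures(lines):
    """Group amdgpu failure messages by the PCI address they mention."""
    failures = defaultdict(list)
    for line in lines:
        if "amdgpu" not in line:
            continue
        # Drop the "date host kernel: " prefix if it is present.
        message = line.split("kernel: ", 1)[-1].strip()
        match = PCI_ADDR.search(message)
        if match and any(word in message.lower() for word in FAILURE_WORDS):
            failures[match.group(1)].append(message)
    return dict(failures)


if __name__ == "__main__":
    # Three lines copied from the log above.
    sample = [
        "Jan 16 23:33:32 localhost kernel: amdgpu 0000:03:00.0: amdgpu: Fatal error during GPU init",
        "Jan 16 23:33:32 localhost kernel: amdgpu: probe of 0000:03:00.0 failed with error -22",
        "Jan 16 23:33:32 localhost kernel: amdgpu 0000:07:00.0: amdgpu: PCIE atomic ops is not supported",
    ]
    for address, messages in amdgpu_failures(sample).items():
        print(address)
        for message in messages:
            print("   ", message)
```

To use it on a real boot, the lines could come from `journalctl -k -b` saved to a file and passed in as a list. Messages that carry no PCI address, such as the `[drm:amdgpu_discovery_set_ip_blocks [amdgpu]]` line, are skipped by this sketch.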

lol, have you considered an nvidia gpu, this amd one seems to be giving you a lot of hassle! :laughing:

I need it for simulation software development/adaptation. AMD GPUs have a lot more fp64 performance, which is what I need. Unfortunately, there are only two notebook models with an AMD GPU with more than 8 GB of RAM, and I may have picked the wrong one considering the number of BIOS updates and issues like this one.

I made quite a bit of progress on the software development side in the past couple of days, so that's going in the right direction.

[The Fix]

MSI Customer Support provided me with a download link to a clean AMD Windows driver installer (they only support Windows, but fortunately I have a dual-boot setup). The installer went straight into a “repair” mode and must have fixed the issue on the mainboard.

After 3-4 Linux boots, the notebook was back to normal.

The problem is back on both Windows 10 and Linux, and is now permanent.

The notebook is back in the hands of MSI Customer Support.