I helped fix sleep-wake hangs on Linux with AMD GPUs

(nyanpasu64.gitlab.io)

Comments

jorvi 17 February 2025
> Through some digging, I found that when a desktop enters S3 sleep, the system cuts power to PCIe GPUs

I am not sure how correct this assumption is. S3 is supposed to cut power to everything but RAM, but for example Gigabyte Aorus motherboards are notorious for an NVMe SSD sleep bug that randomly prevents the system from properly sleeping or waking.

This is fixed by adding the following udev rule:

  # Generic PCIe fix for sleep bugs by preventing wakeup from any PCIe port
  ACTION=="add", SUBSYSTEM=="pci", DRIVER=="pcieport", ATTR{power/wakeup}="disabled"
   
or more targeted:

  # Gigabyte sleep fix by preventing wakeup from problematic PCIe port, depends on motherboard model
  ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x8086", ATTR{device}=="0x43bc", ATTR{power/wakeup}="disabled"
   
You can find any glitched PCIe wakeup device with:

  1. cat /proc/acpi/wakeup (you'll have to trial and error your way through the wakeup devices if it isn't immediately clear)
  2. cat /sys/class/pci_bus/*/*/yourWakeupDevicePci/uevent | grep PCI_ID
  3. split the ID at the colon and prepend "0x" to each half (vendor first, then device)

You also have the option of:

  udevadm info --attribute-walk /dev/whatever
  
but for that you need to know some basic identifier of your glitchy device.
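
A rough sketch of the three numbered steps in shell, assuming the usual `/proc/acpi/wakeup` column layout (device, S-state, status, sysfs node) and a `PCI_ID=8086:43BC`-style value in `uevent`. The function names are made up for illustration, and the lowercasing is there because sysfs reports `vendor` and `device` in lowercase hex:

```shell
# List the wakeup sources that are currently enabled.
# Reads /proc/acpi/wakeup-style text on stdin and prints "NAME SYSFS_NODE".
list_enabled_wakeup() {
    awk 'NR > 1 && $3 == "*enabled" { print $1, ($4 == "" ? "-" : $4) }'
}

# Turn a PCI_ID value such as "8086:43BC" into udev match keys.
pci_id_to_udev_match() {
    vendor=$(printf '%s' "${1%%:*}" | tr 'A-F' 'a-f')
    device=$(printf '%s' "${1##*:}" | tr 'A-F' 'a-f')
    printf 'ATTR{vendor}=="0x%s", ATTR{device}=="0x%s"\n' "$vendor" "$device"
}
```

On a real machine you would run `list_enabled_wakeup < /proc/acpi/wakeup` and then feed the `PCI_ID` of the suspect device to `pci_id_to_udev_match`.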

Or if you want to shellscript it (less reliable than letting udev do it for you and needs to be done via systemd service file or another automation):

  # Gigabyte sleep fix, port depends on mobo model
  # Writing a device name to /proc/acpi/wakeup toggles it, so only write if it is currently enabled
  /bin/bash -c 'if grep RP05 /proc/acpi/wakeup | grep -q enabled; then echo RP05 > /proc/acpi/wakeup; fi'

Yes I really hate this (and other) Linux sleep issues.
lorenzbrun 17 February 2025
Author of memreserver (one of the mentioned userspace workarounds) here. I've debugged this a few years back, only public comment I can quickly find is [1]. I also remember some mailing list discussions, but it basically came down to the issue that Linux didn't have staggered suspend hooks that reliably ran before disks and parts of the memory subsystem were frozen. Apparently this is now possible. Sadly the Freedesktop Gitlab doesn't seem indexable so this knowledge seems to have gotten lost.

[1] https://gitlab.freedesktop.org/drm/amd/-/issues/2125#note_17...

sabujp 17 February 2025
This is amazing work! If folks have ever wondered why suspend is so difficult to get working on linux and why debugging it is equally difficult, this is a single datapoint with lots of information about all the things that can go wrong. Even now I have a thinkpad P1G4 where the fans won't turn off automatically unless I turn them off before going into suspend. Recently I also started having crackling issues with my bluetooth headphones after resuming from suspend and had to disable node suspension there also (https://wiki.archlinux.org/title/PipeWire#Noticeable_audio_d...).
Gormo 17 February 2025
My sincere personal thanks for this. My main laptop is a Ryzen-based ThinkPad running Linux that I suspend and hibernate regularly, and I sporadically encounter this issue. Looking forward to 6.14!
mkesper 18 February 2025
> Why was dm->cached_state storing -12 instead of a pointer? Most likely this happened because earlier during suspend, dm_suspend() assigned dm.cached_state = drm_atomic_helper_suspend(adev_to_drm(adev)). The callee drm_atomic_helper_suspend() could return either a valid pointer, or ERR_PTR(err) which encoded errors as negative pointers. But the caller function assigned the return value directly to a pointer which gets dereferenced upon resume, instead of testing the return value for an error.

One more point for Rust in the kernel. This just can't happen if you're required to handle a Result type.

jph 17 February 2025
Your work will help me on a Framework AMD laptop with the GPU extension and dual boot Linux/Windows. May I donate to you or to your favorite charity? My contact info is in my profile.
dekhn 17 February 2025
I used to think that naming things, cache invalidation, and off-by-one errors were the 2 biggest problems in CS, but then I learned about the "sleep/wake" problem and realized it's NP-complete.
jchw 17 February 2025
Memory management and specifically OOM conditions remain an unbelievably painful nightmare on Linux. It's not like I run into these issues constantly, but I've definitely tried to debug issues like these (unsuccessfully). Ultimately if I OOM a machine I usually wind up installing more RAM, which is wasteful/expensive, but it's pretty clear that handling OOM conditions gracefully is going to be a hard problem for Linux to solve into the future.

This is really great work and will serve as a reference point for debugging similar issues in the future. Pretty happy about systemd's debug-shell feature, I had no idea that existed. I don't think my X670E Steel Legend board has a serial header anywhere on it, though. How do modern built-in serial ports work, anyway? Are they attached off of the chipset PCIe lanes?

Something that's also very useful when trying to dive into the Linux kernel is that there's a bunch of great talks discussing Linux kernel subsystems from conferences like FOSDEM and Linux Plumbers Conference which you can usually find recordings of online. For example, there's this one for TTM, the memory subsystem that most of the desktop GPU DRM drivers use:

https://www.youtube.com/watch?v=MG7_tUNKSt0

dralley 17 February 2025
Fantastic news. AMD's linux graphics drivers have mostly worked great for me but this has been the one exception that I've hit multiple times.
dado3212 18 February 2025
> So I did the natural thing: I saved and extracted the amdgpu.ko kernel module, decompiled it in Ghidra, and mapped the location of the crash in dm_resume to the corresponding lines in the kernel source.

This is always my favorite part of debugging.

Daunk 17 February 2025
For all the years I've been using Linux, I've always had some kind of sleep issues. I've used Intel, AMD, ATI, and NVIDIA hardware across countless distros and setups, yet nothing seems to make a difference, there's always something that doesn’t work properly with sleep or hibernation.

Honestly, it's one of the main issues I wish the Linux community would take a closer look at and finally fix!

nyanpasu64 17 February 2025
Update: I upgraded to an Intel Arc B570 GPU... and ran into the exact same problem on an independent driver: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/4288
whatever1 17 February 2025
Apple became a trillion dollar company by mastering sleeping / waking up of electronic devices.

Why does nobody else see this?

gmokki 19 February 2025
Has anyone tried the AMD suspend/resume patches from Alibaba? https://lists.freedesktop.org/archives/amd-gfx/2025-January/...

"We have tried to solve these issues case by case, but found that may not be the right way. Especially about the unbalanced irq reference count, there will be new issues appear once we fixed the current known issues. After analyzing related source code, we found that there may be some fundamental implementation flaws behind these resource tracking issues.

So we try to fix those issues by two enhancements/refinements to current device management state machines."

voytec 17 February 2025
I've had zero problems with S3 wake/sleep on AMD ThinkPad with FreeBSD for years. And FreeBSD uses AMD drivers pulled from Linux. How is this still a problem on Linux?

    hw.acpi.lid_switch_state=S3
zrm 17 February 2025
> To make room for VRAM, memreserver allocates system RAM based on used VRAM plus 1 gigabyte, then fills the RAM with 0xFF bytes and mlocks the memory (so none of it is swapped out).

That seems like a bit of trouble if you have 16GB of system RAM and a 24GB GPU.
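
To put numbers on that, here is a back-of-the-envelope sketch of the sizing rule as quoted (used VRAM plus 1 GiB). This is not memreserver's actual code; the function names are hypothetical and the worst case assumes the whole 24 GiB of VRAM is in use:

```shell
GIB=$((1024 * 1024 * 1024))

# Bytes the quoted rule would reserve for a given amount of used VRAM (bytes).
reserve_bytes() {
    echo $(( $1 + GIB ))
}

# Succeeds (exit 0) if that reservation fits into the given RAM size (bytes).
reservation_fits() {
    [ "$(reserve_bytes "$1")" -le "$2" ]
}

# Example: a full 24 GiB card against 16 GiB of system RAM.
#   reservation_fits $((24 * GIB)) $((16 * GIB)) || echo "reservation exceeds RAM"
```

With only a couple of GiB of VRAM in use the reservation fits easily, so the trouble only shows up when VRAM usage approaches or exceeds free system RAM.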

sidkshatriya 17 February 2025
TL;DR:

During suspend, for graphics cards, GPU VRAM needs to be transferred to system RAM.

However, during high memory usage scenarios the VRAM + RAM usage could exceed system memory. Ordinarily, system swap would come into play and handle the temporary out-of-memory condition. However, swap had already been deactivated by the time the AMD card was suspended, causing all sorts of problems.

The fix was to ask the GPU to evict its VRAM to system RAM via a hook ("suspend prepare") that runs before swap is deactivated in the Linux kernel.
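
If you want to check whether your own machine is in the risky zone before suspending, a minimal sketch could look like the following. It assumes the amdgpu sysfs file `mem_info_vram_used` (in bytes) and the `MemAvailable` field of `/proc/meminfo` (in KiB); `card0` and the function names are assumptions for illustration:

```shell
# Print the value in KiB of one field from /proc/meminfo-style text on stdin.
meminfo_kib() {
    awk -v key="$1:" '$1 == key { print $2 }'
}

# Succeeds if used VRAM (bytes) would fit into available RAM (KiB) without swap.
vram_fits_in_ram() {
    vram_kib=$(( ($1 + 1023) / 1024 ))
    [ "$vram_kib" -le "$2" ]
}

# Example on a live system (card0 is an assumption, adjust for your GPU):
#   used=$(cat /sys/class/drm/card0/device/mem_info_vram_used)
#   avail=$(meminfo_kib MemAvailable < /proc/meminfo)
#   vram_fits_in_ram "$used" "$avail" || echo "eviction would need swap"
```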

badsectoracula 17 February 2025
I wonder if this will help a similar problem i have with my AMD GPU: very often when i wake/resume the PC, the output is almost frozen. "Almost" because it actually isn't frozen, if there is any output/animation/etc going on it plays fine, but once i try to move the mouse it freezes and everything updates at a single frame per couple of seconds - sometimes freezing completely. I can usually Ctrl+Alt+Fn to another virtual desktop in text mode and, if that is possible (i.e. the computer hasn't completely frozen, though sometimes it takes about a minute to switch), i can Ctrl+Alt+Fn back and everything works fine. Dmesg has a ton of spam messages from amdgpu after that.

AFAICT (from the behavior) something isn't properly saved/restored and communicating with the GPU (the mouse cursor is a hardware cursor thus needs to send commands to the GPU to update its position) causes some sort of issue. Switching to another virtual terminal that is running in text mode probably forces the driver to reset its graphics state. Of course that is just my assumption based on what i see going on.

Weirdly enough this only happens after i replaced my RX 5700 XT with a RX 7900 XTX so it might be something GPU (or GPU arch) specific.

I've been considering plugging in my laptop to see if there is something i can figure out (GPU aside the PC is usable, but i guess if this is a kernel-side thing i'd need a second computer connected to it to debug it), but as this isn't something i've tried before (though i know someone who has and said it isn't anything special) my annoyance still hasn't gone over the "i need to get to the bottom of this" threshold :-P.

It'd be nice if 6.14 fixes the issue, though i am not sure as i rarely have more than 1/3rd of the system RAM (32GB) in use and VRAM (24GB) barely goes above 1-2GB of use outside games. But this post might be helpful in diagnosing the issue next time it happens :-).

Asmod4n 17 February 2025
> On my laptop, I opened a terminal and ran sudo minicom --device /dev/ttyUSB0 --baudrate 115200 to monitor the computer over serial. In addition to saving logs

You can just use screen for that and have a working terminal with color support et al.
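
Something like this, as a small sketch: the device and baud rate are the ones from the quoted minicom command, `-L` asks screen to write a log file (so you still get saved logs), and the wrapper function name is made up:

```shell
# Open a serial console with GNU screen, logging the session to a file.
# Defaults mirror the minicom invocation quoted above.
serial_console() {
    dev=${1:-/dev/ttyUSB0}
    baud=${2:-115200}
    if [ ! -c "$dev" ]; then
        echo "serial_console: $dev is not a character device" >&2
        return 1
    fi
    screen -L "$dev" "$baud"
}
```

Depending on permissions on the tty device you may still need sudo, as with minicom.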

fowl2 18 February 2025
I guess "hibernating" (writing VRAM to swap) works better than expecting userspace to gracefully handle device resets. One linear read vs. a thundering herd of processes re-initialising, decompressing, etc. should be more predictable/reliable at least.

I do wonder however how much VRAM is "volatile" - ie. framebuffers - and could just be thrown away. And web browsers seem to handle GPU resets just fine, so maybe they could opt-in?

yellow_lead 17 February 2025
I have an Nvidia GPU and a sporadic crash (black screen) with no logs on Linux. I suspect it's a driver issue too. Going to try some of these tips to enable the debug shell, but I'm not sure if they'll be effective.

Anyone have other tips for this type of thing? I did try upgrading drivers/kernels already

raffraffraff 17 February 2025
> I dug a PS/2 keyboard out of a dusty closet and plugged it into my system (only safe when the PC is off!)

Lol, I remember.

rakejake 18 February 2025
I have a problem similar to what OP faced but with an NVIDIA GPU (RTX 4080). When I wake the computer from suspend, the computer will usually wake up, show me the timestamp changing on the lock screen. But sometimes randomly, the timestamp will not change and the screen will freeze on the old timestamp. After this, it is either REISUB or hard reset.

@nyanpasu64, do you think enabling nvidia-suspend/nvidia-resume services will do the trick? I didn't go through the codefix in the above post in detail, but it looks like the fix is to raise a relevant notification to all listeners in the suspend_prepare() method for amdgpus, not nvidia.

Running Ubuntu 24.10.

0xTJ 17 February 2025
Very excited to see 6.14 hit Arch! Hanging around sleep (with symptoms that sound like what's described in the write-up) has been the one persistent occasional issue, so I'm hoping that this fixes what I'm seeing.
Voultapher 17 February 2025
For a couple of Linux versions now (since something around 6.10, IIRC), my Nvidia system has been waking into a black screen, but with a cursor, and alt shells work. Specifically KDE Plasma seems bugged here, but they say it's a kernel issue; at least there are dozens of separate issues open about this kind of bug, and it's rather annoying that I can't put my machine to sleep.

If anyone has ideas what could fix this I'd really appreciate it. The machine is dual booted with Windows, and there sleep works without issue, so it's clearly possible, as it was for years before that on Linux as well.

devilsdata 17 February 2025
Props to you for this.

I am not clever/experienced enough to solve my own issues with sleep-wake hangs on Linux at work.

I’ve instead opted to work around it. I use Firefox, Obsidian, and Tmux with Neovim for all my work. Tmux has resurrect and a plugin that saves my entire terminal state automatically every few minutes.

I also have a command that automatically sets up my i3wm/regolith windows exactly how I like.

Basically if I run `wkup`, I’m exactly where I was, down to the line of code open in Neovim, Firefox tab, and dev server or cargo running.

empiricus 17 February 2025
I notice I am confused how the code needed for the GPU to sleep was implemented. It was failing when simply saving/copying gigabytes of flat memory, but on the other hand it was able to recover successfully the previous complex hw and sw state and data structures?! I guess it probably makes sense if after waking up that data is actually dropped and the gpu and ui is reinitialized and redrawn.
asmor 17 February 2025
Some AMD integrated GPUs are surprisingly fragile with this. I have a GPD Win Max 2 8840U (a "concept car" handheld laptop hybrid) and when I got it last year, it would fail to wake from suspend and hibernate about half the time in Windows, with Linux actually being more reliable (but also not perfect), and only this year did an AMD GPU driver fix this.
schainks 17 February 2025
Ugh the first time I was debugging problems like this, it was in production and for some IoT hardware we had deployed in the field.

Fortunately, although that's not the focus of this article, system hibernate is WAY more reliable than system sleep in Linux due to the way it works.

Use system hibernate if your SSD is fast enough. It works better than system sleep and isn't a ton slower.

stycznik 18 February 2025
It seems like the GPU should grow the capability to keep its VRAM intact through suspend, it's already complex[0] enough it's basically another computer attached to your computer anyway..

[0] https://github.com/jhuber6/doomgeneric

Namidairo 18 February 2025
For a second I thought this was referring to the other reset bug on Polaris, Vega and Navi. (These apparently have broken Function Level Reset sequences, requiring quite specific reset code as a separate module or a system reboot to bring back to a working state.)
podiki 17 February 2025
I didn't have issues with sleep/wake until somewhat recently (not sure when) and found this post. Grabbing the patch from the commit referenced and using it on top of 6.12 and 6.13 kernels seems to have fixed it for me too (for the past couple of weeks and counting).

Great work!

isodude 18 February 2025
There should exist something like memtest86, but for S3 and S0, that you can run on the laptop to identify hardware that does not suspend properly.
progforlyfe 17 February 2025
Extremely high level genius stuff -- nice work and thank you for your efforts!
sim7c00 18 February 2025
very nice and detailed writeup, so many interesting stuff in here really. reading about systemd (bugreport) always hurts my brain (bugreport) but all the low level interactions between OS and drivers /firmware regarding these states and what kind of issues can happen between them, how to find out whats happening better, verynice :). many thanks! for the year long hunt and the excellent writeup.
Bobaso 17 February 2025
My thinkpad E14 gen5 intel + ubuntu 24.04 is my first ever laptop where sleep works exactly as intended, with ~1% battery waste per hour.
deepsun 17 February 2025
I have to unplug Logitech wireless receivers for mouse/keyboard, otherwise desktop wakes up immediately.
igtztorrero 18 February 2025
Thanks, this happens to me; I've avoided sleep functions.
mistyvales 17 February 2025
Highly relevant! Thanks for this.
thrdbndndn 18 February 2025
what is 'agd5f/linux'?
kkarpkkarp 17 February 2025
omg, thank you
tombot 17 February 2025
> This took over a year of debugging and multiple attempts by many people to fix.

2025 finally Linux on the desktop

tgsovlerkhgsel 17 February 2025
AMD GPU linux drivers are (were?) a nightmare in general, and this includes iGPUs in their processors. Sadly, I don't have the impression that AMD is actively working on fixing this.

Just to make sure I'm not griping over something long fixed, I took a quick look and instantly found someone with a very similar issue to the one I ran into happening on a semi-recent kernel: https://community.amd.com/t5/pc-drivers-software/linux-amdgp...

It looks to me that if you want to have a working computer under Linux, it's worth the extra cost to avoid AMD.