https://bugs.freedesktop.org/show_bug.cgi?id=108854
--- Comment #17 from Alex Deucher alexdeucher@gmail.com ---
(In reply to Tom Seewald from comment #16)
> But in general shouldn't the kernel driver (ideally) be able to handle mesa passing malformed/bad commands rather than freezing the device (step 3 to 4)? I understand not every case can be covered, and I also understand that GPU resets need to be supported in user space for seamless recovery, but shouldn't the driver "unstick" itself enough so the computer can be rebooted normally?
This is generally not bad data from mesa per se. There's not really a good way to validate whether every combination of state sent to the GPU is valid. There are hundreds of registers and state buffers that the GPU uses to process the 3D pipeline, and it's impossible to test every combination of state, dispatch, and ordering. The hangs are generally due to a deadlock in the hw caused by a bad interaction of states set by the application. E.g., some hw block is waiting on a signal from another hw block, which never gets sent because the user sent another state update that stops that signal.
The GPU reset should generally be able to recover the GPU, but in some cases you may end up with a deadlock in sw somewhere in the kernel.
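
To make the hw deadlock described above a bit more concrete, here is a toy model in Python. It is only a sketch and has nothing to do with the real hardware or the amdgpu code: the two "blocks", the simulate() function, and the state_update_masks_signal flag are all hypothetical names made up for illustration. Block B waits on a signal from block A, and a later state update can stop A from ever sending it. The timeout stands in for some kind of hang detection; the real hw block would simply wait forever.

```python
import threading


def simulate(state_update_masks_signal: bool, timeout_s: float = 0.2) -> bool:
    """Toy model: return True if block B finishes, False if it hangs.

    All names here are hypothetical; this does not reflect real GPU blocks.
    """
    signal = threading.Event()  # the signal block A is supposed to send to block B
    done = threading.Event()    # set once block B has made forward progress

    def block_a() -> None:
        # Assumption: a later state update suppresses the signal entirely.
        if not state_update_masks_signal:
            signal.set()

    def block_b() -> None:
        # Real hw would wait indefinitely; the timeout models hang detection.
        if signal.wait(timeout_s):
            done.set()

    threads = [threading.Thread(target=block_a), threading.Thread(target=block_b)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done.is_set()


if __name__ == "__main__":
    print("normal state:", simulate(False))
    print("conflicting state update:", simulate(True))
```

Neither block is doing anything wrong on its own in this model. The hang only appears for one particular combination of state and ordering, which is why it is hard to validate against up front.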