Re: [PATCH v2 0/3] drm: Add panic handling

25 Mar 2019

On Mon, Mar 18, 2019 at 12:06:13AM +0100, Ahmed S. Darwish wrote:
...
=> Now that the dust has settled, here's a summary of this huge
   50-email thread (thanks Daniel, Noralf, John, everyone!).
=> Parts of this document are a direct rewording of Daniel's replies,
   so I took the liberty of adding a Co-developed-by tag here..
=> This is only a summary, and _not_ an official patch submission.
   It's now Show-me-the-code time ;-)
Subject: [PATCH] Documentation: gpu: Add initial DRM panic design
Co-developed-by: Daniel Vetter daniel@ffwll.ch
Signed-off-by: Ahmed S. Darwish darwish.07@gmail.com

Documentation/gpu/drm-panic-design.rst | 124 +++++++++++++++++++++++++
 1 file changed, 124 insertions(+)
 create mode 100644 Documentation/gpu/drm-panic-design.rst

diff --git a/Documentation/gpu/drm-panic-design.rst b/Documentation/gpu/drm-panic-design.rst
new file mode 100644
index 000000000000..ba306193f347
--- /dev/null
+++ b/Documentation/gpu/drm-panic-design.rst
@@ -0,0 +1,124 @@



+========================
+A DRM-based panic viewer
+========================



+The Linux Kernel currently contains the necessary plumbing for viewing
+a kernel panic log using DRM-based kmsg dumpers, even if the system is
+currently running a graphical session (e.g. wayland).



+.. _drm_panic_design:



+Implementation design
+=====================



+Code pathes running in a panic context face several constraints:



+1. Code must be fully synchronous: interrupts are disabled
+2. Dynamic memory allocations are not allowed
+3. Cannot acquire any mutexes: atomic context, due to 1.
+4. Cannot acquire any spin locks: potential spinning-forever risk
Maybe rephrase as:
3. No sleeping (there's other ways to sleep than just memory allocations
and acquiring a mutex)
4. No unconditional locking at all (there's more than spinlocks/mutexes).
E.g. one of the most important locks is drm_modest_lock, which is a
ww_mutex.
...



+For the *DRM* panic code, the extra conditions below apply:



+5. Code must only trylock relevant DRM subsystem locks
+6. If any trylock operation fails, the code *must not* go through
+7. All rendering is done on the GPU's "current display buffer" used

at the time of panic(): no HW programming is done at all.

Maybe let's clarify this a bit, since from the discussions it sounded like
at least amdgpu needs to touch a few bits (disable tiling, the indirect
vram write registers to access vram outside of the bar): "Only the least amount
of HW programming (preferrably none) is done, exceptions would be
disabling tiled/compressed scanout."
...
+8. The code must be non-intrusive, and *must not* affect any other

panic handling mechanism (netconsole, ramoops, kexec, etc.)

Hm, I'm not entirely clear on what you mean here. Maybe some more examples
of what would be a bad idea?
...



+Rationale follows.




+Spin locks
+----------



+Acquiring a spin lock in a panic() context is potentially lethal:
+the lock might've been already acquired, _permanently_, by another
+core that is now fully shut down through an IPI from the panic()-ing
+core.



+Moreover, at least on x86, the first target architecture for this
+work, the panic()-ing core wait by default for *a full second* until
+all other cores finish their non-preemptible regions and terminate.
+If that did not work out, it even tries more aggressively with NMIs.



+So if the other non panic()-ing cores was holding a DRM-related lock
+through spin_lock_irqsave() for more than one second, then it's a
+bug in the DRM layer code. Thus, the DRM panic viewer cannot do
+anything and should just exit.  [A]



+What if the non panic()-ing core was holding a DRM lock through
+barebone spin_lock()? Interrupts are enabled there, so the IPI will be
+handled, and thus that core will effectively die while the lock is
+*forever held*.  [B]




+Trylocking
+----------



+The DRM panic code always *tries* to acquire the *minimum relevant
+set* of DRM related locks, through the basic :c:func:`spin_trylock()`
+mechanism.



+From case [A] and case [B] above, if the trylock operation fails,
+there's no point in retrying it multiple times: the relevant locks
+are in a broken and unrecoverable state anyway.



+Moreover, The panic code cannot also just ignore the DRM locks and
+force its way through: a broken non-preemptible DRM region implies
+either invalid SW state (e.g. broken linked list pointers), or a GPU
+in an unknown HW state.



+A GPU in an unknown HW state is quite dangerous: it has access to the
+full system memory, and if poked incorrectly, there's a really good
+chance it can kill the entire machine.




+GPU hardware access
+-------------------



+In the current state, a full GPU reset, modesetting, or even disabling
+GPU planes, is not doable under a panic() context: it implies going
+through a potentially huge set of DRM call-chains that cannot be
+sanely verified against the :ref:`drm_panic_design` requirements
+(e.g. memory allocations, spinlocks, etc.).



+The current approach is simple: run the minimal amount of code
+necessary to draw pixels onto the current scanout buffers. Instead
+of disabling GPU planes, the biggest visible rectangle is just picked.



+*Usually* there should be a main window that is big enough to show the
+oops.




+CI testing
+----------



+One of the things that killed earlier linux DRM panic handling efforts,
+beside getting into deep DRM call-chains that cannot be verified, was
+that it couldn't be tested except with real oopses.



+The first set of bug reports was whack-a-molde kind of bugs where the
+oops displayed was caused by the DRM panic handler itself instead of
+the real oops causing the panic.



+Thus, the :ref:`drm_panic_design` requirements was created. Moreover:



Special hooks are added at the spin_lock() level to complain


loudly if a spin lock operation was tried under the DRM panic
context. This could be easily noticed/reported by CI testing.


*Non-destructive* testing of the DRM panic code, emulating a


real panic path context as much as possible (e.g. by disabling
irqs and enabling the spin lock hooks earlier mentioned), is
created. This is necessary for heaviling testing the DRM panic
code through `CI machines https://lwn.net/Articles/735468/`_.



+Supported drivers
+=================



+* Intel i915-compatible cards
Excellent summary, thanks for typing this up.
-Daniel
...
--
darwi
http://darwish.chasingpointers.com
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

    

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

Re: [PATCH v2 0/3] drm: Add panic handling