On Thu, 17 Dec 2020 11:03:20 +0100 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> I think we're tripping over the might_sleep() all the mutexes have, and that's not as good as yours, but good enough to catch a missing rcu_read_unlock(). That's kinda why I'm baffled, since like almost every 2nd function in the backtrace grabbed a mutex and it was all fine until the very last.
> I think it would be really nice if the rcu checks could retain (in debugging only) the backtrace of the outermost rcu_read_lock, so we could print that when something goes wrong in cases where it's leaked. For normal locks lockdep does that already (well not full backtrace I think, just the function that acquired the lock, but that's often enough). I guess that doesn't exist yet?
> Also yes without reproducer this is kinda tough nut to crack.
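To make the idea of remembering the outermost rcu_read_lock() concrete, here is a rough userspace sketch. All of the dbg_* names are made up for illustration and are not existing kernel interfaces; it records only the calling function and line, not a full backtrace.

```c
/* Userspace illustration only: the dbg_* names are hypothetical. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int dbg_rcu_depth;               /* read-side nesting count */
static const char *dbg_rcu_outer_func;  /* who took the outermost lock */
static int dbg_rcu_outer_line;

static void dbg_rcu_read_lock_at(const char *func, int line)
{
	/* Only the 0 -> 1 transition is the outermost acquisition. */
	if (dbg_rcu_depth++ == 0) {
		dbg_rcu_outer_func = func;
		dbg_rcu_outer_line = line;
	}
}

static void dbg_rcu_read_unlock(void)
{
	/* Forget the recorded site once the section is fully closed. */
	if (--dbg_rcu_depth == 0)
		dbg_rcu_outer_func = NULL;
}

#define dbg_rcu_read_lock() dbg_rcu_read_lock_at(__func__, __LINE__)

/* Function holding the outermost read lock, or NULL if none is held. */
static const char *dbg_rcu_leak_owner(void)
{
	return dbg_rcu_depth > 0 ? dbg_rcu_outer_func : NULL;
}

static void balanced_user(void)
{
	dbg_rcu_read_lock();
	dbg_rcu_read_unlock();
}

static void leaky_user(void)
{
	dbg_rcu_read_lock();
	/* missing dbg_rcu_read_unlock() */
}
```

A sleep-in-atomic check could then print dbg_rcu_leak_owner() (and dbg_rcu_outer_line) next to the usual backtrace, which would point at the function that leaked the lock rather than at the mutex that happened to notice it.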
I'm looking at drm_client_modeset_commit_atomic(), where it triggered after the "retry:" label. Getting there takes a bit of goto spaghetti: an -EDEADLK is detected, there's a goto backoff, which in turn does a goto retry, and then the next mutex taken is the one that triggers the bug.
As this is hard to reproduce, but reproducible by a fuzzer, I'm guessing there's some error return path somewhere in there that doesn't release an rcu_read_lock().
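The kind of bug I have in mind looks roughly like the userspace sketch below. This is not the DRM code: the fake_* helpers, leaky_lookup() and commit_with_retry() are invented stand-ins that only mimic the shape of the retry/backoff flow and of a might_sleep() style check.

```c
/* Userspace sketch only: none of these are the real kernel APIs. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static int fake_rcu_depth;     /* stands in for the RCU read-side nesting count */
static int sleep_in_rcu_hits;  /* times a "mutex" was taken inside a read section */

static void fake_rcu_read_lock(void)   { fake_rcu_depth++; }
static void fake_rcu_read_unlock(void) { fake_rcu_depth--; }

/* Stand-in for the might_sleep() check done when taking a mutex. */
static void fake_mutex_lock(void)
{
	if (fake_rcu_depth > 0)
		sleep_in_rcu_hits++;
}

/* Hypothetical helper with the suspected bug: the error return skips the unlock. */
static int leaky_lookup(bool fail)
{
	fake_rcu_read_lock();
	if (fail)
		return -EDEADLK;	/* BUG: leaks the read lock */
	fake_rcu_read_unlock();
	return 0;
}

/* Same shape as the retry/backoff flow: fail once, back off, retry. */
static int commit_with_retry(void)
{
	bool fail = true;
	int ret;

retry:
	fake_mutex_lock();	/* the second pass trips here if the lock leaked */
	ret = leaky_lookup(fail);
	if (ret == -EDEADLK)
		goto backoff;
	return ret;

backoff:
	fail = false;
	goto retry;
}
```

The first pass through retry: is clean, and the complaint only shows up at the first mutex taken after the backoff, far away from the error path that actually leaked the lock.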
-- Steve