Re: [Nouveau] [PATCH 1/5] drm/nouveau: Prevent RPM callback recursion in suspend/resume paths

17 Jul 2018

On Tue, 2018-07-17 at 20:20 +0200, Lukas Wunner wrote:
...
On Tue, Jul 17, 2018 at 12:53:11PM -0400, Lyude Paul wrote:
...
On Tue, 2018-07-17 at 09:16 +0200, Lukas Wunner wrote:
...
On Mon, Jul 16, 2018 at 07:59:25PM -0400, Lyude Paul wrote:
...
In order to fix all of the spots that need to have runtime PM get/puts()
added, we need to ensure that it's possible for us to call
pm_runtime_get/put() in any context, regardless of how deep, since
almost all of the spots that are currently missing refs can potentially
get called in the runtime suspend/resume path. Otherwise, we'll try to
resume the GPU as we're trying to resume the GPU (and vice-versa) and
cause the kernel to deadlock.
With this, it should be safe to call the pm runtime functions in any
context in nouveau with one condition: any point in the driver that
calls pm_runtime_get*() cannot hold any locks owned by nouveau that
would be acquired anywhere inside nouveau_pmops_runtime_resume().
This includes modesetting locks, i2c bus locks, etc.
[snip]
...

--- a/drivers/gpu/drm/nouveau/nouveau_drm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_drm.c
@@ -835,6 +835,8 @@ nouveau_pmops_runtime_suspend(struct device *dev)
 		return -EBUSY;
 	}
 
+	dev->power.disable_depth++;
+

Anyway, if I understand the commit message correctly, you're hitting a
pm_runtime_get_sync() in a code path that itself is called during a
pm_runtime_get_sync().  Could you include stack traces in the commit
message?  My gut feeling is that this patch masks a deeper issue,
e.g. if the runtime_resume code path does in fact directly poll outputs,
that would seem wrong.  Runtime resume should merely make the card
accessible, i.e. reinstate power if necessary, put into PCI_D0,
restore registers, etc.  Output polling should be scheduled
asynchronously.
So: the reason that patch was added was mainly for the patches later in the
series that add guards around the i2c bus and aux bus, since both of those
require that the device be awake for it to work. Currently, the spot where
it would recurse is:
Okay, the PCI device is suspending and the nvkm_i2c_aux_acquire()
wants it in resumed state, so is waiting forever for the device to
runtime suspend in order to resume it again immediately afterwards.
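The wait described above can be modelled in plain userspace C. This is only a sketch: every name in it (model_status, model_outcome, model_sync_resume) is hypothetical and nothing here is kernel API. It assumes, as the paragraph says, that a synchronous resume request issued while the device is suspending has to wait for the suspend to finish first.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model types, not the kernel's enum rpm_status. */
enum model_status { MODEL_ACTIVE, MODEL_SUSPENDING, MODEL_SUSPENDED };

enum model_outcome {
	MODEL_NOTHING_TO_DO,     /* device already awake */
	MODEL_RESUME_NOW,        /* device suspended, resume can start */
	MODEL_WAIT_FOR_SUSPEND,  /* someone else is suspending, wait for them */
	MODEL_SELF_DEADLOCK,     /* we are the suspend path, waiting on ourselves */
};

/*
 * What a synchronous "get" has to do, given the current status and
 * whether the caller is itself running inside the suspend callback.
 */
static enum model_outcome model_sync_resume(enum model_status status,
					    bool caller_is_suspend_callback)
{
	assert(status >= MODEL_ACTIVE && status <= MODEL_SUSPENDED);

	switch (status) {
	case MODEL_ACTIVE:
		return MODEL_NOTHING_TO_DO;
	case MODEL_SUSPENDED:
		return MODEL_RESUME_NOW;
	case MODEL_SUSPENDING:
		/* The suspend can never finish if it is blocked on this call. */
		return caller_is_suspend_callback ? MODEL_SELF_DEADLOCK
						  : MODEL_WAIT_FOR_SUSPEND;
	}
	return MODEL_NOTHING_TO_DO;
}
```

The last case is the one hit in the trace: the caller is part of the suspend it is waiting for.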
The deadlock in the stack trace you've posted could be resolved using
the technique I used in d61a5c106351 by adding the following to
include/linux/pm_runtime.h:
static inline bool pm_runtime_status_suspending(struct device *dev)
{
	return dev->power.runtime_status == RPM_SUSPENDING;
}
static inline bool is_pm_work(struct device *dev)
{
	struct work_struct *work = current_work();

	/* dev->power.work is a struct work_struct, so compare addresses */
	return work && work == &dev->power.work;
}
Then adding this to nvkm_i2c_aux_acquire():
	struct device *dev = pad->i2c->subdev.device->dev;

	if (!(is_pm_work(dev) && pm_runtime_status_suspending(dev))) {
		ret = pm_runtime_get_sync(dev);
		if (ret < 0 && ret != -EACCES)
			return ret;
	}
But here's the catch:  This only works for an *async* runtime suspend.
It doesn't work for pm_runtime_put_sync(), pm_runtime_suspend() etc,
because then the runtime suspend is executed in the context of the caller,
not in the context of dev->power.work.
So it's not a full solution, but hopefully something that gets you
going.  I'm not really familiar with the code paths leading to
nvkm_i2c_aux_acquire() to come up with a full solution off the top
of my head I'm afraid.
OK, I was considering doing something similar to that commit beforehand but I
wasn't sure if I was going to just be hacking around an actual issue. That
doesn't seem to be the case. This is very helpful and hopefully I should be able
to figure something out from this, thanks!
...
Note, it's not sufficient to just check pm_runtime_status_suspending(dev)
because if the runtime_suspend is carried out concurrently by something
else, this will return true but it's not guaranteed that the device is
actually kept awake until the i2c communication has been fully performed.
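To make the two conditions concrete, here is a small userspace model of the check proposed for nvkm_i2c_aux_acquire(). It is a sketch under assumptions, not kernel code: model_device, model_work and the model_* functions are hypothetical stand-ins for struct device, dev->power.work and current_work(), and the "cur" argument plays the role of the work item the caller is running in (NULL when the caller is not a work item at all).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures named in the thread. */
enum model_rpm_status {
	MODEL_RPM_ACTIVE,
	MODEL_RPM_RESUMING,
	MODEL_RPM_SUSPENDED,
	MODEL_RPM_SUSPENDING,
};

struct model_work { int unused; };

struct model_device {
	enum model_rpm_status runtime_status;
	struct model_work pm_work;	/* models dev->power.work */
};

/* Is the caller running as this device's own PM work item? */
static bool model_is_pm_work(const struct model_device *dev,
			     const struct model_work *cur)
{
	assert(dev != NULL);
	return cur != NULL && cur == &dev->pm_work;
}

static bool model_status_suspending(const struct model_device *dev)
{
	assert(dev != NULL);
	return dev->runtime_status == MODEL_RPM_SUSPENDING;
}

/* The combined check: skip the get only inside our own async suspend. */
static bool model_needs_rpm_get(const struct model_device *dev,
				const struct model_work *cur)
{
	return !(model_is_pm_work(dev, cur) && model_status_suspending(dev));
}

/* The insufficient variant: looks at the status alone. */
static bool model_status_only_needs_get(const struct model_device *dev)
{
	return !model_status_suspending(dev);
}
```

With the status at SUSPENDING, model_needs_rpm_get() returns false only when "cur" is the device's own PM work item. A caller in any other context (NULL, or an unrelated work item) is still told to take a reference, which is the synchronous-suspend case where the trick does not help. The status-only variant skips the get for every caller, including one racing with a suspend carried out by something else.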
HTH,
Lukas

    
