The following three patches remove unnecessary locks around ioctls in the drm module.
First two:
[PATCH 1/3] drm: no need to hold global mutex for static data
[PATCH 2/3] drm: make DRM_UNLOCKED ioctls with their own mutex
are rather trivial and straightforward and probably do not need much explanation.
The third one:
[PATCH 3/3] drm: do not sleep on vblank while holding a mutex
is more serious and fixes a stall that can occur if multiple processes call drm_wait_vblank, as explained in the patch commit message.
drm_getcap and drm_version ioctls only read static data; there is no need to protect them with drm_global_mutex, so make them DRM_UNLOCKED
Signed-off-by: Ilija Hadzic ihadzic@research.bell-labs.com
---
 drivers/gpu/drm/drm_drv.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index 7a87e08..e159f17 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -60,14 +60,14 @@ static int drm_version(struct drm_device *dev, void *data,
 
 /** Ioctl table */
 static struct drm_ioctl_desc drm_ioctls[] = {
-	DRM_IOCTL_DEF(DRM_IOCTL_VERSION, drm_version, 0),
+	DRM_IOCTL_DEF(DRM_IOCTL_VERSION, drm_version, DRM_UNLOCKED),
 	DRM_IOCTL_DEF(DRM_IOCTL_GET_UNIQUE, drm_getunique, 0),
 	DRM_IOCTL_DEF(DRM_IOCTL_GET_MAGIC, drm_getmagic, 0),
 	DRM_IOCTL_DEF(DRM_IOCTL_IRQ_BUSID, drm_irq_by_busid, DRM_MASTER|DRM_ROOT_ONLY),
 	DRM_IOCTL_DEF(DRM_IOCTL_GET_MAP, drm_getmap, 0),
 	DRM_IOCTL_DEF(DRM_IOCTL_GET_CLIENT, drm_getclient, 0),
 	DRM_IOCTL_DEF(DRM_IOCTL_GET_STATS, drm_getstats, 0),
-	DRM_IOCTL_DEF(DRM_IOCTL_GET_CAP, drm_getcap, 0),
+	DRM_IOCTL_DEF(DRM_IOCTL_GET_CAP, drm_getcap, DRM_UNLOCKED),
 	DRM_IOCTL_DEF(DRM_IOCTL_SET_VERSION, drm_setversion, DRM_MASTER),
 
 	DRM_IOCTL_DEF(DRM_IOCTL_SET_UNIQUE, drm_setunique, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY),
On Tue, Oct 25, 2011 at 10:20:12PM -0400, Ilija Hadzic wrote:
drm_getcap and drm_version ioctls only read static data; there is no need to protect them with drm_global_mutex, so make them DRM_UNLOCKED
Signed-off-by: Ilija Hadzic ihadzic@research.bell-labs.com
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
drm_getmap, drm_getclient and drm_getstats are all protected with their own mutex (dev->struct_mutex); there is no need to hold the global mutex, so make them DRM_UNLOCKED
Signed-off-by: Ilija Hadzic ihadzic@research.bell-labs.com
---
 drivers/gpu/drm/drm_drv.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index e159f17..dbabcb0 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -64,9 +64,9 @@ static struct drm_ioctl_desc drm_ioctls[] = {
 	DRM_IOCTL_DEF(DRM_IOCTL_GET_UNIQUE, drm_getunique, 0),
 	DRM_IOCTL_DEF(DRM_IOCTL_GET_MAGIC, drm_getmagic, 0),
 	DRM_IOCTL_DEF(DRM_IOCTL_IRQ_BUSID, drm_irq_by_busid, DRM_MASTER|DRM_ROOT_ONLY),
-	DRM_IOCTL_DEF(DRM_IOCTL_GET_MAP, drm_getmap, 0),
-	DRM_IOCTL_DEF(DRM_IOCTL_GET_CLIENT, drm_getclient, 0),
-	DRM_IOCTL_DEF(DRM_IOCTL_GET_STATS, drm_getstats, 0),
+	DRM_IOCTL_DEF(DRM_IOCTL_GET_MAP, drm_getmap, DRM_UNLOCKED),
+	DRM_IOCTL_DEF(DRM_IOCTL_GET_CLIENT, drm_getclient, DRM_UNLOCKED),
+	DRM_IOCTL_DEF(DRM_IOCTL_GET_STATS, drm_getstats, DRM_UNLOCKED),
 	DRM_IOCTL_DEF(DRM_IOCTL_GET_CAP, drm_getcap, DRM_UNLOCKED),
 	DRM_IOCTL_DEF(DRM_IOCTL_SET_VERSION, drm_setversion, DRM_MASTER),
On Tue, Oct 25, 2011 at 10:20:13PM -0400, Ilija Hadzic wrote:
drm_getmap, drm_getclient and drm_getstats are all protected with their own mutex (dev->struct_mutex); there is no need to hold the global mutex, so make them DRM_UNLOCKED
Signed-off-by: Ilija Hadzic ihadzic@research.bell-labs.com
Just to check before I dig into reviewing this: have you checked all the other users of the data structures that these functions touch, to make sure they don't accidentally rely on the global lock being taken? Just because there's some locking around doesn't mean it actually protects much ...
I ask because I've just recently waded through the non-kms drm cruft again and there's some horrible stuff there.
Yours, Daniel
On Wed, 26 Oct 2011, Daniel Vetter wrote:
Just to check before I dig into reviewing this: have you checked all the other users of the data structures that these functions touch, to make sure they don't accidentally rely on the global lock being taken?
I did and thought it was safe. I re-checked this morning, prompted by your note, and of course there is one (easily fixable) thing that I missed. Here is the full "report":

drm_getclient is safe: the only thing that should be protected there is dev->filelist, and that is protected with struct_mutex in all the other functions that use it.

drm_getstats is safe: actually, I think there is no need for any mutex there at all, because the loop runs through quasi-static data (set at load time only), and the actual count access is done with atomic_read().

drm_getmap has one problem, which I can fix (and it's the hardest to follow): the structure that should be protected there is dev->maplist. There are three functions that may potentially be an issue:

drm_getsarea doesn't grab any lock before touching dev->maplist (so the global mutex won't help here either). However, drivers seem to call it only at initialization time, so it probably doesn't matter.

drm_master_destroy doesn't grab any lock either, but it is called from drm_master_put, which is protected with dev->struct_mutex everywhere else in the drm module. I also see a couple of calls to drm_master_destroy from the vmwgfx driver that look unlocked to me, but drm_global_mutex won't help there either.

drm_getsareactx uses struct_mutex, but it apparently releases it too early (that's the problem I missed initially). If it's released only after the function is done with dev->maplist, we should be fine.
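For illustration, here is roughly what the fix amounts to. The body below is paraphrased from the drm_context.c of that era, so details may differ; the point is simply that struct_mutex must stay held across the dev->maplist walk:

int drm_getsareactx(struct drm_device *dev, void *data,
		    struct drm_file *file_priv)
{
	struct drm_ctx_priv_map *request = data;
	struct drm_local_map *map;
	struct drm_map_list *_entry;

	mutex_lock(&dev->struct_mutex);

	map = idr_find(&dev->ctx_idr, request->ctx_id);
	if (!map) {
		mutex_unlock(&dev->struct_mutex);
		return -EINVAL;
	}

	request->handle = NULL;
	list_for_each_entry(_entry, &dev->maplist, head) {
		if (_entry->map == map) {
			request->handle =
			    (void *)(unsigned long)_entry->user_token;
			break;
		}
	}

	/* previously the mutex was dropped right after idr_find(),
	 * leaving the maplist walk above unprotected */
	mutex_unlock(&dev->struct_mutex);

	if (request->handle == NULL)
		return -EINVAL;

	return 0;
}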
I'll redo this patch with a fix to drm_getsareactx, and that should do it. I would still appreciate another pair of eyes to make sure I didn't miss anything else.
thanks,
-- Ilija
On Wed, Oct 26, 2011 at 11:11:14AM -0500, Ilija Hadzic wrote:
On Wed, 26 Oct 2011, Daniel Vetter wrote:
Just to check before I dig into reviewing this: have you checked all the other users of the data structures that these functions touch, to make sure they don't accidentally rely on the global lock being taken?

I did and thought it was safe. I re-checked this morning, prompted by your note, and of course there is one (easily fixable) thing that I missed. Here is the full "report":

drm_getclient is safe: the only thing that should be protected there is dev->filelist, and that is protected with struct_mutex in all the other functions that use it.

drm_getstats is safe: actually, I think there is no need for any mutex there at all, because the loop runs through quasi-static data (set at load time only), and the actual count access is done with atomic_read().

drm_getmap has one problem, which I can fix (and it's the hardest to follow): the structure that should be protected there is dev->maplist. There are three functions that may potentially be an issue:

drm_getsarea doesn't grab any lock before touching dev->maplist (so the global mutex won't help here either). However, drivers seem to call it only at initialization time, so it probably doesn't matter.

drm_master_destroy doesn't grab any lock either, but it is called from drm_master_put, which is protected with dev->struct_mutex everywhere else in the drm module. I also see a couple of calls to drm_master_destroy from the vmwgfx driver that look unlocked to me, but drm_global_mutex won't help there either.

drm_getsareactx uses struct_mutex, but it apparently releases it too early (that's the problem I missed initially). If it's released only after the function is done with dev->maplist, we should be fine.

I'll redo this patch with a fix to drm_getsareactx, and that should do it. I would still appreciate another pair of eyes to make sure I didn't miss anything else.
Awesome. Please include this in the patch description, and if the description gets too complicated, maybe even split up the patches into the individual cases. Also, when you can convince yourself that drm_getstats doesn't need the struct_mutex, I think it'd be best to drop it. Locking that doesn't actually protect anything just confuses, especially here in drm where the locking isn't too clear to begin with.
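For concreteness, the drm_getstats of that era with the struct_mutex simply dropped would look roughly like this (body paraphrased from drm_ioctl.c; the counter tables are set up once at load time and the live values are read with atomic_read(), so no lock is needed):

int drm_getstats(struct drm_device *dev, void *data,
		 struct drm_file *file_priv)
{
	struct drm_stats *stats = data;
	int i;

	memset(stats, 0, sizeof(*stats));

	/* dev->counters and dev->types are quasi-static (written only at
	 * load time); the live counter values are read atomically */
	for (i = 0; i < dev->counters; i++) {
		if (dev->types[i] == _DRM_STAT_LOCK)
			stats->data[i].value =
			    (file_priv->master->lock.hw_lock ?
			     file_priv->master->lock.hw_lock->lock : 0);
		else
			stats->data[i].value = atomic_read(&dev->counts[i]);
		stats->data[i].type = dev->types[i];
	}

	stats->count = dev->counters;
	return 0;
}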
Yours, Daniel
Holding the drm_global_mutex in drm_wait_vblank and then going to sleep (via DRM_WAIT_ON) is a bad idea in general, because it can block other processes that may issue ioctls that also grab drm_global_mutex. Blocking can occur until the next vblank, which is on the order of tens of milliseconds.
Fix it by making drm_wait_vblank a DRM_UNLOCKED call, grabbing the mutex inside the drm_wait_vblank function, and holding it only for the duration of the critical section (i.e., from the first manipulation of the vblank sequence number until the task is put in the wait queue).
Signed-off-by: Ilija Hadzic ihadzic@research.bell-labs.com
---
 drivers/gpu/drm/drm_drv.c  |    2 +-
 drivers/gpu/drm/drm_irq.c  |   16 +++++++++++-----
 include/drm/drm_os_linux.h |   25 +++++++++++++++++++++++++
 3 files changed, 37 insertions(+), 6 deletions(-)
diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index dbabcb0..dc0eb0b 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -124,7 +124,7 @@ static struct drm_ioctl_desc drm_ioctls[] = {
 	DRM_IOCTL_DEF(DRM_IOCTL_SG_ALLOC, drm_sg_alloc_ioctl, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY),
 	DRM_IOCTL_DEF(DRM_IOCTL_SG_FREE, drm_sg_free, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY),
 
-	DRM_IOCTL_DEF(DRM_IOCTL_WAIT_VBLANK, drm_wait_vblank, 0),
+	DRM_IOCTL_DEF(DRM_IOCTL_WAIT_VBLANK, drm_wait_vblank, DRM_UNLOCKED),
 
 	DRM_IOCTL_DEF(DRM_IOCTL_MODESET_CTL, drm_modeset_ctl, 0),
 
diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
index 3830e9e..d85d604 100644
--- a/drivers/gpu/drm/drm_irq.c
+++ b/drivers/gpu/drm/drm_irq.c
@@ -1191,6 +1191,7 @@ int drm_wait_vblank(struct drm_device *dev, void *data,
 		DRM_DEBUG("failed to acquire vblank counter, %d\n", ret);
 		return ret;
 	}
+	mutex_lock(&drm_global_mutex);
 	seq = drm_vblank_count(dev, crtc);
 
 	switch (vblwait->request.type & _DRM_VBLANK_TYPES_MASK) {
@@ -1200,12 +1201,15 @@ int drm_wait_vblank(struct drm_device *dev, void *data,
 	case _DRM_VBLANK_ABSOLUTE:
 		break;
 	default:
+		mutex_unlock(&drm_global_mutex);
 		ret = -EINVAL;
 		goto done;
 	}
 
-	if (flags & _DRM_VBLANK_EVENT)
+	if (flags & _DRM_VBLANK_EVENT) {
+		mutex_unlock(&drm_global_mutex);
 		return drm_queue_vblank_event(dev, crtc, vblwait, file_priv);
+	}
 
 	if ((flags & _DRM_VBLANK_NEXTONMISS) &&
 	    (seq - vblwait->request.sequence) <= (1<<23)) {
@@ -1215,10 +1219,12 @@ int drm_wait_vblank(struct drm_device *dev, void *data,
 	DRM_DEBUG("waiting on vblank count %d, crtc %d\n",
 		  vblwait->request.sequence, crtc);
 	dev->last_vblank_wait[crtc] = vblwait->request.sequence;
-	DRM_WAIT_ON(ret, dev->vbl_queue[crtc], 3 * DRM_HZ,
-		    (((drm_vblank_count(dev, crtc) -
-		       vblwait->request.sequence) <= (1 << 23)) ||
-		     !dev->irq_enabled));
+	/* DRM_WAIT_ON_LOCKED will release drm_global_mutex */
+	/* before going to sleep */
+	DRM_WAIT_ON_LOCKED(ret, dev->vbl_queue[crtc], 3 * DRM_HZ,
+			   (((drm_vblank_count(dev, crtc) -
+			      vblwait->request.sequence) <= (1 << 23)) ||
+			    !dev->irq_enabled));
 
 	if (ret != -EINTR) {
 		struct timeval now;
diff --git a/include/drm/drm_os_linux.h b/include/drm/drm_os_linux.h
index 3933691..fc6aaf4 100644
--- a/include/drm/drm_os_linux.h
+++ b/include/drm/drm_os_linux.h
@@ -123,5 +123,30 @@ do {								\
 	remove_wait_queue(&(queue), &entry);			\
 } while (0)
 
+#define DRM_WAIT_ON_LOCKED( ret, queue, timeout, condition )	\
+do {								\
+	DECLARE_WAITQUEUE(entry, current);			\
+	unsigned long end = jiffies + (timeout);		\
+	add_wait_queue(&(queue), &entry);			\
+	mutex_unlock(&drm_global_mutex);			\
+								\
+	for (;;) {						\
+		__set_current_state(TASK_INTERRUPTIBLE);	\
+		if (condition)					\
+			break;					\
+		if (time_after_eq(jiffies, end)) {		\
+			ret = -EBUSY;				\
+			break;					\
+		}						\
+		schedule_timeout((HZ/100 > 1) ? HZ/100 : 1);	\
+		if (signal_pending(current)) {			\
+			ret = -EINTR;				\
+			break;					\
+		}						\
+	}							\
+	__set_current_state(TASK_RUNNING);			\
+	remove_wait_queue(&(queue), &entry);			\
+} while (0)
+
 #define DRM_WAKEUP( queue )	wake_up( queue )
 #define DRM_INIT_WAITQUEUE( queue )	init_waitqueue_head( queue )
On Tue, Oct 25, 2011 at 10:20:14PM -0400, Ilija Hadzic wrote:
Holding the drm_global_mutex in drm_wait_vblank and then going to sleep (via DRM_WAIT_ON) is a bad idea in general, because it can block other processes that may issue ioctls that also grab drm_global_mutex. Blocking can occur until the next vblank, which is on the order of tens of milliseconds.

Fix it by making drm_wait_vblank a DRM_UNLOCKED call, grabbing the mutex inside the drm_wait_vblank function, and holding it only for the duration of the critical section (i.e., from the first manipulation of the vblank sequence number until the task is put in the wait queue).
Signed-off-by: Ilija Hadzic ihadzic@research.bell-labs.com
While you're messing around with this code, please use the standard Linux wait_event_* infrastructure. I want to stop that DRM_FOO yelling from spreading around - the idea of the drm core being OS agnostic died quite a while ago.
Again, have you carefully checked whether this is safe and how the relevant data structures (vblank counting) are protected?
One comment below, mind though that I've only glanced over the patch. -Daniel
 drivers/gpu/drm/drm_drv.c  |    2 +-
 drivers/gpu/drm/drm_irq.c  |   16 +++++++++++-----
 include/drm/drm_os_linux.h |   25 +++++++++++++++++++++++++
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
index dbabcb0..dc0eb0b 100644
--- a/drivers/gpu/drm/drm_drv.c
+++ b/drivers/gpu/drm/drm_drv.c
@@ -124,7 +124,7 @@ static struct drm_ioctl_desc drm_ioctls[] = {
 	DRM_IOCTL_DEF(DRM_IOCTL_SG_ALLOC, drm_sg_alloc_ioctl, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY),
 	DRM_IOCTL_DEF(DRM_IOCTL_SG_FREE, drm_sg_free, DRM_AUTH|DRM_MASTER|DRM_ROOT_ONLY),
 
-	DRM_IOCTL_DEF(DRM_IOCTL_WAIT_VBLANK, drm_wait_vblank, 0),
+	DRM_IOCTL_DEF(DRM_IOCTL_WAIT_VBLANK, drm_wait_vblank, DRM_UNLOCKED),
 
 	DRM_IOCTL_DEF(DRM_IOCTL_MODESET_CTL, drm_modeset_ctl, 0),
 
diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
index 3830e9e..d85d604 100644
--- a/drivers/gpu/drm/drm_irq.c
+++ b/drivers/gpu/drm/drm_irq.c
@@ -1191,6 +1191,7 @@ int drm_wait_vblank(struct drm_device *dev, void *data,
 		DRM_DEBUG("failed to acquire vblank counter, %d\n", ret);
 		return ret;
 	}
+	mutex_lock(&drm_global_mutex);
 	seq = drm_vblank_count(dev, crtc);
 
 	switch (vblwait->request.type & _DRM_VBLANK_TYPES_MASK) {
@@ -1200,12 +1201,15 @@ int drm_wait_vblank(struct drm_device *dev, void *data,
 	case _DRM_VBLANK_ABSOLUTE:
 		break;
 	default:
+		mutex_unlock(&drm_global_mutex);
 		ret = -EINVAL;
 		goto done;
 	}
 
-	if (flags & _DRM_VBLANK_EVENT)
+	if (flags & _DRM_VBLANK_EVENT) {
+		mutex_unlock(&drm_global_mutex);
 		return drm_queue_vblank_event(dev, crtc, vblwait, file_priv);
+	}
 
 	if ((flags & _DRM_VBLANK_NEXTONMISS) &&
 	    (seq - vblwait->request.sequence) <= (1<<23)) {
@@ -1215,10 +1219,12 @@ int drm_wait_vblank(struct drm_device *dev, void *data,
 	DRM_DEBUG("waiting on vblank count %d, crtc %d\n",
 		  vblwait->request.sequence, crtc);
 	dev->last_vblank_wait[crtc] = vblwait->request.sequence;
-	DRM_WAIT_ON(ret, dev->vbl_queue[crtc], 3 * DRM_HZ,
-		    (((drm_vblank_count(dev, crtc) -
-		       vblwait->request.sequence) <= (1 << 23)) ||
-		     !dev->irq_enabled));
+	/* DRM_WAIT_ON_LOCKED will release drm_global_mutex */
+	/* before going to sleep */
+	DRM_WAIT_ON_LOCKED(ret, dev->vbl_queue[crtc], 3 * DRM_HZ,
+			   (((drm_vblank_count(dev, crtc) -
+			      vblwait->request.sequence) <= (1 << 23)) ||
+			    !dev->irq_enabled));
 
 	if (ret != -EINTR) {
 		struct timeval now;
diff --git a/include/drm/drm_os_linux.h b/include/drm/drm_os_linux.h
index 3933691..fc6aaf4 100644
--- a/include/drm/drm_os_linux.h
+++ b/include/drm/drm_os_linux.h
@@ -123,5 +123,30 @@ do {								\
 	remove_wait_queue(&(queue), &entry);			\
 } while (0)
 
+#define DRM_WAIT_ON_LOCKED( ret, queue, timeout, condition )	\
+do {								\
+	DECLARE_WAITQUEUE(entry, current);			\
+	unsigned long end = jiffies + (timeout);		\
+	add_wait_queue(&(queue), &entry);			\
+	mutex_unlock(&drm_global_mutex);			\
Hiding a lock-dropping in this fashion is horrible.
+								\
+	for (;;) {						\
+		__set_current_state(TASK_INTERRUPTIBLE);	\
+		if (condition)					\
+			break;					\
+		if (time_after_eq(jiffies, end)) {		\
+			ret = -EBUSY;				\
+			break;					\
+		}						\
+		schedule_timeout((HZ/100 > 1) ? HZ/100 : 1);	\
+		if (signal_pending(current)) {			\
+			ret = -EINTR;				\
+			break;					\
+		}						\
+	}							\
+	__set_current_state(TASK_RUNNING);			\
+	remove_wait_queue(&(queue), &entry);			\
+} while (0)
+
 #define DRM_WAKEUP( queue )	wake_up( queue )
 #define DRM_INIT_WAITQUEUE( queue )	init_waitqueue_head( queue )
On Wed, 26 Oct 2011, Daniel Vetter wrote:
While you mess around with this code, please use the standard linux wait_event_* infrastructure.
Right now the order of entering the queue is guaranteed to be the same as the order of entering the ioctl, which reflects the order of the progressing vblank sequence. I wanted to preserve this semantic, so I need a wait function that puts the task into the queue and then unlocks the mutex (essentially a kernel equivalent of the pthread_cond_wait function that the POSIX threads library implements).
The closest approximation of that is wait_event_interruptible_locked(), but that requires a spinlock, not a mutex (and thus the rework to the drm_wait_vblank ioctl would be more radical).
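For comparison, this is the userland pattern being referred to - a generic POSIX sketch, not code from the patch:

#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int condition;

void wait_for_condition(void)
{
	pthread_mutex_lock(&lock);
	while (!condition) {
		/* atomically releases the mutex and sleeps; the thread is
		 * already queued on the condition variable before anyone
		 * else can grab the lock, which is exactly the ordering
		 * property wanted for the vblank wait */
		pthread_cond_wait(&cond, &lock);
	}
	pthread_mutex_unlock(&lock);
}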
I want to stop that DRM_FOO yelling from spreading around - the idea of the drm core being os agnostic died quite a while ago.
Is DRM support for other OSes officially dead? I know it's not in the best shape on BSD and friends, but I am not sure if I should make it worse (or inconsistent with the current shape and form of the code).
Anyway, the primary reason for implementing DRM_WAIT_ON_LOCKED is that I didn't have a function that enqueues the task and then releases the mutex.
Again, have you carefully checked whether this is safe and how the relevant data structures (vblank counting) are proctected?
I am only moving the global lock further down in the function. The structures moved out of the critical section are only local variables and arguments, so it's safe. Had I changed the mutex to something other than the global one (which is the right thing to do in the long run), a more careful review would be warranted.
Also note that running drm_wait_vblank as a locked ioctl is a severe problem that we have to address quickly. So I'd like to get *some* kind of decent fix in this merge window and then follow up with more polishing later.
For example, try compiling and running this code (which any bozo in userland can do):
#include <stdio.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <xf86drm.h>

int main(void)
{
	int fd;
	drmVBlank vbl;

	fd = open("/dev/dri/card0", 0);
	printf("fd=%d\n", fd);

	while (1) {
		vbl.request.type = DRM_VBLANK_RELATIVE;
		vbl.request.sequence = 60;
		printf("waiting on vblank\n");
		ioctl(fd, DRM_IOCTL_WAIT_VBLANK, &vbl);
	}
}
Then start glxgears (on CRTC 0) and watch hell break loose. The hostile code will cause any OpenGL application to hang for a full second before it can even enter the vblank wait. Because all vblank waits go through a common function in the DDX, the whole windowing system consequently slows to a crawl. Run it with Compiz and things are even worse (the whole windowing system goes "stupid" as soon as the hostile application is started).
+	add_wait_queue(&(queue), &entry);			\
+	mutex_unlock(&drm_global_mutex);			\
Hiding a lock-dropping in this fashion is horrible.
I explained above why I have to release the lock inside the macro. It's equivalent to what you can find in userland in POSIX threads, like I said. So it's not that bad.
What *is* bad is that I should have passed the mutex as an argument to DRM_WAIT_ON_LOCKED; that I'll fix.
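For illustration, the proposed fix would make the macro look something like this (a sketch, not the final patch - same body as the DRM_WAIT_ON_LOCKED above, but with the lock to drop passed in explicitly instead of hard-coding drm_global_mutex):

#define DRM_WAIT_ON_LOCKED(ret, queue, timeout, mutex, condition)	\
do {									\
	DECLARE_WAITQUEUE(entry, current);				\
	unsigned long end = jiffies + (timeout);			\
	add_wait_queue(&(queue), &entry);				\
	/* the lock being dropped is now visible at the call site */	\
	mutex_unlock(&(mutex));						\
									\
	for (;;) {							\
		__set_current_state(TASK_INTERRUPTIBLE);		\
		if (condition)						\
			break;						\
		if (time_after_eq(jiffies, end)) {			\
			ret = -EBUSY;					\
			break;						\
		}							\
		schedule_timeout((HZ/100 > 1) ? HZ/100 : 1);		\
		if (signal_pending(current)) {				\
			ret = -EINTR;					\
			break;						\
		}							\
	}								\
	__set_current_state(TASK_RUNNING);				\
	remove_wait_queue(&(queue), &entry);				\
} while (0)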
-- Ilija
On Wed, Oct 26, 2011 at 05:33:24PM -0500, Ilija Hadzic wrote:
Is DRM support for other OSes officially dead? I know it's not in the best shape on BSD and friends, but I am not sure if I should make it worse (or inconsistent with the current shape and form of the code).
I think the idea of sharing kernel drm code is pretty much dead, yeah.
int main(void)
{
	int fd;
	drmVBlank vbl;

	fd = open("/dev/dri/card0", 0);
	printf("fd=%d\n", fd);

	while (1) {
		vbl.request.type = DRM_VBLANK_RELATIVE;
		vbl.request.sequence = 60;
		printf("waiting on vblank\n");
		ioctl(fd, DRM_IOCTL_WAIT_VBLANK, &vbl);
	}
}
Then start glxgears (on CRTC 0) and watch hell break loose. The hostile code will cause any OpenGL application to hang for a full second before it can even enter the vblank wait. Because all vblank waits go through a common function in the DDX, the whole windowing system consequently slows to a crawl. Run it with Compiz and things are even worse (the whole windowing system goes "stupid" as soon as the hostile application is started).
This is really just horrible.
+	add_wait_queue(&(queue), &entry);			\
+	mutex_unlock(&drm_global_mutex);			\
Hiding a lock-dropping in this fashion is horrible.
I explained above why I have to release the lock inside the macro. It's equivalent to what you can find in userland in POSIX threads, like I said. So it's not that bad.
I disagree - this is just the BKL magic reinvented. Also, I don't like to rush locking changes, because the hard part is reviewing all the drivers, and I prefer to do that just once, for the real solution.
Also, explaining locking changes (especially in drm) starting with "assuming the old code is correct, this is safe" isn't great, because your changes might simply enlarge an already existing race.
On a quick glance:
- drm_vblank functions called by wait_vblank seem to do correct locking already, using various spinlocks and atomic ops.
- linux waitqueues have their own locking.
- dev->last_vblank_wait is only used for a procfs file. I don't think it makes much sense given that we can have more than one vblank waiter.
- there's no selective wakeup going on; all waiters get woken for every vblank and call schedule again if it's not the right vblank.
The only really hairy thing going on is racing modeset ioctls against vblank waiters. But modeset is protected by the dev->mode_config.mutex and hence already races against wait_vblank with the current code, so I'm slightly inclined to ignore this for the moment. Would be great if you could check that in-depth, too.
So I think the right thing to do is:
- Kill dev->last_vblank_wait (in a prep patch).
- Imo we also have a few too many atomic ops going on, e.g. dev->vblank_refcount looks like it's atomic only because of the procfs file, so maybe kill that procfs file entirely.
- Audit the vblank locking, maybe resulting in a patch with updated comments, if there's anything misleading or tricky going on. And it's always good when the locking of struct members is explained in the header file ...
- Drop the mutex locking because it's not needed.
While looking at the code I also noticed that wait_vblank calls drm_vblank_get, but not always drm_vblank_put. The code is correct; the missing vblank_put is hidden in drm_queue_vblank_event. Can you also write a patch to move that around, to not trip up reviewers the next time around?
Yours, Daniel
On 10/27/11 03:43, Daniel Vetter wrote:
On Wed, Oct 26, 2011 at 05:33:24PM -0500, Ilija Hadzic wrote:
Is DRM support for other OS-es officially dead ? I know it's not in best shape on BSD and friends, but I am not sure if I should make it worse (or inconsistent with the current code shape and form).
I think the idea of sharing kernel drm code is pretty much dead, yeah.
We are still trying to keep it alive, despite what some may think.
On Thu, 27 Oct 2011, Alan Coopersmith wrote:
I think the idea of sharing kernel drm code is pretty much dead, yeah.
We are still trying to keep it alive, despite what some may think.
In the context of this patch (in progress), it's probably a moot topic, because per the comments that Daniel sent, most of the locks in drm_wait_vblank will become unnecessary after I rework the patch, and I won't need to introduce the (new) DRM_WAIT_ON_LOCKED any more.
I will, however, *not* convert existing DRM_WAIT_ON to a Linux-centric function, so I will not be the one to kill (or contribute to the killing of) the portability of that file. That I leave to someone else ;-).
P.S. If the "fight" starts, I will side with the "keep the portability alive" camp. I have "emotional" ties to other (less widely used) operating systems. ;-)
-- Ilija
On Thu, 27 Oct 2011, Daniel Vetter wrote:
On a quick glance
- drm_vblank functions called by wait_vblank seem to do correct locking
already using various spinlocks and atomic ops.
- linux waitqueues have their own locking
- dev->last_vblank_wait is only used for a procfs file. I don't think it
makes much sense given that we can have more than one vblank waiter
- there's no selective wakeup going on, all waiters get woken for every
vblank and call schedule again if it's not the right vblank
I concur.
The only really hairy thing going on is racing modeset ioctls against vblank waiters. But modeset is protected by the dev->mode_config.mutex and hence already races against wait_vblank with the current code, so I'm slightly inclined to ignore this for the moment. Would be great if you could check that in-depth, too.
Will leave this for some other patch at some other time; the critical path is to fix the hang/crawl that I explained in my previous note.
So I think the right thing to do is
- Kill dev->last_vblank_wait (in a prep patch).
Agreed. Also drm_vblank_info function can go away
- Imo we also have a few too many atomic ops going on, e.g.
dev->vblank_refcount looks like it's atomic only because of the procfs file, so maybe kill that procfs file entirely.
I am not sure about that. dev->vblank_refcount looks critical to me: it is incremented through drm_vblank_get, which is called from several different contexts: page-flip functions in several drivers (which can run in the context of multiple user processes), and the **_pm_** stuff in the radeon driver (which looks like it runs in the context of a kernel worker thread). The mutexes used at all these different places are not quite consistent. If anything, as the result of a later audit, some other mutexes may be rendered unnecessary, but I would definitely keep vblank_refcount atomic.
- Audit the vblank locking, maybe resulting in a patch with updated
comments, if there's anything misleading or tricky going on. And it's always good when the locking of struct members is explained in the header file ...
I'll add whatever comments I feel make sense for this set of patches (if anything). A full audit should come later, at a slower pace. There are quite a few mutexes and locks, many of which overlap in function and some of which are spurious. It will take more than just myself to untangle this knot.
- Drop the mutex locking because it's not needed.
Agreed. Will drop.
While looking at the code I also noticed that wait_vblank calls drm_vblank_get, but not always drm_vblank_put. The code is correct; the missing vblank_put is hidden in drm_queue_vblank_event. Can you also write a patch to move that around, to not trip up reviewers the next time around?
Actually, there is something fishy here. Look at the 'else' branch under the 'if ((seq - vblwait->request.sequence) <= (1 << 23))' at the end of drm_queue_vblank_event. It doesn't have a 'drm_vblank_put'. It looks wrong to me, but I need to first convince myself that there is not some other obscure place where the drm_vblank_put is accounted for. If that branch is really missing a drm_vblank_put, then it will be easy to pull the drm_vblank_put out. Otherwise, it will be another knot to untangle.
On Thu, 27 Oct 2011, Ilija Hadzic wrote:
While looking at the code I also noticed that wait_vblank calls drm_vblank_get, but not always drm_vblank_put. The code is correct; the missing vblank_put is hidden in drm_queue_vblank_event. Can you also write a patch to move that around, to not trip up reviewers the next time around?
Actually, there is something fishy here. Look at the 'else' branch under the 'if ((seq - vblwait->request.sequence) <= (1 << 23))' at the end of drm_queue_vblank_event. It doesn't have a 'drm_vblank_put'. It looks wrong to me, but I need to first convince myself that there is not some other obscure place where the drm_vblank_put is accounted for. If that branch is really missing a drm_vblank_put, then it will be easy to pull the drm_vblank_put out. Otherwise, it will be another knot to untangle.
OK, I checked this. Although the code is a little hard to read, it is done the way it is for a reason. Whenever drm_queue_vblank_event takes that 'else' branch, drm_vblank_put should *not* be called. The reason is that the relevant vblank has not come in yet, so the reference must remain up so that vblank interrupts are not disabled until the event occurs.
The seemingly missing drm_vblank_put is accounted for in drm_handle_vblank_events.
So I should not move anything around. If anything, this should be annotated with comments to explain what's going on.
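For example, the annotation could look something like this; the surrounding code is paraphrased from the drm_queue_vblank_event of that era, so only the comments are the point here:

	if ((seq - vblwait->request.sequence) <= (1 << 23)) {
		/* the requested vblank has already passed: deliver the
		 * event immediately and drop the reference taken by
		 * drm_vblank_get() */
		e->event.sequence = seq;
		drm_vblank_put(dev, pipe);
		list_add_tail(&e->base.link, &e->base.file_priv->event_list);
		wake_up_interruptible(&e->base.file_priv->event_wait);
	} else {
		/* the event is still pending: keep the drm_vblank_get()
		 * reference so vblank interrupts stay enabled; the matching
		 * drm_vblank_put() happens in drm_handle_vblank_events()
		 * when the event finally fires */
		list_add_tail(&e->base.link, &dev->vblank_event_list);
	}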
-- Ilija
On Don, 2011-10-27 at 22:19 -0500, Ilija Hadzic wrote:
On Thu, 27 Oct 2011, Daniel Vetter wrote:
So I think the right thing to do is
- Kill dev->last_vblank_wait (in a prep patch).
Agreed. Also drm_vblank_info function can go away
Actually, I was rather going to submit a patch to hook it up again — AFAICT it was unhooked without any justification. It could be useful for debugging vblank related hangs. Any issues with it, such as last_vblank_wait not being guaranteed to really be the last one, can always be improved later on.
On Fri, Oct 28, 2011 at 08:59:04AM +0200, Michel Dänzer wrote:
On Don, 2011-10-27 at 22:19 -0500, Ilija Hadzic wrote:
On Thu, 27 Oct 2011, Daniel Vetter wrote:
So I think the right thing to do is
- Kill dev->last_vblank_wait (in a prep patch).
Agreed. Also drm_vblank_info function can go away
Actually, I was rather going to submit a patch to hook it up again — AFAICT it was unhooked without any justification. It could be useful for debugging vblank related hangs. Any issues with it, such as last_vblank_wait not being guaranteed to really be the last one, can always be improved later on.
I've thought a bit about the usefulness of it for debugging before proposing to kill it, and I think it can die: it's only really useful for complete hangs; if the issue is that we're just missing a wakeup somewhere, it's not gonna catch that. Hence I think something that allows you to watch things while they're running is much better, i.e. either a drm debug printk or a trace call.
But if you're firmly attached to that debug file, it should be pretty easy to shove it under the protection of one of the vblank spinlocks. -Daniel
On Fri, 28 Oct 2011, Daniel Vetter wrote:
On Fri, Oct 28, 2011 at 08:59:04AM +0200, Michel Dänzer wrote:
On Don, 2011-10-27 at 22:19 -0500, Ilija Hadzic wrote:
On Thu, 27 Oct 2011, Daniel Vetter wrote:
So I think the right thing to do is
- Kill dev->last_vblank_wait (in a prep patch).
Agreed. Also drm_vblank_info function can go away
Actually, I was rather going to submit a patch to hook it up again — AFAICT it was unhooked without any justification. It could be useful for debugging vblank related hangs. Any issues with it, such as last_vblank_wait not being guaranteed to really be the last one, can always be improved later on.
I've thought a bit about the usefulness of it for debugging before proposing to kill it, and I think it can die: it's only really useful for complete hangs; if the issue is that we're just missing a wakeup somewhere, it's not gonna catch that. Hence I think something that allows you to watch things while they're running is much better, i.e. either a drm debug printk or a trace call.
But if you're firmly attached to that debug file, it should be pretty easy to shove it under the protection of one of the vblank spinlocks.
I'll keep it then, and figure out the best mutex/spinlock to use. It can be anything that exists on a per-CRTC basis (vblank waits on different CRTCs are not contending). The critical section runs from the switch in which vblwait->request.sequence is incremented until it is assigned to dev->last_vblank_wait[crtc] (and we are only protecting that section against itself, executed in the contexts of different PIDs).
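As an illustration of the shape of that critical section, here is a rough sketch against the drm_wait_vblank of that era, using dev->vbl_lock purely as a placeholder (the actual lock choice was still open at this point, and the event path is omitted for brevity):

	unsigned long irqflags;

	/* protect the request-sequence update and the last_vblank_wait
	 * bookkeeping against the same section running for another PID */
	spin_lock_irqsave(&dev->vbl_lock, irqflags);

	switch (vblwait->request.type & _DRM_VBLANK_TYPES_MASK) {
	case _DRM_VBLANK_RELATIVE:
		vblwait->request.sequence += seq;
		vblwait->request.type &= ~_DRM_VBLANK_RELATIVE;
		/* fall through */
	case _DRM_VBLANK_ABSOLUTE:
		break;
	default:
		spin_unlock_irqrestore(&dev->vbl_lock, irqflags);
		ret = -EINVAL;
		goto done;
	}

	if ((flags & _DRM_VBLANK_NEXTONMISS) &&
	    (seq - vblwait->request.sequence) <= (1 << 23))
		vblwait->request.sequence = seq + 1;

	dev->last_vblank_wait[crtc] = vblwait->request.sequence;
	spin_unlock_irqrestore(&dev->vbl_lock, irqflags);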
I guess we have now settled on this patch (other comments will be addressed in a different set of patches).
-- Ilija
On Fri, Oct 28, 2011 at 07:10:51AM -0500, Ilija Hadzic wrote:
I'll keep it then, and figure out the best mutex/spinlock to use. It can be anything that exists on a per-CRTC basis (vblank waits on different CRTCs are not contending). The critical section runs from the switch in which vblwait->request.sequence is incremented until it is assigned to dev->last_vblank_wait[crtc] (and we are only protecting that section against itself, executed in the contexts of different PIDs).

I guess we have now settled on this patch (other comments will be addressed in a different set of patches).
If you settle on keeping ->last_vblank_wait I think the easiest is to wrap it with the dev->vbl_lock spinlock everywhere. -Daniel
On Thu, Oct 27, 2011 at 10:19:39PM -0500, Ilija Hadzic wrote:
On Thu, 27 Oct 2011, Daniel Vetter wrote:
The only really hairy thing going on is racing modeset ioctls against vblank waiters. But modeset is protected by the dev->mode_config.mutex and hence already races against wait_vblank with the current code, so I'm slightly inclined to ignore this for the moment. Would be great if you could check that in-depth, too.
Will leave this for some other patch at some other time; the critical path is to fix the hang/crawl that I explained in my previous note.
Agreed, one thing at a time. It's complicated enough as is.
So I think the right thing to do is
- Kill dev->last_vblank_wait (in a prep patch).
Agreed. Also drm_vblank_info function can go away
- Imo we also have a few too many atomic ops going on, e.g.
dev->vblank_refcount looks like it's atomic only because of the procfs file, so maybe kill that procfs file entirely.
I am not sure about that. dev->vblank_refcount looks critical to me: it is incremented through drm_vblank_get, which is called from several different contexts: page-flip functions in several drivers (which can run in the context of multiple user processes), and the **_pm_** stuff in the radeon driver (which looks like it runs in the context of a kernel worker thread). The mutexes used at all these different places are not quite consistent. If anything, as the result of a later audit, some other mutexes may be rendered unnecessary, but I would definitely keep vblank_refcount atomic.
I've re-read the code a bit, and in _get it's protected by dev->vbl_lock, but not so in _put (because we issue a timer call to actually switch it off). I think we could just shove it under the protection of dev->vbl_lock completely. Another fuzzy area is the relation between dev->vbl_lock and dev->vblank_time_lock. The latter looks a bit redundant.
- Audit the vblank locking, maybe resulting in a patch with updated
comments, if there's anything misleading or tricky going on. And it's always good when the locking of struct members is explained in the header file ...
I'll add whatever comments I feel make sense for this set of patches (if anything). A full audit should come later, at a slower pace. There are quite a few mutexes and locks, many of which overlap in function and some of which are spurious. It will take more than just myself to untangle this knot.
Yeah, agreed. Let's first drop the mutex around this and untangle the spinlock/atomic mess in a second step. This is too hairy.
- Drop the mutex locking because it's not needed.
Agreed. Will drop.
While looking at the code I also noticed that wait_vblank calls drm_vblank_get, but not always drm_vblank_put. The code is correct; the missing vblank_put is hidden in drm_queue_vblank_event. Can you also write a patch to move that around, to not trip up reviewers the next time around?
Actually, there is something fishy here. Look at the 'else' branch under the 'if ((seq - vblwait->request.sequence) <= (1 << 23))' at the end of drm_queue_vblank_event. It doesn't have a 'drm_vblank_put'. It looks wrong to me, but I need to first convince myself that there is not some other obscure place where the drm_vblank_put is accounted for. If that branch is really missing a drm_vblank_put, then it will be easy to pull the drm_vblank_put out. Otherwise, it will be another knot to untangle.
I've re-read the code and agree with your follow-up analysis.
On Die, 2011-10-25 at 22:20 -0400, Ilija Hadzic wrote:
Holding the drm_global_mutex in drm_wait_vblank and then going to sleep (via DRM_WAIT_ON) is a bad idea in general, because it can block other processes that may issue ioctls that also grab drm_global_mutex. Blocking can occur until the next vblank, which is on the order of tens of milliseconds.

Fix it by making drm_wait_vblank a DRM_UNLOCKED call, grabbing the mutex inside the drm_wait_vblank function, and holding it only for the duration of the critical section (i.e., from the first manipulation of the vblank sequence number until the task is put in the wait queue).
Does it really need drm_global_mutex at all, as opposed to e.g. dev->struct_mutex?
diff --git a/include/drm/drm_os_linux.h b/include/drm/drm_os_linux.h
index 3933691..fc6aaf4 100644
--- a/include/drm/drm_os_linux.h
+++ b/include/drm/drm_os_linux.h
@@ -123,5 +123,30 @@ do {								\
 	remove_wait_queue(&(queue), &entry);			\
 } while (0)
 
+#define DRM_WAIT_ON_LOCKED( ret, queue, timeout, condition )	\
+do {								\
+	DECLARE_WAITQUEUE(entry, current);			\
+	unsigned long end = jiffies + (timeout);		\
+	add_wait_queue(&(queue), &entry);			\
+	mutex_unlock(&drm_global_mutex);			\
I agree with Daniel's sentiment on this. AFAICT add_wait_queue() protects against concurrent access to the wait queue, so I think it would be better to just drop the mutex explicitly before calling DRM_WAIT_ON.
On Wed, 26 Oct 2011, Michel Dänzer wrote:
Does it really need drm_global_mutex at all, as opposed to e.g. dev->struct_mutex?
It doesn't. Actually, it would probably suffice to have a mutex that is one per CRTC, because vblank waits on different CRTCs don't step on each other, but I was going for a conservative fix. I tried to avoid having to hunt around in other sections of the code for lines that possibly assume that a global lock is held while touching the vblank counter structures (one of the concerns raised by Daniel, which is a non-issue because I didn't change the mutex being used).
The "urgent" part of this patch is to make sure we don't go to sleep while holding the mutex. In my response to Daniel, I sent an example of a simple userland application that can hang up the whole windowing system. That's a big deal for which we have to get the fix in quickly.
After that, we can follow up with more polishing (e.g. use a per-CRTC mutex, and review and potentially fix other parts of the code that deal with vblank and maybe assume global mutex protection).
I agree with Daniel's sentiment on this. AFAICT add_wait_queue() protects against concurrent access to the wait queue, so I think it would be better to just drop the mutex explicitly before calling DRM_WAIT_ON.
The problem is that if I release the mutex just before calling DRM_WAIT_ON, the task can be scheduled away right at that point. Then another task can go all the way into DRM_WAIT_ON and enter the queue. Then the first task wakes up and enters the queue.
Now the last task in the queue is *not* the task that updated the dev->last_vblank_wait[crtc] value last.
If you or someone who knows better than me can tell me that this potential reordering is guaranteed harmless, I'll gladly drop the mutex earlier and then I no longer need the DRM_WAIT_ON_LOCKED macro.
However, if preservation of the order is actually important, then it will be an ugly race condition to track down later.
-- Ilija