From: Nicolai Hähnle Nicolai.Haehnle@amd.com
Fix a race condition involving 4 threads and 2 ww_mutexes as indicated in the following example. Acquire context stamps are ordered like the thread numbers, i.e. thread #1 should back off when it encounters a mutex locked by thread #0 etc.
Thread #0    Thread #1    Thread #2    Thread #3
---------    ---------    ---------    ---------
                                       lock(ww)
                                       success
             lock(ww')
             success
                          lock(ww)
             lock(ww)        .
                .            .         unlock(ww) part 1
lock(ww)        .            .            .
success         .            .            .
                .            .         unlock(ww) part 2
                .         back off
lock(ww')       .
   .            .
(stuck)      (stuck)
Here, unlock(ww) part 1 is the part that sets lock->base.count to 1 (without being protected by lock->base.wait_lock), meaning that thread #0 can acquire ww in the fast path or, much more likely, the medium path in mutex_optimistic_spin. Since lock->base.count == 0 again after that acquire, thread #0 then won't wake up any of the waiters in ww_mutex_set_context_fastpath.
Then, unlock(ww) part 2 wakes up _only_ the first waiter of ww. This is thread #2, since waiters are added at the tail. Thread #2 wakes up and backs off since it sees ww owned by a context with a lower stamp.
Meanwhile, thread #1 is never woken up, and so it won't back off its lock on ww'. So thread #0 gets stuck waiting for ww' to be released.
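To restate the back-off rule that drives this example: every acquire context carries a stamp, and a waiter backs off only when the lock is held by a context with a lower (older) stamp. The following is a minimal, self-contained userspace model of just that comparison, purely for illustration; the kernel's real check additionally handles stamp wrap-around and tie-breaking between equal stamps.

#include <stdbool.h>
#include <stdio.h>

struct acquire_ctx {
	unsigned long stamp;	/* lower stamp == older context == wins conflicts */
};

/* A waiter must back off if the lock is currently held by an older context. */
static bool must_back_off(const struct acquire_ctx *waiter,
			  const struct acquire_ctx *holder)
{
	return holder && holder->stamp < waiter->stamp;
}

int main(void)
{
	struct acquire_ctx t0 = { .stamp = 0 }, t1 = { .stamp = 1 }, t2 = { .stamp = 2 };

	/* Thread #2 wakes up and finds ww owned by thread #0's context: back off. */
	printf("%d\n", must_back_off(&t2, &t0));	/* prints 1 */
	/* Thread #0 waits on ww' held by thread #1 (younger context): keep waiting. */
	printf("%d\n", must_back_off(&t0, &t1));	/* prints 0 */
	return 0;
}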
This patch fixes the deadlock by waking up all waiters in the slow path of ww_mutex_unlock.
We have an internal test case for amdgpu which continuously submits command streams from tens of threads, where all command streams reference hundreds of GPU buffer objects with a lot of overlap in the buffer lists between command streams. This test reliably caused a deadlock, and while I haven't completely confirmed that it is exactly the scenario outlined above, this patch does fix the test case.
v2:
- use wake_q_add
- add additional explanations
Cc: Peter Zijlstra peterz@infradead.org
Cc: Ingo Molnar mingo@redhat.com
Cc: Chris Wilson chris@chris-wilson.co.uk
Cc: Maarten Lankhorst maarten.lankhorst@canonical.com
Cc: dri-devel@lists.freedesktop.org
Cc: stable@vger.kernel.org
Reviewed-by: Christian König christian.koenig@amd.com (v1)
Signed-off-by: Nicolai Hähnle nicolai.haehnle@amd.com
---
 kernel/locking/mutex.c | 33 +++++++++++++++++++++++++++++----
 1 file changed, 29 insertions(+), 4 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..7fbf9b4 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -409,6 +409,9 @@ static bool mutex_optimistic_spin(struct mutex *lock,
 __visible __used noinline
 void __sched __mutex_unlock_slowpath(atomic_t *lock_count);
 
+static __used noinline
+void __sched __mutex_unlock_slowpath_wakeall(atomic_t *lock_count);
+
 /**
  * mutex_unlock - release the mutex
  * @lock: the mutex to be released
@@ -473,7 +476,14 @@ void __sched ww_mutex_unlock(struct ww_mutex *lock)
 	 */
 	mutex_clear_owner(&lock->base);
 #endif
-	__mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath);
+	/*
+	 * A previously _not_ waiting task may acquire the lock via the fast
+	 * path during our unlock. In that case, already waiting tasks may have
+	 * to back off to avoid a deadlock. Wake up all waiters so that they
+	 * can check their acquire context stamp against the new owner.
+	 */
+	__mutex_fastpath_unlock(&lock->base.count,
+				__mutex_unlock_slowpath_wakeall);
 }
 EXPORT_SYMBOL(ww_mutex_unlock);
 
@@ -716,7 +726,7 @@ EXPORT_SYMBOL_GPL(__ww_mutex_lock_interruptible);
  * Release the lock, slowpath:
  */
 static inline void
-__mutex_unlock_common_slowpath(struct mutex *lock, int nested)
+__mutex_unlock_common_slowpath(struct mutex *lock, int nested, int wake_all)
 {
 	unsigned long flags;
 	WAKE_Q(wake_q);
@@ -740,7 +750,14 @@ __mutex_unlock_common_slowpath(struct mutex *lock, int nested)
 	mutex_release(&lock->dep_map, nested, _RET_IP_);
 	debug_mutex_unlock(lock);
 
-	if (!list_empty(&lock->wait_list)) {
+	if (wake_all) {
+		struct mutex_waiter *waiter;
+
+		list_for_each_entry(waiter, &lock->wait_list, list) {
+			debug_mutex_wake_waiter(lock, waiter);
+			wake_q_add(&wake_q, waiter->task);
+		}
+	} else if (!list_empty(&lock->wait_list)) {
 		/* get the first entry from the wait-list: */
 		struct mutex_waiter *waiter =
 				list_entry(lock->wait_list.next,
@@ -762,7 +779,15 @@ __mutex_unlock_slowpath(atomic_t *lock_count)
 {
 	struct mutex *lock = container_of(lock_count, struct mutex, count);
 
-	__mutex_unlock_common_slowpath(lock, 1);
+	__mutex_unlock_common_slowpath(lock, 1, 0);
+}
+
+static void
+__mutex_unlock_slowpath_wakeall(atomic_t *lock_count)
+{
+	struct mutex *lock = container_of(lock_count, struct mutex, count);
+
+	__mutex_unlock_common_slowpath(lock, 1, 1);
 }
 
 #ifndef CONFIG_DEBUG_LOCK_ALLOC
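As background for the wake_q usage in the patch above: wake_q_add() only queues a task on the on-stack wake_q while wait_lock is held, and the actual wakeups are issued by wake_up_q() after the spinlock has been dropped, so only the list walk itself runs under wait_lock. The general pattern (a sketch of the existing wake_q API as used in __mutex_unlock_common_slowpath, not part of this patch) looks like:

	unsigned long flags;
	WAKE_Q(wake_q);				/* on-stack list of tasks to wake */

	spin_lock_irqsave(&lock->wait_lock, flags);
	/* ... pick the waiter(s) to wake while holding wait_lock ... */
	wake_q_add(&wake_q, waiter->task);
	spin_unlock_irqrestore(&lock->wait_lock, flags);

	wake_up_q(&wake_q);			/* actual wakeups, outside the spinlock */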
From: Nicolai Hähnle Nicolai.Haehnle@amd.com
When ww_mutex_set_context_slowpath runs, we are in one of two situations:
1. The current task was woken up by ww_mutex_unlock.
2. The current task is racing with ww_mutex_unlock: We entered the slow path while lock->base.count <= 0, but skipped the wait in __mutex_lock_common.
In both cases, all tasks that are currently contending for the lock are either racing as in point 2 and blocked on lock->wait_lock, or they have been or will be woken up by ww_mutex_unlock.
Either way, they will check their stamp against ours without having to be woken up again.
This improvement is possible only after the changed behavior of ww_mutex_unlock from the previous patch ("locking/ww_mutex: Fix a deadlock affecting ww_mutexes").
Cc: Peter Zijlstra peterz@infradead.org
Cc: Ingo Molnar mingo@redhat.com
Cc: Chris Wilson chris@chris-wilson.co.uk
Cc: Maarten Lankhorst maarten.lankhorst@canonical.com
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Nicolai Hähnle nicolai.haehnle@amd.com
---
 kernel/locking/mutex.c | 17 ++++------------
 1 file changed, 4 insertions(+), 13 deletions(-)
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 7fbf9b4..7c09376 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -192,8 +192,10 @@ ww_mutex_set_context_fastpath(struct ww_mutex *lock,
 }
 
 /*
- * After acquiring lock in the slowpath set ctx and wake up any
- * waiters so they can recheck.
+ * After acquiring lock in the slowpath set ctx.
+ *
+ * Unlike the fast path, other waiters are already woken up by ww_mutex_unlock,
+ * so we don't have to do it again here.
  *
  * Callers must hold the mutex wait_lock.
  */
@@ -201,19 +203,8 @@ static __always_inline void
 ww_mutex_set_context_slowpath(struct ww_mutex *lock,
 			      struct ww_acquire_ctx *ctx)
 {
-	struct mutex_waiter *cur;
-
 	ww_mutex_lock_acquired(lock, ctx);
 	lock->ctx = ctx;
-
-	/*
-	 * Give any possible sleeping processes the chance to wake up,
-	 * so they can recheck if they have to back off.
-	 */
-	list_for_each_entry(cur, &lock->base.wait_list, list) {
-		debug_mutex_wake_waiter(&lock->base, cur);
-		wake_up_process(cur->task);
-	}
 }
 
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
From: Nicolai Hähnle Nicolai.Haehnle@amd.com
Cc: Peter Zijlstra peterz@infradead.org
Cc: Ingo Molnar mingo@redhat.com
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Nicolai Hähnle nicolai.haehnle@amd.com
---
 Documentation/locking/00-INDEX | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/Documentation/locking/00-INDEX b/Documentation/locking/00-INDEX
index c256c9b..adc1d289 100644
--- a/Documentation/locking/00-INDEX
+++ b/Documentation/locking/00-INDEX
@@ -13,4 +13,4 @@ rt-mutex.txt
 spinlocks.txt
 	- info on using spinlocks to provide exclusive access in kernel.
 ww-mutex-design.txt
-	- Intro to Mutex wait/would deadlock handling.s
+	- Intro to Mutex wait/wound deadlock handling.
From: Nicolai Hähnle Nicolai.Haehnle@amd.com
Cc: Peter Zijlstra peterz@infradead.org
Cc: Ingo Molnar mingo@redhat.com
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: Nicolai Hähnle nicolai.haehnle@amd.com
---
 include/linux/ww_mutex.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/ww_mutex.h b/include/linux/ww_mutex.h
index 760399a..fd93cd3 100644
--- a/include/linux/ww_mutex.h
+++ b/include/linux/ww_mutex.h
@@ -97,7 +97,7 @@ static inline void ww_mutex_init(struct ww_mutex *lock,
  * @ctx: w/w acquire context to initialize
  * @ww_class: w/w class of the context
  *
- * Initializes an context to acquire multiple mutexes of the given w/w class.
+ * Initializes a context to acquire multiple mutexes of the given w/w class.
  *
  * Context-based w/w mutex acquiring can be done in any order whatsoever within
  * a given lock class. Deadlocks will be detected and handled with the
On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
[...]
Yeah, when the owning ctx changes we need to wake up all waiters, to make sure we catch all (new) deadlock scenarios. And I tried poking at your example, and I think it's solid and can't be minimized any further. I don't have much clue on mutex.c code itself, but the changes seem reasonable. With that caveat:
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
Cheers, Daniel
On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
[...]
Completely and utterly fails to apply; I think this patch is based on code prior to the mutex rewrite.
Please rebase on tip/locking/core.
Also, is this a regression, or has this been a 'feature' of the ww_mutex code from early on?
On Wed, Nov 23, 2016 at 02:00:46PM +0100, Peter Zijlstra wrote:
On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
[...]
Completely and utterly fails to apply; I think this patch is based on code prior to the mutex rewrite.
Please rebase on tip/locking/core.
Also, is this a regression, or has this been a 'feature' of the ww_mutex code from early on?
Sorry forgot to mention that, but I checked. Seems to have been broken since day 1, at least looking at the original code the wake-single-waiter stuff is as old as the mutex code added in 2006. -Daniel
On Wed, Nov 23, 2016 at 02:08:48PM +0100, Daniel Vetter wrote:
On Wed, Nov 23, 2016 at 02:00:46PM +0100, Peter Zijlstra wrote:
On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
[...]
Completely and utterly fails to apply; I think this patch is based on code prior to the mutex rewrite.
Please rebase on tip/locking/core.
Also, is this a regression, or has this been a 'feature' of the ww_mutex code from early on?
Sorry forgot to mention that, but I checked. Seems to have been broken since day 1, at least looking at the original code the wake-single-waiter stuff is as old as the mutex code added in 2006.
More details: For gpu drivers this was originally working, since the ww_mutex implementation in ttm did use wake_up_all. So need to add a
Fixes: 5e338405119a ("drm/ttm: convert to the reservation api")
On 23-11-16 at 14:11, Daniel Vetter wrote:
On Wed, Nov 23, 2016 at 02:08:48PM +0100, Daniel Vetter wrote:
On Wed, Nov 23, 2016 at 02:00:46PM +0100, Peter Zijlstra wrote:
On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
[...]
Completely and utterly fails to apply; I think this patch is based on code prior to the mutex rewrite.
Please rebase on tip/locking/core.
Also, is this a regression, or has this been a 'feature' of the ww_mutex code from early on?
Sorry forgot to mention that, but I checked. Seems to have been broken since day 1, at least looking at the original code the wake-single-waiter stuff is as old as the mutex code added in 2006.
More details: For gpu drivers this was originally working, since the ww_mutex implementation in ttm did use wake_up_all. So need to add a
Fixes: 5e338405119a ("drm/ttm: convert to the reservation api")
But it's broken in the original kernel ww_mutex implementation; I guess it doesn't matter much since ttm was the first in-kernel user anyway. :)
~Maarten
On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
@@ -473,7 +476,14 @@ void __sched ww_mutex_unlock(struct ww_mutex *lock)
 	 */
 	mutex_clear_owner(&lock->base);
 #endif
-	__mutex_fastpath_unlock(&lock->base.count, __mutex_unlock_slowpath);
+	/*
+	 * A previously _not_ waiting task may acquire the lock via the fast
+	 * path during our unlock. In that case, already waiting tasks may have
+	 * to back off to avoid a deadlock. Wake up all waiters so that they
+	 * can check their acquire context stamp against the new owner.
+	 */
+	__mutex_fastpath_unlock(&lock->base.count,
+				__mutex_unlock_slowpath_wakeall);
 }
So doing a wake-all has obvious issues with thundering herd etc.. Also, with the new mutex, you'd not be able to do hand-off, which would introduce starvation cases.
Ideally we'd iterate the blocked list and pick the waiter with the earliest stamp, or we'd maintain the list in stamp order instead of FIFO, for ww_mutex.
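A toy userspace sketch of what stamp-ordered queueing could look like (types and names here are invented for illustration; the real implementation would operate on struct mutex_waiter and struct list_head inside __mutex_lock_common, none of which is shown):

#include <stddef.h>

struct ww_waiter {
	unsigned long stamp;		/* lower == older == should win */
	struct ww_waiter *next;
};

/* Keep the wait list sorted by stamp, oldest first, so the unlocker can
 * wake (or hand off to) the head instead of waking everybody. */
static void waiter_add_sorted(struct ww_waiter **head, struct ww_waiter *w)
{
	struct ww_waiter **pos = head;

	while (*pos && (*pos)->stamp < w->stamp)
		pos = &(*pos)->next;
	w->next = *pos;
	*pos = w;
}

int main(void)
{
	struct ww_waiter a = { .stamp = 3 }, b = { .stamp = 1 }, c = { .stamp = 2 };
	struct ww_waiter *head = NULL;

	waiter_add_sorted(&head, &a);
	waiter_add_sorted(&head, &b);
	waiter_add_sorted(&head, &c);

	/* head is now the stamp-1 waiter; list order is 1, 2, 3. */
	return head->stamp == 1 ? 0 : 1;
}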
On Wed, Nov 23, 2016 at 03:03:36PM +0100, Peter Zijlstra wrote:
On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
[...]
So doing a wake-all has obvious issues with thundering herd etc.. Also, with the new mutex, you'd not be able to do hand-off, which would introduce starvation cases.
Ideally we'd iterate the blocked list and pick the waiter with the earliest stamp, or we'd maintain the list in stamp order instead of FIFO, for ww_mutex.
Not sure we'll win that much, at least I think we still need to wake up everyone with earlier stamp than the one of the task that just released the lock. Otherwise there's deadlocks. So just cuts the wakeups in half, on average.
What we could do is do a handoff-hint with the timestamp of earliest task we believe should get the lock. Everyone with a later timestamp that gets woken then knows that they definitely have a deadlock situation and need to back off (thread 2 in the example).
thread 1 would get woken, and would be able to take the lock, except when thread 0 successfully raced it and stole the lock. And anyone else racing in with later timestamps would also immediately back off, ensuring fairness.
Without thinking it through in detail this is a PI issue, except that we replace boosting with wakeup&back-off. Could we perhaps steal something from rt mutexes to make it fair&efficient? -Daniel
On Wed, Nov 23, 2016 at 03:25:25PM +0100, Daniel Vetter wrote:
Without thinking it through in detail this is a PI issue, except that we replace boosting with wakeup&back-off. Could we perhaps steal something from rt mutexes to make it fair&efficient?
rt_mutexes order the waiters by 'priority' and wake the topmost one.
On 23.11.2016 15:25, Daniel Vetter wrote:
On Wed, Nov 23, 2016 at 03:03:36PM +0100, Peter Zijlstra wrote:
On Wed, Nov 23, 2016 at 12:25:22PM +0100, Nicolai Hähnle wrote:
[...]
So doing a wake-all has obvious issues with thundering herd etc.. Also, with the new mutex, you'd not be able to do hand-off, which would introduce starvation cases.
Ideally we'd iterate the blocked list and pick the waiter with the earliest stamp, or we'd maintain the list in stamp order instead of FIFO, for ww_mutex.
Not sure we'll win that much, at least I think we still need to wake up everyone with earlier stamp than the one of the task that just released the lock. Otherwise there's deadlocks. So just cuts the wakeups in half, on average.
What we could do is do a handoff-hint with the timestamp of earliest task we believe should get the lock. Everyone with a later timestamp that gets woken then knows that they definitely have a deadlock situation and need to back off (thread 2 in the example).
thread 1 would get woken, and would be able to take the lock, except when thread 0 successfully raced it and stole the lock. And anyone else racing in with later timestamps would also immediately back off, ensuring fairness.
I did consider maintaining stamp order in the waiter list and originally decided against it because I just wanted a simple and conservative fix to avoid adding new regressions.
Now that a different approach is needed for >= 4.9 anyway mostly due to the hand-off logic, I'm reconsidering this.
I do believe we can win a bit by keeping the wait list sorted, if we also make sure that waiters don't add themselves in the first place if they see that a deadlock situation cannot be avoided.
I will probably want to extend struct mutex_waiter with ww_mutex-specific fields to facilitate this (i.e. ctx pointer, perhaps stamp as well to reduce pointer-chasing). That should be fine since it lives on the stack.
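Purely to illustrate that idea (field names and stand-in types are hypothetical, not from any posted patch), the on-stack waiter could carry the ww-specific data alongside the existing list/task fields:

/* Self-contained sketch with stand-in types so it compiles outside the kernel. */
struct list_head { struct list_head *next, *prev; };
struct task_struct;
struct ww_acquire_ctx { unsigned long stamp; };

struct mutex_waiter {
	struct list_head	list;
	struct task_struct	*task;
	/* Hypothetical ww_mutex-specific additions: */
	struct ww_acquire_ctx	*ww_ctx;	/* NULL for ordinary mutex waiters */
	unsigned long		stamp;		/* cached ww_ctx->stamp, avoids pointer-chasing */
};

int main(void)
{
	struct ww_acquire_ctx ctx = { .stamp = 42 };
	struct mutex_waiter w = { .ww_ctx = &ctx, .stamp = ctx.stamp };

	return w.ww_ctx->stamp != w.stamp;	/* 0: cached stamp agrees with ctx */
}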
In the meantime, I'd appreciate it if patch #1 could be accepted as-is for stable updates to <= 4.8. It fixes a real (if rare) bug, and the stampede inefficiency isn't a problem in practice at least for GPU applications.
Thanks, Nicolai
On Thu, Nov 24, 2016 at 12:26:57PM +0100, Nicolai Hähnle wrote:
I do believe we can win a bit by keeping the wait list sorted, if we also make sure that waiters don't add themselves in the first place if they see that a deadlock situation cannot be avoided.
I will probably want to extend struct mutex_waiter with ww_mutex-specific fields to facilitate this (i.e. ctx pointer, perhaps stamp as well to reduce pointer-chasing). That should be fine since it lives on the stack.
Right, shouldn't be a problem I think.
The only 'problem' I can see with using that is that its possible to mix ww and !ww waiters through ww_mutex_lock(.ctx = NULL). This makes the list order somewhat tricky.
Ideally we'd remove that feature, although I see its actually used quite a bit :/
In the meantime, I'd appreciate it if patch #1 could be accepted as-is for stable updates to <= 4.8. It fixes a real (if rare) bug, and the stampede inefficiency isn't a problem in practice at least for GPU applications.
Sorry can't do. We don't do stable patches that don't have anything upstream.
On Thu, Nov 24, 2016 at 12:40 PM, Peter Zijlstra peterz@infradead.org wrote:
I do believe we can win a bit by keeping the wait list sorted, if we also make sure that waiters don't add themselves in the first place if they see that a deadlock situation cannot be avoided.
I will probably want to extend struct mutex_waiter with ww_mutex-specific fields to facilitate this (i.e. ctx pointer, perhaps stamp as well to reduce pointer-chasing). That should be fine since it lives on the stack.
Right, shouldn't be a problem I think.
The only 'problem' I can see with using that is that its possible to mix ww and !ww waiters through ww_mutex_lock(.ctx = NULL). This makes the list order somewhat tricky.
Ideally we'd remove that feature, although I see its actually used quite a bit :/
I guess we could create a small fake acquire_ctx for single-lock paths. That way callers still don't need to deal with having an explicit ctx, but we can assume the timestamp (for ensuring fairness) is available for all cases. Otherwise there's indeed a problem with correctly (well fairly) interleaving ctx and non-ctx lockers I think. -Daniel
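For readers following along, the two calling styles under discussion look roughly like this (a sketch only: my_class, obj_a and obj_b are made-up names, and the -EDEADLK back-off/retry loop needed in the multi-lock case is omitted):

	/* Single lock, no acquire context -- the ".ctx = NULL" style: */
	ww_mutex_lock(&obj_a->lock, NULL);
	/* ... */
	ww_mutex_unlock(&obj_a->lock);

	/* Multiple locks of one ww_class, with a context supplying the stamp: */
	struct ww_acquire_ctx ctx;

	ww_acquire_init(&ctx, &my_class);
	ww_mutex_lock(&obj_a->lock, &ctx);
	ww_mutex_lock(&obj_b->lock, &ctx);	/* may return -EDEADLK -> back off */
	ww_acquire_done(&ctx);
	/* ... */
	ww_mutex_unlock(&obj_b->lock);
	ww_mutex_unlock(&obj_a->lock);
	ww_acquire_fini(&ctx);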
On Thu, Nov 24, 2016 at 12:52:25PM +0100, Daniel Vetter wrote:
On Thu, Nov 24, 2016 at 12:40 PM, Peter Zijlstra peterz@infradead.org wrote:
I do believe we can win a bit by keeping the wait list sorted, if we also make sure that waiters don't add themselves in the first place if they see that a deadlock situation cannot be avoided.
I will probably want to extend struct mutex_waiter with ww_mutex-specific fields to facilitate this (i.e. ctx pointer, perhaps stamp as well to reduce pointer-chasing). That should be fine since it lives on the stack.
Right, shouldn't be a problem I think.
The only 'problem' I can see with using that is that its possible to mix ww and !ww waiters through ww_mutex_lock(.ctx = NULL). This makes the list order somewhat tricky.
Ideally we'd remove that feature, although I see its actually used quite a bit :/
I guess we could create a small fake acquire_ctx for single-lock paths. That way callers still don't need to deal with having an explicit ctx, but we can assume the timestamp (for ensuring fairness) is available for all cases. Otherwise there's indeed a problem with correctly (well fairly) interleaving ctx and non-ctx lockers I think.
Actually tried that, but we need a ww_class to get a stamp from, and ww_mutex_lock() doesn't have one of those..
On 24.11.2016 12:56, Peter Zijlstra wrote:
On Thu, Nov 24, 2016 at 12:52:25PM +0100, Daniel Vetter wrote:
On Thu, Nov 24, 2016 at 12:40 PM, Peter Zijlstra peterz@infradead.org wrote:
I do believe we can win a bit by keeping the wait list sorted, if we also make sure that waiters don't add themselves in the first place if they see that a deadlock situation cannot be avoided.
I will probably want to extend struct mutex_waiter with ww_mutex-specific fields to facilitate this (i.e. ctx pointer, perhaps stamp as well to reduce pointer-chasing). That should be fine since it lives on the stack.
Right, shouldn't be a problem I think.
The only 'problem' I can see with using that is that its possible to mix ww and !ww waiters through ww_mutex_lock(.ctx = NULL). This makes the list order somewhat tricky.
Ideally we'd remove that feature, although I see its actually used quite a bit :/
I guess we could create a small fake acquire_ctx for single-lock paths. That way callers still don't need to deal with having an explicit ctx, but we can assume the timestamp (for ensuring fairness) is available for all cases. Otherwise there's indeed a problem with correctly (well fairly) interleaving ctx and non-ctx lockers I think.
Actually tried that, but we need a ww_class to get a stamp from, and ww_mutex_lock() doesn't have one of those..
The acquire context needs to be live until the unlock anyway, so this is something that requires modifying the callers of ww_mutex_lock. Those should all have a ww_class available, or something is very wrong :)
Nicolai