This is an RFC because I'm still trying to grok the correct behavior.
Consider a dma_fence_array created with two fences and signal_on_any set to true. A reference to the dma_fence_array is taken for each waiting fence.
When the client calls dma_fence_wait(), only one of the fences is signaled. The client returns successfully from the wait and puts its reference to the array fence, but the array fence still remains because of the remaining unsignaled fence.
Now consider that the unsignaled fence is signaled much later, while the timeline is being destroyed. The timeline destroy calls dma_fence_signal_locked(). The following sequence occurs:
1) dma_fence_array_cb_func is called
2) array->num_pending is already 0 (it was set to 1 due to signal_on_any and consumed when the first fence signaled), so the callback function calls dma_fence_put() instead of triggering the irq work
3) The array fence is released, which in turn puts the lingering fence, which is then released
4) deadlock with the timeline (sketched below)
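To make the recursion concrete, here is a minimal sketch of that chain. The timeline type and destroy function are hypothetical stand-ins for whatever driver owns the fence lock; only the dma_fence_*() and spinlock calls are real kernel API:

	struct my_timeline {
		spinlock_t lock;	/* doubles as the fence lock for its points */
	};

	/* hypothetical destroy path */
	static void my_timeline_destroy(struct my_timeline *tl,
					struct dma_fence *lingering)
	{
		spin_lock_irq(&tl->lock);		/* (a) timeline lock taken */
		dma_fence_signal_locked(lingering);	/* (b) runs dma_fence_array_cb_func() */
		/*
		 * (c) num_pending is already 0, so the callback takes the
		 *     dma_fence_put(&array->base) branch instead of queuing
		 *     the irq work.
		 * (d) That was the last array reference: releasing the array
		 *     puts the lingering sub-fence, whose ->release tries to
		 *     take tl->lock again and deadlocks against (a).
		 */
		spin_unlock_irq(&tl->lock);		/* never reached */
	}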
I think that we can fix this with the attached patch. Once the fence is signaled, signaling it again in the irq worker shouldn't hurt anything. The only gotcha might be how the error is propagated - I wasn't quite sure of the intent of clearing it only after getting to the irq worker.
Signed-off-by: Jordan Crouse jcrouse@codeaurora.org
---
 drivers/dma-buf/dma-fence-array.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/dma-buf/dma-fence-array.c b/drivers/dma-buf/dma-fence-array.c
index d3fbd950be94..b8829b024255 100644
--- a/drivers/dma-buf/dma-fence-array.c
+++ b/drivers/dma-buf/dma-fence-array.c
@@ -46,8 +46,6 @@ static void irq_dma_fence_array_work(struct irq_work *wrk)
 {
 	struct dma_fence_array *array = container_of(wrk, typeof(*array), work);
 
-	dma_fence_array_clear_pending_error(array);
-
 	dma_fence_signal(&array->base);
 	dma_fence_put(&array->base);
 }
@@ -61,10 +59,10 @@ static void dma_fence_array_cb_func(struct dma_fence *f,
 
 	dma_fence_array_set_pending_error(array, f->error);
 
-	if (atomic_dec_and_test(&array->num_pending))
-		irq_work_queue(&array->work);
-	else
-		dma_fence_put(&array->base);
+	if (!atomic_dec_and_test(&array->num_pending))
+		dma_fence_array_set_pending_error(array, f->error);
+
+	irq_work_queue(&array->work);
 }
 
 static bool dma_fence_array_enable_signaling(struct dma_fence *fence)
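For reference, with the hunks above applied dma_fence_array_cb_func() would read roughly as follows. The lines around the hunk are reconstructed from the file version the diff is against, so treat this as a best-effort rendering rather than quoted source:

	static void dma_fence_array_cb_func(struct dma_fence *f,
					    struct dma_fence_cb *cb)
	{
		struct dma_fence_array_cb *array_cb =
			container_of(cb, struct dma_fence_array_cb, cb);
		struct dma_fence_array *array = array_cb->array;

		dma_fence_array_set_pending_error(array, f->error);

		/* Every sub-fence callback now queues the worker; per the
		 * rationale above, redundant dma_fence_signal() calls on an
		 * already-signaled array are expected to be harmless. */
		if (!atomic_dec_and_test(&array->num_pending))
			dma_fence_array_set_pending_error(array, f->error);

		irq_work_queue(&array->work);
	}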
Quoting Jordan Crouse (2020-08-13 00:55:44)
> This is an RFC because I'm still trying to grok the correct behavior.
>
> Consider a dma_fence_array created with two fences and signal_on_any set to true. A reference to the dma_fence_array is taken for each waiting fence.
>
> When the client calls dma_fence_wait(), only one of the fences is signaled. The client returns successfully from the wait and puts its reference to the array fence, but the array fence still remains because of the remaining unsignaled fence.
>
> Now consider that the unsignaled fence is signaled much later, while the timeline is being destroyed. The timeline destroy calls dma_fence_signal_locked(). The following sequence occurs:
>
> 1) dma_fence_array_cb_func is called
> 2) array->num_pending is already 0 (it was set to 1 due to signal_on_any and consumed when the first fence signaled), so the callback function calls dma_fence_put() instead of triggering the irq work
> 3) The array fence is released, which in turn puts the lingering fence, which is then released
> 4) deadlock with the timeline
It's the same recursive lock as we previously resolved in sw_sync.c by removing the locking from timeline_fence_release().
-Chris
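For anyone who doesn't remember that fix, its general shape, sketched from memory with hypothetical names (my_timeline, my_pt), is: unlink the point while signaling, when the timeline lock is already held, so that ->release no longer needs the lock and can safely run from a dma_fence_put() inside the signal path.

	struct my_timeline {
		spinlock_t lock;
		struct list_head active;
	};

	struct my_pt {
		struct dma_fence base;
		struct list_head link;
		struct my_timeline *tl;
	};

	static void my_timeline_signal(struct my_timeline *tl, struct my_pt *pt)
	{
		spin_lock_irq(&tl->lock);
		list_del_init(&pt->link);		/* unlink under the lock */
		dma_fence_signal_locked(&pt->base);	/* may drop the last ref */
		spin_unlock_irq(&tl->lock);
	}

	static void my_pt_release(struct dma_fence *fence)
	{
		struct my_pt *pt = container_of(fence, struct my_pt, base);

		/* A signaled point was already unlinked above, so the path
		 * that recursed into release takes no lock at all; only a
		 * point that never signaled still has to unlink itself. */
		if (!list_empty(&pt->link)) {
			spin_lock_irq(&pt->tl->lock);
			list_del(&pt->link);
			spin_unlock_irq(&pt->tl->lock);
		}
		dma_fence_free(&pt->base);
	}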
On Thu, Aug 13, 2020 at 07:49:24AM +0100, Chris Wilson wrote:
> Quoting Jordan Crouse (2020-08-13 00:55:44)
> > This is an RFC because I'm still trying to grok the correct behavior.
> >
> > Consider a dma_fence_array created with two fences and signal_on_any set to true. A reference to the dma_fence_array is taken for each waiting fence.
> >
> > When the client calls dma_fence_wait(), only one of the fences is signaled. The client returns successfully from the wait and puts its reference to the array fence, but the array fence still remains because of the remaining unsignaled fence.
> >
> > Now consider that the unsignaled fence is signaled much later, while the timeline is being destroyed. The timeline destroy calls dma_fence_signal_locked(). The following sequence occurs:
> >
> > 1) dma_fence_array_cb_func is called
> > 2) array->num_pending is already 0 (it was set to 1 due to signal_on_any and consumed when the first fence signaled), so the callback function calls dma_fence_put() instead of triggering the irq work
> > 3) The array fence is released, which in turn puts the lingering fence, which is then released
> > 4) deadlock with the timeline
>
> It's the same recursive lock as we previously resolved in sw_sync.c by removing the locking from timeline_fence_release().
> -Chris
Ah, yep. I'm working on a not-quite-ready-for-primetime version of a Vulkan timeline implementation for drm/msm, and I was doing something similar to how sw_sync used to work in the release function. Getting rid of the recursive lock in the timeline seems a better solution than this. Thanks for taking the time to respond.
Jordan
On 13.08.20 at 01:55, Jordan Crouse wrote:
> This is an RFC because I'm still trying to grok the correct behavior.
>
> Consider a dma_fence_array created with two fences and signal_on_any set to true. A reference to the dma_fence_array is taken for each waiting fence.
Ok, it sounds like you are mixing a couple of things up here.
A dma_fence_array takes the reference to the fences it contains on creation. There is only one reference to the dma_fence_array even if it contains N unsignaled fences.
We do grab a reference to the array in dma_fence_array_enable_signaling(), but that is because we are registering the callbacks there.
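A simplified sketch of that rule, modeled loosely on dma_fence_array_enable_signaling() (error propagation is elided here, so don't read this as the verbatim upstream function):

	static bool dma_fence_array_enable_signaling(struct dma_fence *fence)
	{
		struct dma_fence_array *array = to_dma_fence_array(fence);
		struct dma_fence_array_cb *cb = (void *)(&array[1]);
		unsigned int i;

		for (i = 0; i < array->num_fences; ++i) {
			cb[i].array = array;
			/* The callback may fire long after every external
			 * reference is gone, so each registered callback
			 * owns its own reference on the array. */
			dma_fence_get(&array->base);
			if (dma_fence_add_callback(array->fences[i], &cb[i].cb,
						   dma_fence_array_cb_func)) {
				/* Already signaled: no callback will fire,
				 * drop the reference we just took. */
				dma_fence_put(&array->base);
				if (atomic_dec_and_test(&array->num_pending))
					return false;
			}
		}

		return true;
	}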
> When the client calls dma_fence_wait(), only one of the fences is signaled. The client returns successfully from the wait and puts its reference to the array fence, but the array fence still remains because of the remaining unsignaled fence.
If signaling was enabled then this is correct, because otherwise we would crash when the other callbacks are called.
> Now consider that the unsignaled fence is signaled much later, while the timeline is being destroyed. The timeline destroy calls dma_fence_signal_locked(). The following sequence occurs:
>
> 1) dma_fence_array_cb_func is called
> 2) array->num_pending is already 0 (it was set to 1 due to signal_on_any and consumed when the first fence signaled), so the callback function calls dma_fence_put() instead of triggering the irq work
> 3) The array fence is released, which in turn puts the lingering fence, which is then released
> 4) deadlock with the timeline
Why do we have a deadlock here? That doesn't seem to add up.
Christian.
> I think that we can fix this with the attached patch. Once the fence is signaled, signaling it again in the irq worker shouldn't hurt anything. The only gotcha might be how the error is propagated - I wasn't quite sure of the intent of clearing it only after getting to the irq worker.
>
> Signed-off-by: Jordan Crouse jcrouse@codeaurora.org
> ---
>  drivers/dma-buf/dma-fence-array.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/dma-buf/dma-fence-array.c b/drivers/dma-buf/dma-fence-array.c
> index d3fbd950be94..b8829b024255 100644
> --- a/drivers/dma-buf/dma-fence-array.c
> +++ b/drivers/dma-buf/dma-fence-array.c
> @@ -46,8 +46,6 @@ static void irq_dma_fence_array_work(struct irq_work *wrk)
>  {
>  	struct dma_fence_array *array = container_of(wrk, typeof(*array), work);
>  
> -	dma_fence_array_clear_pending_error(array);
> -
>  	dma_fence_signal(&array->base);
>  	dma_fence_put(&array->base);
>  }
> @@ -61,10 +59,10 @@ static void dma_fence_array_cb_func(struct dma_fence *f,
>  
>  	dma_fence_array_set_pending_error(array, f->error);
>  
> -	if (atomic_dec_and_test(&array->num_pending))
> -		irq_work_queue(&array->work);
> -	else
> -		dma_fence_put(&array->base);
> +	if (!atomic_dec_and_test(&array->num_pending))
> +		dma_fence_array_set_pending_error(array, f->error);
> +
> +	irq_work_queue(&array->work);
>  }
>  
>  static bool dma_fence_array_enable_signaling(struct dma_fence *fence)
Greetings,
FYI, we noticed the following commit (built with gcc-9):
commit: ee7499cf7d6b98a2caef0466f5dcbefdd25d49fe ("[RFC PATCH v1] dma-fence-array: Deal with sub-fences that are signaled late")
url: https://github.com/0day-ci/linux/commits/Jordan-Crouse/dma-fence-array-Deal-...
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 7c2a69f610e64c8dec6a06a66e721f4ce1dd783a
in testcase: igt with following parameters:
group: group_08
ucode: 0x21
on test machine: 4 threads Intel(R) Core(TM) i3-3220 CPU @ 3.30GHz with 8G memory
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot rong.a.chen@intel.com
user :info : [ 659.496362] [IGT] gem_exec_fence: starting subtest busy-hang-all
user :notice: [ 659.496971] Subtest basic-wait-all: SUCCESS (0.010s)
user :notice: [ 659.497873] Starting subtest: busy-hang-all
kern :info : [ 665.922688] i915 0000:00:02.0: [drm] GPU HANG: ecode 7:0:00000000
kern :info : [ 665.922794] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
kern :info : [ 665.922935] Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/intel/issues/new.
kern :info : [ 665.923079] Please see https://gitlab.freedesktop.org/drm/intel/-/wikis/How-to-file-i915-bugs for details.
kern :info : [ 665.923234] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
kern :info : [ 665.923380] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
kern :info : [ 665.923517] GPU crash dump saved to /sys/class/drm/card0/error
kern :info : [ 665.924218] i915 0000:00:02.0: [drm] GPU HANG: ecode 7:0:00000000
kern :notice: [ 665.924332] i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on rcs0
kern :info : [ 665.924586] i915 0000:00:02.0: [drm] GPU HANG: ecode 7:0:00000000
kern :notice: [ 665.924831] i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on vcs0
user :notice: [ 665.928599] (gem_exec_fence:3060) CRITICAL: Test assertion failure function test_fence_busy_all, file ../tests/i915/gem_exec_fence.c:322:
user :info : [ 665.929208] [IGT] gem_exec_fence: starting subtest wait-hang-all
user :notice: [ 665.930744] (gem_exec_fence:3060) CRITICAL: Failed assertion: !gem_bo_busy(fd, obj.handle)
user :notice: [ 665.931557] Subtest busy-hang-all failed.
user :notice: [ 665.932011] **** DEBUG ****
user :notice: [ 665.934219] (gem_exec_fence:3060) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
user :notice: [ 665.937368] (gem_exec_fence:3060) CRITICAL: Test assertion failure function test_fence_busy_all, file ../tests/i915/gem_exec_fence.c:322:
user :notice: [ 665.939399] (gem_exec_fence:3060) CRITICAL: Failed assertion: !gem_bo_busy(fd, obj.handle)
user :notice: [ 665.940675] (gem_exec_fence:3060) igt_core-INFO: Stack trace:
user :notice: [ 665.942717] (gem_exec_fence:3060) igt_core-INFO: #0 ../lib/igt_core.c:1727 __igt_fail_assert()
user :notice: [ 665.944446] (gem_exec_fence:3060) igt_core-INFO: #1 [test_fence_busy_all+0x57d]
user :notice: [ 665.946803] (gem_exec_fence:3060) igt_core-INFO: #2 ../tests/i915/gem_exec_fence.c:1614 __real_main1583()
user :notice: [ 665.948905] (gem_exec_fence:3060) igt_core-INFO: #3 ../tests/i915/gem_exec_fence.c:1583 main()
user :notice: [ 665.950565] (gem_exec_fence:3060) igt_core-INFO: #4 [__libc_start_main+0xeb]
user :notice: [ 665.951953] (gem_exec_fence:3060) igt_core-INFO: #5 [_start+0x2a]
user :notice: [ 665.952677] **** END ****
user :notice: [ 665.953117] Stack trace:
user :notice: [ 665.954297] #0 ../lib/igt_core.c:1727 __igt_fail_assert()
user :notice: [ 665.955185] #1 [test_fence_busy_all+0x57d]
user :notice: [ 665.956678] #2 ../tests/i915/gem_exec_fence.c:1614 __real_main1583()
user :notice: [ 665.957912] #3 ../tests/i915/gem_exec_fence.c:1583 main()
user :notice: [ 665.958725] #4 [__libc_start_main+0xeb]
user :notice: [ 665.959226] #5 [_start+0x2a]
user :notice: [ 665.960260] Subtest busy-hang-all: FAIL (6.433s)
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml  # job file is attached in this email
bin/lkp run job.yaml
Thanks, Rong Chen