https://bugs.freedesktop.org/show_bug.cgi?id=100712
--- Comment #5 from Julien Isorce <julien.isorce@gmail.com> ---
(In reply to Michel Dänzer from comment #4)
(In reply to Julien Isorce from comment #0)
In kernel radeon_object.c::radeon_bo_list_validate, once "bytes_moved > bytes_moved_threshold" is reached (this is the case for 850 bo in the same list_for_each_entry loop), I can see that radeon_ib_schedule emits a fence that takes more than radeon.lockup_timeout to be signaled.
radeon_ib_schedule is called for submitting the command stream from userspace, not for any BO moves directly, right?
How did you determine that this hang is directly related to bytes_moved / bytes_moved_threshold? Maybe it's only indirectly related, e.g. due to the threshold preventing a BO from being moved to VRAM despite userspace's preference.
I added a trace, and the fence that is not signaled in time is always the one emitted by radeon_ib_schedule after the bytes_moved_threshold is reached. But you are right, it could be only indirectly related.
Here is the sequence I have:
ioctl_radeon_cs
  radeon_bo_list_validate
    bytes_moved > bytes_moved_threshold (=1024*1024ull)
    800 bo are not moved from gtt to vram because of that.
  radeon_cs_ib_vm_chunk
    radeon_ib_schedule(rdev, &parser->ib, NULL, true);
      radeon_fence_emit on ring 0
      r600_mmio_hdp_flush
/ioctl_radeon_cs
Then anything calling ttm_bo_wait will block for more than radeon.lockup_timeout because the above fence is not signaled in time. Could it be that something is not flushed properly? (ref: https://patchwork.kernel.org/patch/5807141/ ? tlb_flush ?)
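For reference, the part of radeon_bo_list_validate I am referring to looks roughly like this (a simplified sketch from memory, not a verbatim copy; field names and details can differ between kernel versions):

    bytes_moved_threshold = radeon_bo_get_threshold_for_moves(rdev);

    list_for_each_entry(lobj, head, tv.head) {
        struct radeon_bo *bo = lobj->robj;
        u32 domain = lobj->prefered_domains;
        u32 current_domain =
            radeon_mem_type_to_domain(bo->tbo.mem.mem_type);

        /* Once too many bytes have already been moved for this IB, stop
         * migrating bo and validate them in their current domain instead;
         * this is what keeps the ~800 bo above in gtt. */
        if ((lobj->allowed_domains & current_domain) != 0 &&
            (domain & current_domain) == 0 &&
            bytes_moved > bytes_moved_threshold)
            domain = current_domain;

        /* ... ttm_bo_validate() and the bytes_moved accounting follow ... */
    }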
Are you saying that some bos are required to be moved from gtt to vram in order for this fence to be signaled?
As you can see above, it happens when vram_usage >= half_vram, so radeon_bo_get_threshold_for_moves returns 1024*1024, which explains why only 1 or 2 bos can be moved from gtt to vram in that case and why all the others are forced to stay in gtt.
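The heuristic itself is roughly this (my paraphrase of radeon_bo_get_threshold_for_moves in radeon_object.c, so take the exact details with a grain of salt): allow moving about half of the free half of vram per IB, with a 1 MB floor:

    static u64 threshold_for_moves(u64 real_vram_size, u64 vram_usage)
    {
        u64 half_vram = real_vram_size >> 1;
        /* nothing left of the "free half" once usage passes half of vram */
        u64 half_free_vram = vram_usage >= half_vram ? 0 : half_vram - vram_usage;
        u64 bytes_moved_threshold = half_free_vram >> 1;

        /* never go below 1 MB, so at least something can still be moved */
        return max(bytes_moved_threshold, 1024*1024ull);
    }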
In the same run of radeon_bo_list_validate there are many calls to ttm_bo_validate with both domain and current_domain being VRAM; this is the case for around 400 bo. Maybe this causes a delay in signaling this fence, given that vram usage is high too.
Also it seems the fence is signaled by swapper after more than 10 seconds, but it is too late. It requires reducing the "15" param above to 4 to see that.
How does "swapper" (what is that exactly?) signal the fence?
My wording was wrong, sorry; I should have said "the first entity noticing that the fence is signaled" by calling radeon_fence_activity. swapper is the name of process 0 (idle). I changed the drm logging to print the process name and id (current->comm, current->pid).
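Something along these lines (the exact message and the place I added it are just an example; the interesting bits are current->comm and current->pid, which come from <linux/sched.h>):

    /* example trace in radeon_fence_activity(): show which process noticed
     * the fence activity first */
    DRM_INFO("radeon fence: ring %d activity seen by %s (pid %d)\n",
             ring, current->comm, current->pid);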
It might be worth looking into why this happens, though. If domain == current_domain == RADEON_GEM_DOMAIN_VRAM, I wouldn't expect ttm_bo_validate to trigger a blit.
I will check, though I think I just got confused by a previous trace.