https://bugs.freedesktop.org/show_bug.cgi?id=100712
--- Comment #5 from Julien Isorce <julien.isorce@gmail.com> ---
(In reply to Michel Dänzer from comment #4)
(In reply to Julien Isorce from comment #0)
In kernel radeon_object.c::radeon_bo_list_validate, once "bytes_moved > bytes_moved_threshold" is reached (this is the case for 850 bo in the same list_for_each_entry loop), I can see that radeon_ib_schedule emits a fence that takes more than radeon.lockup_timeout to be signaled.
radeon_ib_schedule is called for submitting the command stream from userspace, not for any BO moves directly, right?
How did you determine that this hang is directly related to bytes_moved / bytes_moved_threshold? Maybe it's only indirectly related, e.g. due to the threshold preventing a BO from being moved to VRAM despite userspace's preference.
I added a trace, and the fence that is not signaled in time is always the one emitted by radeon_ib_schedule after the bytes_moved_threshold is reached. But you are right, it could be only indirectly related.
Here is the sequence I have:
ioctl_radeon_cs
  radeon_bo_list_validate
    bytes_moved > bytes_moved_threshold (=1024*1024ull)
    800 bo are not moved from gtt to vram because of that.
  radeon_cs_ib_vm_chunk
    radeon_ib_schedule(rdev, &parser->ib, NULL, true);
      radeon_fence_emit on ring 0
      r600_mmio_hdp_flush
/ioctl_radeon_cs
Then anything calling ttm_bo_wait will block for more than radeon.lockup_timeout because the above fence is not signaled in time. Could it be that something is not flushed properly? (ref: https://patchwork.kernel.org/patch/5807141/ ? tlb_flush ?)
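For reference, the part of radeon_bo_list_validate I am referring to looks roughly like this (a simplified sketch from memory, not a verbatim copy; field names and details can differ between kernel versions):

    bytes_moved_threshold = radeon_bo_get_threshold_for_moves(rdev);

    list_for_each_entry(lobj, head, tv.head) {
        struct radeon_bo *bo = lobj->robj;
        u32 domain = lobj->prefered_domains;
        u32 current_domain =
            radeon_mem_type_to_domain(bo->tbo.mem.mem_type);

        /* Once too many bytes have already been moved for this IB, stop
         * migrating bo and validate them in their current domain instead;
         * this is what keeps the ~800 bo above in gtt. */
        if ((lobj->allowed_domains & current_domain) != 0 &&
            (domain & current_domain) == 0 &&
            bytes_moved > bytes_moved_threshold)
            domain = current_domain;

        /* ... ttm_bo_validate() and the bytes_moved accounting follow ... */
    }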
Are you saying that some bos are required to be moved from gtt to vram in order for this fence to be signaled?
As you can see above, it happens when vram_usage >= half_vram, so radeon_bo_get_threshold_for_moves returns 1024*1024, which explains why only 1 or 2 bos can be moved from gtt to vram in that case and why all the others are forced to stay in gtt.
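The heuristic itself is roughly this (my paraphrase of radeon_bo_get_threshold_for_moves in radeon_object.c, so take the exact details with a grain of salt): allow moving about half of the free half of vram per IB, with a 1 MB floor:

    static u64 threshold_for_moves(u64 real_vram_size, u64 vram_usage)
    {
        u64 half_vram = real_vram_size >> 1;
        /* nothing left of the "free half" once usage passes half of vram */
        u64 half_free_vram = vram_usage >= half_vram ? 0 : half_vram - vram_usage;
        u64 bytes_moved_threshold = half_free_vram >> 1;

        /* never go below 1 MB, so at least something can still be moved */
        return max(bytes_moved_threshold, 1024*1024ull);
    }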
In the same run of radeon_bo_list_validate there are many calls to ttm_bo_validate with both domain and current_domain being VRAM; this is the case for around 400 bo. Maybe this causes a delay in signaling this fence, given that vram usage is high too.
Also it seems the fence is signaled by swapper after more than 10 seconds, but it is too late. It requires reducing the "15" param above to 4 to see that.
How does "swapper" (what is that exactly?) signal the fence?
My wording was wrong, sorry; I should have said "the first entity noticing that the fence is signaled" by calling radeon_fence_activity. swapper is the name of process 0 (idle). I changed the drm logging to print the process name and id (current->comm, current->pid).
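Something along these lines (the exact message and the place I added it are just an example; the interesting bits are current->comm and current->pid, which come from <linux/sched.h>):

    /* example trace in radeon_fence_activity(): show which process noticed
     * the fence activity first */
    DRM_INFO("radeon fence: ring %d activity seen by %s (pid %d)\n",
             ring, current->comm, current->pid);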
It might be worth looking into why this happens, though. If domain == current_domain == RADEON_GEM_DOMAIN_VRAM, I wouldn't expect ttm_bo_validate to trigger a blit.
I will check, though I think I just got confused by a previous trace.