Re: [PATCH 1/1] drm/amdkfd: Do not ignore requested queue size during allocation

30 Nov 2017

On Wed, 2017-11-29 at 16:58 -0500, Felix Kuehling wrote:
...
You can see the state of the queues in debugfs:
/sys/kernel/debug/kfd/... You can look at MQDs and HQDs.
thanks. how do I decode the information?
The rptr always stops at pos 60 which looks like this in mqds:
DIQ on device 45a2
    00000000: c0310800 00004000 00000000 00000000 00000000 00000000 00000000 00000000
    00000020: 00000000 00000000 00000000 00000001 00000000 00000000 00000000 00000000
    00000040: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 ffffffff
    00000060: ffffffff 00000000 ffffffff ffffffff 00000000 00000000 00000000 00000000
If I understood correctly, that's the queue dump, so those ffffffff
values look wrong.
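
A minimal sketch of how a dump in the format above could be split into dwords so the all-ones values can be located by byte offset. It assumes each line is "OFFSET: W0 ... W7" in hex, as in the excerpt. The helper names (parse_dump_line, report_all_ones) are hypothetical and not part of KFD. It does not interpret the fields: mapping an offset to a named MQD field needs the MQD struct definition for the specific ASIC, which is not reproduced here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define WORDS_PER_LINE 8

/*
 * Parse one dump line of the form "OFFSET: W0 W1 ... W7" (all hex).
 * Returns the number of 32-bit words read, or -1 if the offset is missing.
 */
static int parse_dump_line(const char *line, uint32_t *offset,
                           uint32_t words[WORDS_PER_LINE])
{
        char *end;
        unsigned long v;
        int n = 0;

        assert(line && offset && words);

        v = strtoul(line, &end, 16);    /* strtoul skips leading blanks */
        if (end == line || *end != ':')
                return -1;
        *offset = (uint32_t)v;
        line = end + 1;

        while (n < WORDS_PER_LINE) {
                v = strtoul(line, &end, 16);
                if (end == line)
                        break;          /* no more hex words on this line */
                words[n++] = (uint32_t)v;
                line = end;
        }
        return n;
}

/*
 * Print the byte offset of every word equal to 0xffffffff and return
 * how many were found.
 */
static int report_all_ones(uint32_t offset, const uint32_t *words, int n)
{
        int i, hits = 0;

        for (i = 0; i < n; i++) {
                if (words[i] == 0xffffffffu) {
                        printf("0x%03x: 0xffffffff\n",
                               (unsigned)(offset + 4u * (uint32_t)i));
                        hits++;
                }
        }
        return hits;
}
```

Fed the 00000040 and 00000060 lines from the dump above, this would flag byte offsets 0x05c, 0x060, 0x068 and 0x06c.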
...
If your application isn't stopping queues deliberately, queues get
disabled by evictions, usually temporarily. You'll see kernel messages
when that happens.
A VM fault will result in queues of the offending process getting
disabled permanently. Again, you'll see messages about that in the
kernel log.
The RPTR can also stop advancing if you have an infinite loop in a
shader program, or just a shader that takes a very long time to execute.
Or maybe if you have some dependencies (barriers) in your AQL packets
that never get satisfied.
The function you changed only affects the HIQ, the queue that KFD uses
to control the HWS. It does not affect user mode queues. If your problem
is with a user mode queue, your change should have no effect at all.
It's not a userspace queue that stops. I'm using kernel dbgdev to issue
wave_resume commands. (waves are halted after executing
s_sendmsg_halt).
I bumped KFD_KERNEL_QUEUE_SIZE to 16KB to make sure all 320 resume
commands fit (otherwise I get spurious ENOMEM when the queue is full but
still advancing).
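
A simplified model of the "full but still advancing" case, for illustration only. The struct and function names are hypothetical and this is not the KFD code. It assumes the common ring-buffer convention of keeping one slot empty so that a full ring and an empty ring can be told apart.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical ring state; all positions are in dwords. */
struct ring {
        uint32_t size_dwords;   /* total ring capacity in dwords */
        uint32_t rptr;          /* consumer position */
        uint32_t wptr;          /* producer position */
};

/* Free dwords, keeping one slot empty so full and empty differ. */
static uint32_t ring_free_dwords(const struct ring *r)
{
        return (r->rptr + r->size_dwords - 1 - r->wptr) % r->size_dwords;
}

/* Reserve space; -ENOMEM here means "full right now", not "hung". */
static int ring_reserve(struct ring *r, uint32_t packet_dwords)
{
        assert(r->size_dwords > 0);
        if (packet_dwords > ring_free_dwords(r))
                return -ENOMEM;
        r->wptr = (r->wptr + packet_dwords) % r->size_dwords;
        return 0;
}
```

In this model a reservation that fails with -ENOMEM succeeds once rptr has moved far enough, so a caller can either retry or use a ring large enough to hold the whole batch at once.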
thanks,
Jan
...
Regards,
  Felix
On 2017-11-29 04:43 PM, Jan Vesely wrote:
...
On Mon, 2017-11-20 at 14:22 -0500, Felix Kuehling wrote:
...
I think this patch is not correct. The EOP-mem is not associated with
the queue size. The EOP buffer is a separate buffer used by the firmware
to handle command completion. As I understand it, this allows more
concurrency, while still making it look like all commands in the queue
are completing in order.
thanks for the explanation. I was looking for a source of a CP hang
(rptr stops advancing), but bumping the eop size actually made things
worse. Is there a way to find out if a queue got disabled and for what
reason? (I'm running ROCK-1.6.x based kernel)
thanks,
Jan
...
Regards,
  Felix
On 2017-11-19 03:19 AM, Oded Gabbay wrote:
...
On Thu, Nov 16, 2017 at 11:36 PM, Jan Vesely <jan.vesely@rutgers.edu> wrote:
...
Signed-off-by: Jan Vesely <jan.vesely@rutgers.edu>
drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c
index f1d48281e322..b3bee39661ab 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_kernel_queue_vi.c
@@ -37,15 +37,16 @@ static bool initialize_vi(struct kernel_queue *kq, struct kfd_dev *dev,
                        enum kfd_queue_type type, unsigned int queue_size)
 {
 	int retval;
+	unsigned int size = ALIGN(queue_size, PAGE_SIZE);
 
-	retval = kfd_gtt_sa_allocate(dev, PAGE_SIZE, &kq->eop_mem);
+	retval = kfd_gtt_sa_allocate(dev, size, &kq->eop_mem);
 	if (retval != 0)
 		return false;
 
 	kq->eop_gpu_addr = kq->eop_mem->gpu_addr;
 	kq->eop_kernel_addr = kq->eop_mem->cpu_ptr;
 
-	memset(kq->eop_kernel_addr, 0, PAGE_SIZE);
+	memset(kq->eop_kernel_addr, 0, size);
 
 	return true;
 }
2.13.6
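
For reference, a user-space sketch of the rounding that ALIGN(queue_size, PAGE_SIZE) performs in the patch above. The names align_up and SKETCH_PAGE_SIZE are hypothetical, and the 4096-byte page size is an assumption; the kernel macro itself is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, hypothetical name */

/* Round x up to the next multiple of a; a must be a power of two. */
static size_t align_up(size_t x, size_t a)
{
        assert(a != 0 && (a & (a - 1)) == 0);
        return (x + a - 1) & ~(a - 1);
}
```

Under that assumption, a requested size of 1 byte rounds up to one page, 4097 bytes rounds up to two pages, and a 16KB request is already page-aligned, so the patched code would allocate and clear four pages of EOP memory instead of one.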

amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
Thanks!
Applied to -next tree
Oded