On Fri, 12 Mar 2021 19:25:13 +0100 Boris Brezillon boris.brezillon@collabora.com wrote:
So where does this leave us? Well, it depends on your submit model and exactly how you handle pipeline barriers that sync between engines. If you're taking option 3 above and doing two command buffers for each VkCommandBuffer, then you probably want two serialized timelines, one for each engine, and some mechanism to tell the kernel driver "these two command buffers have to run in parallel" so that your ping-pong works. If you're doing 1 or 2 above, I think you probably still want two simple syncobjs, one for each engine. You don't really have any need to go all that far back in history. All you really need to describe is "command buffer X depends on previous compute work" or "command buffer X depends on previous binning work".
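For reference, here is a minimal sketch of what that "two simple syncobjs, one for each engine" model could look like from userspace. submit_job() and struct job are hypothetical stand-ins for the new submit ioctl; only the drmSyncobj*() calls are existing libdrm API:

#include <stdint.h>
#include <xf86drm.h>

struct job {
        uint32_t *in_syncs;     /* syncobjs this job waits on */
        uint32_t in_sync_count;
        uint32_t out_sync;      /* syncobj signaled when the job completes */
};

/* Hypothetical wrapper around the new submit ioctl. */
int submit_job(int fd, const struct job *job);

static uint32_t compute_sync, fragment_sync;

static void queue_init(int fd)
{
        /* One binary syncobj per engine, tracking the last job submitted there. */
        drmSyncobjCreate(fd, DRM_SYNCOBJ_CREATE_SIGNALED, &compute_sync);
        drmSyncobjCreate(fd, DRM_SYNCOBJ_CREATE_SIGNALED, &fragment_sync);
}

static void submit_fragment_after_compute(int fd)
{
        /*
         * "Command buffer X depends on previous compute work": wait on the
         * compute engine's syncobj, then make this job the new tail of the
         * fragment engine.
         */
        struct job job = {
                .in_syncs = (uint32_t[]){ compute_sync },
                .in_sync_count = 1,
                .out_sync = fragment_sync,
        };

        submit_job(fd, &job);
}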
Okay, so this will effectively force in-order execution. Let's take your previous example and add 2 more jobs at the end that have no deps on previous commands:
vkBeginRenderPass() /* Writes to ImageA */
vkCmdDraw()
vkCmdDraw()
...
vkEndRenderPass()
vkPipelineBarrier(imageA /* fragment -> compute */)
vkCmdDispatch() /* reads imageA, writes BufferB */
vkBeginRenderPass() /* Writes to ImageC */
vkCmdBindVertexBuffers(bufferB)
vkCmdDraw();
...
vkEndRenderPass()
vkBeginRenderPass() /* Writes to ImageD */
vkCmdDraw()
...
vkEndRenderPass()
A: Vertex for the first draw on the compute engine
B: Vertex for the second draw on the compute engine
C: Fragment for the first draw on the binning engine; depends on A
D: Fragment for the second draw on the binning engine; depends on B
E: Compute on the compute engine; depends on C and D
F: Vertex for the third draw on the compute engine; depends on E
G: Fragment for the third draw on the binning engine; depends on F
H: Vertex for the fourth draw on the compute engine
I: Fragment for the fourth draw on the binning engine; depends on H
When we reach E, we might be waiting for D to finish before scheduling the job, and because of the implicit serialization we have on the compute queue (F implicitly depends on E, and H on F) we can't schedule H either, even though it could, in theory, be started. I guess that's where the term "submission order" is a bit unclear to me. The action of starting a job sounds like execution order to me (the order you start jobs determines the execution order, since we only have one HW queue per job type). All implicit deps have been calculated when we queued the job to the SW queue, and I thought that would be enough to meet the submission-order requirements, but I might be wrong.
The PoC I have tries to get rid of this explicit serialization on the compute and fragment queues by using one syncobj timeline (queue(<syncpoint>)) and explicit synchronization points (Sx):
S0: in-fences=<waitSemaphores[]>, out-fences=<explicit_deps>         # waitSemaphore sync point
A: in-fences=<explicit_deps>, out-fences=<queue(1)>
B: in-fences=<explicit_deps>, out-fences=<queue(2)>
C: in-fences=<explicit_deps>, out-fence=<queue(3)>                   # implicit dep on A through the tiler context
D: in-fences=<explicit_deps>, out-fence=<queue(4)>                   # implicit dep on B through the tiler context
E: in-fences=<explicit_deps>, out-fence=<queue(5)>                   # implicit dep on D through imageA
F: in-fences=<explicit_deps>, out-fence=<queue(6)>                   # implicit dep on E through bufferB
G: in-fences=<explicit_deps>, out-fence=<queue(7)>                   # implicit dep on F through the tiler context
H: in-fences=<explicit_deps>, out-fence=<queue(8)>
I: in-fences=<explicit_deps>, out-fence=<queue(9)>                   # implicit dep on H through the tiler buffer
S1: in-fences=<queue(9)>, out-fences=<signalSemaphores[],fence>      # signalSemaphore,fence sync point

# QueueWaitIdle is implemented with a wait(queue(0)), AKA wait on the last point
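Expressed as code, a minimal sketch of that PoC could look like the following. submit_timeline_job() and struct timeline_job are hypothetical placeholders for the new submit ioctl; only the drmSyncobj*() calls are existing libdrm API:

#include <stdint.h>
#include <xf86drm.h>

struct timeline_job {
        uint32_t *in_syncs;     /* explicit deps only; implicit deps are
                                 * taken care of by the kernel */
        uint32_t in_sync_count;
        uint32_t out_sync;      /* the queue() timeline syncobj */
        uint64_t out_point;     /* queue(<syncpoint>) */
};

/* Hypothetical wrapper around the new submit ioctl. */
int submit_timeline_job(int fd, const struct timeline_job *job);

static uint32_t queue_sync;     /* queue() timeline */
static uint64_t queue_point;    /* last emitted syncpoint */

static void timeline_queue_init(int fd)
{
        drmSyncobjCreate(fd, 0, &queue_sync);
}

static void timeline_queue_submit(int fd, uint32_t *explicit_deps,
                                  uint32_t dep_count)
{
        /*
         * A..I: each job signals the next point on the queue timeline, so
         * the points stay totally ordered, but the job itself only waits on
         * its own deps, which is what lets H run before E.
         */
        struct timeline_job job = {
                .in_syncs = explicit_deps,
                .in_sync_count = dep_count,
                .out_sync = queue_sync,
                .out_point = ++queue_point,
        };

        submit_timeline_job(fd, &job);
}

static int timeline_queue_wait_idle(int fd)
{
        /* vkQueueWaitIdle(): wait(queue(0)), i.e. a wait on the latest
         * point of the timeline.
         */
        uint64_t point = 0;

        return drmSyncobjTimelineWait(fd, &queue_sync, &point, 1, INT64_MAX,
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                                      NULL);
}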
With this solution, H can be started before E if the compute slot is empty and E's implicit deps are not done yet. It's probably overkill, but I thought maximizing GPU utilization was important.
Never mind, I forgot the drm scheduler dequeues jobs in order, so two syncobjs (one per queue type) is indeed the right approach.