On Fri, 12 Mar 2021 19:25:13 +0100 Boris Brezillon boris.brezillon@collabora.com wrote:
So where does this leave us? Well, it depends on your submit model and exactly how you handle pipeline barriers that sync between engines. If you're taking option 3 above and doing two command buffers for each VkCommandBuffer, then you probably want two serialized timelines, one for each engine, and some mechanism to tell the kernel driver "these two command buffers have to run in parallel" so that your ping-pong works. If you're doing 1 or 2 above, I think you probably still want two simple syncobjs, one for each engine. You don't really have any need to go all that far back in history. All you really need to describe is "command buffer X depends on previous compute work" or "command buffer X depends on previous binning work".
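For reference, here is a minimal sketch of what that "two simple syncobjs, one for each engine" model could look like from userspace. submit_job() and struct job are hypothetical stand-ins for the new submit ioctl; only the drmSyncobj*() calls are existing libdrm API:

#include <stdint.h>
#include <xf86drm.h>

struct job {
        uint32_t *in_syncs;     /* syncobjs this job waits on */
        uint32_t in_sync_count;
        uint32_t out_sync;      /* syncobj signaled when the job completes */
};

/* Hypothetical wrapper around the new submit ioctl. */
int submit_job(int fd, const struct job *job);

static uint32_t compute_sync, fragment_sync;

static void queue_init(int fd)
{
        /* One binary syncobj per engine, tracking the last job submitted there. */
        drmSyncobjCreate(fd, DRM_SYNCOBJ_CREATE_SIGNALED, &compute_sync);
        drmSyncobjCreate(fd, DRM_SYNCOBJ_CREATE_SIGNALED, &fragment_sync);
}

static void submit_fragment_after_compute(int fd)
{
        /*
         * "Command buffer X depends on previous compute work": wait on the
         * compute engine's syncobj, then make this job the new tail of the
         * fragment engine.
         */
        struct job job = {
                .in_syncs = (uint32_t[]){ compute_sync },
                .in_sync_count = 1,
                .out_sync = fragment_sync,
        };

        submit_job(fd, &job);
}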
Okay, so this will effectively force in-order execution. Let's take your previous example and add 2 more jobs at the end that have no deps on previous commands:
vkBeginRenderPass() /* Writes to ImageA */
vkCmdDraw()
vkCmdDraw()
...
vkEndRenderPass()
vkPipelineBarrier(imageA /* fragment -> compute */)
vkCmdDispatch() /* reads imageA, writes BufferB */
vkBeginRenderPass() /* Writes to ImageC */
vkCmdBindVertexBuffers(bufferB)
vkCmdDraw();
...
vkEndRenderPass()
vkBeginRenderPass() /* Writes to ImageD */
vkCmdDraw()
...
vkEndRenderPass()
A: Vertex for the first draw on the compute engine
B: Vertex for the second draw on the compute engine
C: Fragment for the first draw on the binning engine; depends on A
D: Fragment for the second draw on the binning engine; depends on B
E: Compute on the compute engine; depends on C and D
F: Vertex for the third draw on the compute engine; depends on E
G: Fragment for the third draw on the binning engine; depends on F
H: Vertex for the fourth draw on the compute engine
I: Fragment for the fourth draw on the binning engine; depends on H
When we reach E, we might be waiting for D to finish before scheduling the job, and because of the implicit serialization we have on the compute queue (F implicitly depends on E, and H on F) we can't schedule H either, even though it could, in theory, be started. I guess that's where the term "submission order" is a bit unclear to me. The action of starting a job sounds like execution order to me (the order you start jobs determines the execution order, since we only have one HW queue per job type). All implicit deps have been calculated when we queued the job to the SW queue, and I thought that would be enough to meet the submission-order requirements, but I might be wrong.
The PoC I have tries to get rid of this explicit serialization on the compute and fragment queues by using one syncobj timeline (queue(<syncpoint>)) and explicit synchronization points (Sx):
S0: in-fences=<waitSemaphores[]>, out-fences=<explicit_deps>         # waitSemaphore sync point
A: in-fences=<explicit_deps>, out-fences=<queue(1)>
B: in-fences=<explicit_deps>, out-fences=<queue(2)>
C: in-fences=<explicit_deps>, out-fence=<queue(3)>                   # implicit dep on A through the tiler context
D: in-fences=<explicit_deps>, out-fence=<queue(4)>                   # implicit dep on B through the tiler context
E: in-fences=<explicit_deps>, out-fence=<queue(5)>                   # implicit dep on D through imageA
F: in-fences=<explicit_deps>, out-fence=<queue(6)>                   # implicit dep on E through bufferB
G: in-fences=<explicit_deps>, out-fence=<queue(7)>                   # implicit dep on F through the tiler context
H: in-fences=<explicit_deps>, out-fence=<queue(8)>
I: in-fences=<explicit_deps>, out-fence=<queue(9)>                   # implicit dep on H through the tiler buffer
S1: in-fences=<queue(9)>, out-fences=<signalSemaphores[],fence>      # signalSemaphore,fence sync point

# QueueWaitIdle is implemented with a wait(queue(0)), AKA wait on the last point
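Expressed as code, a minimal sketch of that PoC could look like the following. submit_timeline_job() and struct timeline_job are hypothetical placeholders for the new submit ioctl; only the drmSyncobj*() calls are existing libdrm API:

#include <stdint.h>
#include <xf86drm.h>

struct timeline_job {
        uint32_t *in_syncs;     /* explicit deps only; implicit deps are
                                 * taken care of by the kernel */
        uint32_t in_sync_count;
        uint32_t out_sync;      /* the queue() timeline syncobj */
        uint64_t out_point;     /* queue(<syncpoint>) */
};

/* Hypothetical wrapper around the new submit ioctl. */
int submit_timeline_job(int fd, const struct timeline_job *job);

static uint32_t queue_sync;     /* queue() timeline */
static uint64_t queue_point;    /* last emitted syncpoint */

static void timeline_queue_init(int fd)
{
        drmSyncobjCreate(fd, 0, &queue_sync);
}

static void timeline_queue_submit(int fd, uint32_t *explicit_deps,
                                  uint32_t dep_count)
{
        /*
         * A..I: each job signals the next point on the queue timeline, so
         * the points stay totally ordered, but the job itself only waits on
         * its own deps, which is what lets H run before E.
         */
        struct timeline_job job = {
                .in_syncs = explicit_deps,
                .in_sync_count = dep_count,
                .out_sync = queue_sync,
                .out_point = ++queue_point,
        };

        submit_timeline_job(fd, &job);
}

static int timeline_queue_wait_idle(int fd)
{
        /* vkQueueWaitIdle(): wait(queue(0)), i.e. a wait on the latest
         * point of the timeline.
         */
        uint64_t point = 0;

        return drmSyncobjTimelineWait(fd, &queue_sync, &point, 1, INT64_MAX,
                                      DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                                      NULL);
}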
With this solution, H can be started before E if the compute slot is empty and E's implicit deps are not done yet. It's probably overkill, but I thought maximizing GPU utilization was important.
Never mind, I forgot the drm scheduler dequeues jobs in order, so two syncobjs (one per queue type) is indeed the right approach.