Timeline semaphore waits (polling on memory) will be unmonitored and as fast as the round trip to memory. Semaphore writes will be slower because a copy of each write request will also be forwarded to the kernel. Arbitrary writes are not protected by the hardware, but the kernel will take action against such behavior because it receives those writes too.
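For the wait side, a minimal sketch of what such an unmonitored, memory-polling wait could look like from userspace, assuming the semaphore payload is a monotonically increasing 64-bit value mapped into the process; the function name and backoff policy are assumptions, not the real implementation:

    #include <stdint.h>
    #include <stdatomic.h>
    #include <sched.h>

    /* Unmonitored timeline-semaphore wait: just poll the mapped 64-bit
     * payload until it reaches the requested value. Nothing is reported
     * to the kernel. */
    static void timeline_wait(_Atomic uint64_t *payload, uint64_t value)
    {
            /* Signals are monotonic, so "wait for value" means payload >= value. */
            while (atomic_load_explicit(payload, memory_order_acquire) < value)
                    sched_yield();  /* toy backoff; a real driver would sleep or use an interrupt */
    }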
I don't know if that would work with dma_fence.
Marek
On Thu, Jun 17, 2021 at 3:04 PM Daniel Vetter daniel@ffwll.ch wrote:
On Thu, Jun 17, 2021 at 02:28:06PM -0400, Marek Olšák wrote:
The kernel will know who should touch the implicit-sync semaphore next, and at the same time, a copy of all write requests to the implicit-sync semaphore will be forwarded to the kernel for monitoring and bo_wait.

Syncobjs could either use the same monitored access as implicit sync or be completely unmonitored. We haven't decided yet.
Syncfiles could either use one of the above or wait for a syncobj to go idle before converting to a syncfile.
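To make the monitoring idea concrete, here is a hypothetical sketch of what the kernel-side check of a forwarded write request might look like; every struct and helper name below is invented for illustration and is not an existing amdgpu/drm interface:

    #include <linux/types.h>

    /* Hypothetical: one entry of the write log the hardware forwards to
     * the kernel for every implicit-sync semaphore write. */
    struct implicit_sync_write_event {
            u32 syncobj_handle;  /* which semaphore was written       */
            u32 vmid;            /* which process/VM issued the write */
            u64 seqno;           /* value that was written            */
    };

    /* Hypothetical per-semaphore state the kernel keeps for monitoring/bo_wait. */
    struct implicit_sync_state {
            u32 expected_vmid;   /* who should touch the semaphore next */
            u64 expected_seqno;  /* kernel-assigned sequence number     */
    };

    /* Hypothetical stub: in practice this would kill/blame the offending context. */
    static void punish_malicious_writer(u32 vmid) { (void)vmid; }

    static void check_semaphore_write(struct implicit_sync_state *state,
                                      const struct implicit_sync_write_event *ev)
    {
            /* Only the expected next signaler may write, and only the value the
             * kernel handed out; anything else is treated as malicious. */
            if (ev->vmid != state->expected_vmid || ev->seqno != state->expected_seqno)
                    punish_malicious_writer(ev->vmid);
            else
                    state->expected_seqno++;  /* advance the timeline */
    }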
Hm, this all sounds like you're planning to completely rewrap everything ... I'm assuming the plan is still that this is going to be largely wrapped in dma_fence? Maybe with timeline objects being a bit more optimized, but I'm not sure how much you can optimize without breaking the interfaces.
-Daniel
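For context, a rough sketch under heavy assumptions (not an existing driver) of what wrapping a memory-backed user fence in a dma_fence could look like, with .signaled() peeking at the payload; whether this kind of wrapping is even allowed under the dma_fence rules is exactly what's in question here:

    #include <linux/dma-fence.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct user_fence {
            struct dma_fence base;
            spinlock_t lock;
            u64 *payload;    /* kernel mapping of the user-fence memory */
            u64 wait_value;  /* "signaled" once *payload >= wait_value  */
    };

    static const char *uf_driver_name(struct dma_fence *f)
    {
            return "example";
    }

    static const char *uf_timeline_name(struct dma_fence *f)
    {
            return "user-fence";
    }

    static bool uf_signaled(struct dma_fence *f)
    {
            struct user_fence *uf = container_of(f, struct user_fence, base);

            return READ_ONCE(*uf->payload) >= uf->wait_value;
    }

    static const struct dma_fence_ops uf_ops = {
            .get_driver_name   = uf_driver_name,
            .get_timeline_name = uf_timeline_name,
            .signaled          = uf_signaled,
    };

    static struct dma_fence *user_fence_create(u64 *payload, u64 wait_value, u64 context)
    {
            struct user_fence *uf = kzalloc(sizeof(*uf), GFP_KERNEL);

            if (!uf)
                    return NULL;
            spin_lock_init(&uf->lock);
            uf->payload = payload;
            uf->wait_value = wait_value;
            dma_fence_init(&uf->base, &uf_ops, &uf->lock, context, wait_value);
            return &uf->base;
    }

    /* Caveat: something (an interrupt or a polling worker) still has to call
     * dma_fence_signal() in bounded time, which is exactly the hard part for
     * user fences. */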
Marek
On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter daniel@ffwll.ch wrote:
On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
As long as we can figure out who touched a certain sync object last, that would indeed work, yes.
Don't you need to know who will touch it next, i.e. who is holding up your fence? Or maybe I'm just again totally confused.
-Daniel
Christian.
Am 14.06.21 um 19:10 schrieb Marek Olšák:
The call to the hw scheduler has a limitation on the size of all parameters combined. I think we can only pass a 32-bit sequence number and a ~16-bit global (per-GPU) syncobj handle in one call and not much else.

The syncobj handle can be an element index in a global (per-GPU) syncobj table, and it's read-only for all processes with the exception of the signal command. Syncobjs can either have per-VMID write access flags for the signal command (slow), or any process can write to any syncobj and only rely on the kernel checking the write log (fast).

In any case, we can execute the memory write in the queue engine and only use the hw scheduler for logging, which would be perfect.
Marek
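As a hypothetical illustration of the size constraint Marek describes above; the field names and widths are assumptions, not real firmware ABI:

    #include <stdint.h>

    /* ~16-bit handle space: the syncobj handle is an index into a global,
     * per-GPU table that is read-only to processes except via the signal path. */
    #define GLOBAL_SYNCOBJ_TABLE_SIZE  (1u << 16)

    struct global_syncobj {
            uint64_t payload;  /* memory-backed sequence value */
    };

    static struct global_syncobj syncobj_table[GLOBAL_SYNCOBJ_TABLE_SIZE];

    /* Roughly everything a single call to the hw scheduler could carry: */
    struct hw_signal_request {
            uint16_t syncobj_handle;  /* index into the per-GPU syncobj table   */
            uint32_t seqno;           /* kernel-assigned 32-bit sequence number */
    };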
On Thu, Jun 10, 2021 at 12:33 PM Christian König <ckoenig.leichtzumerken@gmail.com> wrote:
Hi guys,

maybe soften that a bit. Reading from the shared memory of the user fence is ok for everybody. What we need to take more care of is the writing side.

So my current thinking is that we allow read-only access, but writing a new sequence value needs to go through the scheduler/kernel.

So when the CPU wants to signal a timeline fence it needs to call an IOCTL. When the GPU wants to signal the timeline fence it needs to hand that off to the hardware scheduler.

If we lock up, the kernel can check with the hardware who did the last write and what value was written. That, together with an IOCTL to give out the sequence number for implicit sync to applications, should be sufficient for the kernel to track who is responsible if something bad happens.

In other words, when the hardware says that the shader wrote stuff like 0xdeadbeef, 0x0 or 0xffffffff into memory, we kill the process which did that. If the hardware says that seq - 1 was written fine but seq is missing, then the kernel blames whoever was supposed to write seq.
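A purely hypothetical sketch of the CPU-side signal path described above; the ioctl number, struct name and fields are invented for illustration and are not an existing amdgpu/drm uAPI:

    #include <stdint.h>
    #include <sys/ioctl.h>

    /* Hypothetical uAPI: userspace may map and read the fence memory freely,
     * but a new sequence value is only ever written by the kernel/scheduler,
     * so every write is privileged and logged. */
    struct example_user_fence_signal {
            uint32_t syncobj_handle;  /* which user fence to signal       */
            uint32_t pad;
            uint64_t seqno;           /* new sequence value to be written */
    };

    #define EXAMPLE_IOCTL_USER_FENCE_SIGNAL \
            _IOW('d', 0x60, struct example_user_fence_signal)

    /* CPU-side signal: ask the kernel to perform the write. */
    static int user_fence_signal_cpu(int drm_fd, uint32_t handle, uint64_t seqno)
    {
            struct example_user_fence_signal args = {
                    .syncobj_handle = handle,
                    .seqno = seqno,
            };

            return ioctl(drm_fd, EXAMPLE_IOCTL_USER_FENCE_SIGNAL, &args);
    }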
Just piping the write through a privileged instance should be fine to make sure that we don't run into issues.

Christian.

Am 10.06.21 um 17:59 schrieb Marek Olšák:
Hi Daniel,

We just talked about this whole topic internally and came to the conclusion that the hardware needs to understand sync object handles and have high-level wait and signal operations in the command stream. Sync objects will be backed by memory, but they won't be readable or writable by processes directly. The hardware will log all accesses to sync objects and will send the log to the kernel periodically. The kernel will identify malicious behavior.

Example of a hardware command stream:
...
ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is assigned by the kernel
Draw();
ImplicitSyncSignalWhenDone(syncObjHandle);
...

I'm afraid we have no other choice because of the TLB invalidation overhead.

Marek

On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter <daniel@ffwll.ch> wrote:
On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
> Am 09.06.21 um 15:19 schrieb Daniel Vetter:
> > [SNIP]
> > > Yeah, we call this the lightweight and the heavyweight tlb flush.
> > >
> > > The lightweight can be used when you are sure that you don't have any of
> > > the PTEs currently in flight in the 3D/DMA engine and you just need to
> > > invalidate the TLB.
> > >
> > > The heavyweight must be used when you need to invalidate the TLB *AND*
> > > make sure that no concurrent operation moves new stuff into the TLB.
> > >
> > > The problem is for this use case we have to use the heavyweight one.
> >
> > Just for my own curiosity: So the lightweight flush is only for in-between
> > CS when you know access is idle? Or does that also not work if userspace
> > has a CS on a dma engine going at the same time because the TLBs aren't
> > isolated enough between engines?
>
> More or less correct, yes.
>
> The problem is a lightweight flush only invalidates the TLB, but doesn't
> take care of entries which have been handed out to the different engines.
>
> In other words what can happen is the following:
>
> 1. Shader asks TLB to resolve address X.
> 2. TLB looks into its cache, can't find address X, so it asks the walker to
> resolve it.
> 3. Walker comes back with the result for address X and the TLB puts that
> into its cache and gives it to the shader.
> 4. Shader starts doing some operation using the result for address X.
> 5. You send a lightweight TLB invalidate and the TLB throws away the cached
> values for address X.
> 6. Shader happily still uses whatever the TLB gave to it in step 3 to
> access address X.
>
> See it like the shader has its own 1-entry L0 TLB cache which is not
> affected by the lightweight flush.
>
> The heavyweight flush on the other hand sends out a broadcast signal to
> everybody and only comes back when we are sure that an address is not in
> use any more.

Ah, makes sense. On Intel the shaders only operate in VA; everything goes around as explicit async messages to IO blocks. So we don't have this, and the only difference in TLB flushes is between a TLB flush in the IB and an MMIO one, which is independent of anything currently being executed on an engine.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
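To make the lightweight/heavyweight distinction above concrete, here is a purely illustrative toy model (not driver code; all names are invented) of why the lightweight flush is insufficient in this scenario:

    #include <stdbool.h>
    #include <stdint.h>

    /* Toy model: the central TLB cache plus a private "L0" copy of the last
     * translation each engine was handed (step 3 above). */
    struct tlb    { bool has_entry; uint64_t translation; };
    struct engine { bool has_l0_entry; uint64_t l0_translation; };

    /* Lightweight flush: only the central TLB drops its cached entries; the
     * per-engine L0 copies survive, so step 6 can still use a stale mapping. */
    static void lightweight_flush(struct tlb *tlb, struct engine *engines, int n)
    {
            (void)engines; (void)n;
            tlb->has_entry = false;
    }

    /* Heavyweight flush: broadcast to every engine and only return once each
     * one has dropped (acknowledged) any previously handed-out translation. */
    static void heavyweight_flush(struct tlb *tlb, struct engine *engines, int n)
    {
            tlb->has_entry = false;
            for (int i = 0; i < n; i++)
                    engines[i].has_l0_entry = false;  /* real hw waits for an ack here */
    }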