Hi,
Since Christian believes that, with some changes, the kernel itself can be protected from deadlocks, we just need to make everything nice for userspace too. Instead of explaining how it will work, I will explain the cases where future hardware (and its kernel driver) will break existing userspace in order to protect everybody from deadlocks. Anything that uses implicit sync will be spared, so X and Wayland will be fine, assuming they don't import/export fences. Those use cases that do import/export fences might or might not work, depending on how the fences are used.
One of the necessities is that all fences will become future fences. The semantics of imported/exported fences will change completely and will have new restrictions on the usage. The restrictions are:
1) Android sync files will be impossible to support, so they won't be supported. (They don't allow future fences.)
2) Implicit sync and explicit sync will be mutually exclusive between processes. A process can use either one or the other, but not both. This is meant to prevent a deadlock condition with future fences where any process could malevolently deadlock execution of any other process, even execution of a higher-privileged process. The kernel will impose the following restrictions to protect against the deadlock:
a) a process with an implicitly-sync'd imported/exported buffer can't import/export a fence from/to another process
b) a process with an imported/exported fence can't import/export an implicitly-sync'd buffer from/to another process
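In kernel terms, the checks could look like this minimal sketch (every name here is invented for illustration; this is not an existing interface):

/* Sketch of the proposed kernel-side exclusivity check. */
static int check_sync_exclusivity(struct drm_process_state *proc,
                                  bool importing_fence,
                                  bool importing_implicit_bo)
{
        if (proc->has_implicit_synced_bo && importing_fence)
                return -EPERM;  /* restriction a) */
        if (proc->has_imported_fence && importing_implicit_bo)
                return -EPERM;  /* restriction b) */
        return 0;
}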
Alternative: A higher-privileged process could enforce both restrictions instead of the kernel to protect itself from the deadlock, but this would be a can of worms for existing userspace. It would be better if the kernel just broke unsafe userspace on future hw, just like sync files.
If both implicit and explicit sync are allowed to occur simultaneously, sending a future fence that will never signal to any process will deadlock that process after it acquires the implicit sync lock, which is a sequence number that the process is required to write to memory, followed by an interrupt from the GPU, within a finite time. This is how the deadlock can happen:
* The process gets sequence number N from the kernel for an implicitly-sync'd buffer.
* The process inserts (into the GPU user-mapped queue) a wait for sequence number N-1.
* The process inserts a wait for a fence, but it doesn't know that it will never signal ==> deadlock.
...
* The process inserts a command to write sequence number N to a predetermined memory location. (This will make the buffer idle and send an interrupt to the kernel.)
...
* The kernel will terminate the process because it has never received the interrupt. (I.e., a less-privileged process just killed a more-privileged process.)
The termination is caused by the implicit sync interrupt that never arrived, and the only way another process can cause that is by sending a fence that will never signal. Thus, importing/exporting fences from/to other processes can't be allowed simultaneously with implicit sync.
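As pseudocode, the fatal contents of the user-mapped queue would be roughly the following (all function names invented for illustration):

/* Pseudocode of the user-mapped queue in the deadlock above. */
queue_wait_seqno(q, N - 1);          /* wait for the previous buffer use */
queue_wait_fence(q, imported_fence); /* never signals => stuck here */
queue_emit_rendering(q);
queue_write_seqno(q, N);             /* never reached, so the buffer */
queue_emit_interrupt(q);             /* never idles, no IRQ is sent, and
                                        the kernel kills the process */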
3) Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
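For option a), a minimal compositor-side sketch, assuming the client buffer's GPU writes are covered by fences on its dma-buf fd (the buffer struct is invented; the POLLIN semantics of dma-buf fds are existing kernel behavior):

#include <poll.h>

/* Sketch: pick the buffer to present.  POLLIN on a dma-buf fd becomes
 * ready once the buffer is readable, i.e. pending GPU writes finished. */
struct buffer *pick_scanout_buffer(struct buffer *latest,
                                   struct buffer *previous,
                                   int timeout_ms)
{
        struct pollfd pfd = { .fd = latest->dmabuf_fd, .events = POLLIN };

        if (poll(&pfd, 1, timeout_ms) == 1 && (pfd.revents & POLLIN))
                return latest;   /* rendering finished in time */
        return previous;         /* timed out: keep the old content */
}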
*Redefining imported/exported fences and breaking some users/OSs is the only way to have userspace GPU command submission, and the deadlock example here is the counterexample proving that there is no other way.*
So, what are the chances this is going to fly with the ecosystem?
Thanks, Marek
Hi Marek,
well, I don't think that implicit and explicit synchronization need to be mutually exclusive.
What we should do is have the ability to transport a synchronization object with each BO.
Implicit and explicit synchronization then basically become the same; they just transport the synchronization object differently.
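For illustration, a hypothetical submission structure where each BO carries an optional synchronization object handle (invented names, not an existing UAPI):

/* Hypothetical UAPI sketch: each BO in a submission can carry a sync
 * object, so implicit sync just means the kernel attaches/fetches the
 * object for you, while explicit sync passes it around by handle. */
struct drm_submit_bo {
        __u32 bo_handle;
        __u32 syncobj_handle;  /* 0 = let the kernel handle it implicitly */
        __u64 flags;           /* e.g. read vs. write access */
};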
The biggest problem is the sync_files for Android, since they are really not easy to support at all. If Android wants to support user queues, we would probably have to do some changes there.
Regards, Christian.
If both implicit and explicit synchronization are handled the same, then the kernel won't be able to identify the process that caused an implicit sync deadlock. The process that is stuck waiting for a fence can be innocent, and the kernel can't punish it. Likewise, the GPU reset query that reports which process is guilty and which is innocent will only be able to report unknown. Is that OK?
Marek
My first email can be ignored except for the sync files. Oh well.
I think I see what you mean, Christian. If we assume that an imported fence is always read-only (the buffer with the sequence number is read-only), only the process that created and exported the fence can signal it. If the fence is not signaled, the exporting process is guilty. The only thing the importing process must do when it's about to use the fence as a dependency is to notify the kernel about it. Thus, the kernel will always know the dependency graph. Then, if the importing process times out, the kernel will blame any of the processes that passed it a fence that is still unsignaled. The kernel will blame the process that timed out only if all imported fences have been signaled. It seems pretty robust.
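In pseudocode, the blame decision could look like this sketch (invented names):

/* Sketch of the kernel's blame decision when a process times out. */
static struct process *find_guilty(struct process *waiter)
{
        struct imported_fence *f;

        list_for_each_entry(f, &waiter->imported_fences, node) {
                if (!fence_is_signaled(f))
                        return f->exporter;  /* blame the exporter */
        }
        /* All imported fences signaled: the waiter itself is guilty. */
        return waiter;
}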
It's the same with implicit sync except that the buffer with the sequence number is writable. Any process that has an implicitly-sync'd buffer can set the sequence number to 0 or UINT64_MAX. 0 will cause a timeout for the next job, while UINT64_MAX might cause a timeout a little later. The timeout can be mitigated by the kernel because the kernel knows the greatest number that should be there, but it's not possible to know which process is guilty (all processes holding the buffer handle would be suspects).
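The mitigation could be a simple clamp, assuming the kernel tracks the highest sequence number it handed out for the buffer (a sketch with invented names):

/* Sketch: bound the damage from a corrupted implicit-sync sequence
 * number.  max_issued is the highest value the kernel ever handed out. */
uint64_t seq = READ_ONCE(*buf->seqno_ptr);

if (seq > buf->max_issued)
        seq = buf->max_issued;  /* UINT64_MAX-style corruption: clamp */
/* A value rewound to 0 still stalls the next job until its timeout,
 * and any process holding the buffer handle could have written it. */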
Marek
Yes, exactly that's my thinking and also the reason why I'm pondering so hard on the requirement that the memory for shared user fences should not be modifiable by userspace directly.
Christian.
On 2021-05-27 11:51 p.m., Marek Olšák wrote:
- Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
FWIW, this is working with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).
On 01.06.21 at 11:02, Michel Dänzer wrote:
FWIW, this is working with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).
Yeah, that is really nice to have.
One question is whether you wait on the CPU or the GPU for the new surface to become available? The former is a bit bad for latency and power management.
Another question is whether that is sufficient as security for the display server, or whether we need further handling down the road? I mean, essentially we are moving the reliability problem into the display server.
Regards, Christian.
On 2021-06-01 12:21 p.m., Christian König wrote:
Yeah, that is really nice to have.
One question is whether you wait on the CPU or the GPU for the new surface to become available?
It's based on polling dma-buf fds, i.e. CPU.
The former is a bit bad for latency and power management.
There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
Another question is whether that is sufficient as security for the display server, or whether we need further handling down the road? I mean, essentially we are moving the reliability problem into the display server.
Good question. This should generally protect the display server from freezing due to client fences never signalling, but there might still be corner cases.
On 01.06.21 at 12:49, Michel Dänzer wrote:
There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
Well, let me describe it like this:
We have use cases with a guaranteed 144 Hz refresh rate. That essentially means that the client application needs to be able to spit out one frame/window content every ~6.9 ms. That's tough, but doable.
When you now add 6 ms of latency in the compositor, the client application has only 0.9 ms left for its frame, which is basically impossible.
See, for the user fence handling, the display engine will learn to read sequence numbers from memory and decide on its own whether the old frame or the new one is scanned out, to get the latency there as low as possible.
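Roughly, in invented pseudocode:

/* Sketch of the flip decision described above: the display engine
 * reads the user-fence sequence number itself at scanout time. */
uint64_t seq = READ_ONCE(*fence_seqno_ptr);   /* advanced by the GPU */

if (seq >= new_frame->required_seqno)
        scan_out(new_frame);
else
        scan_out(old_frame);  /* new frame not ready yet */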
On Tue, Jun 1, 2021 at 2:10 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Well, let me describe it like this:
We have use cases with a guaranteed 144 Hz refresh rate. That essentially means that the client application needs to be able to spit out one frame/window content every ~6.9 ms. That's tough, but doable.
When you now add 6 ms of latency in the compositor, the client application has only 0.9 ms left for its frame, which is basically impossible.
See, for the user fence handling, the display engine will learn to read sequence numbers from memory and decide on its own whether the old frame or the new one is scanned out, to get the latency there as low as possible.
This won't work with implicit sync at all.
If you want to enable this use-case with driver magic and without the compositor being aware of what's going on, the solution is EGLStreams. Not sure we want to go there, but it's definitely a lot more feasible than trying to stuff eglstreams semantics into dma-buf implicit fencing support in a desperate attempt to not change compositors.
I still think the most reasonable approach here is that we wrap a dma_fence compat layer/mode over new hw for existing userspace/compositors, and then enable userspace memory fences and the fancy new features those allow with a new model that's built for them. Also, even with dma_fence we could implement your model of staying with the previous buffer (or an older buffer that's already rendered), but it needs explicit involvement of the compositor, at least without adding an eglstreams fd to the kernel and wiring up all the relevant extensions. -Daniel
On 01.06.21 at 14:30, Daniel Vetter wrote:
This won't work with implicit sync at all.
If you want to enable this use-case with driver magic and without the compositor being aware of what's going on, the solution is EGLStreams. Not sure we want to go there, but it's definitely a lot more feasible than trying to stuff eglstreams semantics into dma-buf implicit fencing support in a desperate attempt to not change compositors.
Well, not changing compositors is certainly not something I would try with this use case.
Not changing compositors is more like: OK, we have Ubuntu 20.04 and need to support that with the newest hardware generation.
I still think the most reasonable approach here is that we wrap a dma_fence compat layer/mode over new hw for existing userspace/compositors, and then enable userspace memory fences and the fancy new features those allow with a new model that's built for them.
Yeah, that's basically the same direction I'm heading. Question is how to fix all those details.
Also, even with dma_fence we could implement your model of staying with the previous buffer (or an older buffer that's already rendered), but it needs explicit involvement of the compositor, at least without adding an eglstreams fd to the kernel and wiring up all the relevant extensions.
Question is do we already have some extension which allows different textures to be selected on the fly depending on some state?
E.g. something like: use the new frame if it's available, and the old frame otherwise.
Whether you then apply this to the standard dma_fence-based hardware or the new user-fence-based one is pretty much irrelevant.
Regards, Christian.
On Tue., Jun. 1, 2021, 08:51 Christian König, < ckoenig.leichtzumerken@gmail.com> wrote:
Question is do we already have some extension which allows different textures to be selected on the fly depending on some state?
There is no such extension for sync objects, but it can be done with queries, like occlusion queries. There is also no timeout option, and it can only do "if" and "if not", but not "if .. else".
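For illustration, the two-pass "if" / "if not" pattern with OpenGL conditional rendering might look like this (this sketch assumes GL_ARB_conditional_render_inverted for the second pass; how the query result itself gets produced is a separate problem, and the draw functions are invented):

/* Sketch: emulate "if .. else" with two conditional rendering passes. */
glBeginConditionalRender(query, GL_QUERY_WAIT);
draw_latest_content();        /* runs only if the query passed */
glEndConditionalRender();

glBeginConditionalRender(query, GL_QUERY_WAIT_INVERTED);
draw_previous_content();      /* runs only if it did not */
glEndConditionalRender();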
Marek
On 2021-06-01 3:01 p.m., Marek Olšák wrote:
EGLStreams are a red herring here. Wayland has atomic state transactions, similar to the atomic KMS API. These semantics could be achieved even with EGLStreams, at least via additional EGL extensions.
Any fancy technique we're discussing here would have to be completely between the Wayland compositor and the kernel, transparent to anything else.
There is no such extension for sync objects, but it can be done with queries, like occlusion queries. There is also no timeout option, and it can only do "if" and "if not", but not "if .. else".
The Wayland compositor would also need to know which texture the shader ended up sampling from.
Hi Christian,
On Tue, 1 Jun 2021 at 13:51, Christian König ckoenig.leichtzumerken@gmail.com wrote:
Well, not changing compositors is certainly not something I would try with this use case.
Not changing compositors is more like: OK, we have Ubuntu 20.04 and need to support that with the newest hardware generation.
Serious question, have you talked to Canonical?
I mean there's a hell of a lot of effort being expended here, but it seems to all be predicated on the assumption that Ubuntu's LTS HWE/backport policy is totally immutable, and that we might need to make the kernel do backflips to work around that. But ... is it? Has anyone actually asked them how they feel about this?
I mean, my answer to the first email is 'no, absolutely not' from the technical perspective (the initial proposal totally breaks current and future userspace), from a design perspective (it breaks a lot of usecases which aren't single-vendor GPU+display+codec, or aren't just a simple desktop), and from a sustainability perspective (cutting Android adrift again isn't acceptable collateral damage to make it easier to backport things to last year's Ubuntu release).
But then again, I don't even know what I'm NAKing here ...? The original email just lists a proposal to break a ton of things, with proposed replacements which aren't technically viable, and it's not clear why. Can we please have some more details and some reasoning behind them?
I don't mind that userspace (compositor, protocols, clients like Mesa as well as codec APIs) need to do a lot of work to support the new model. I do really care though that the hard-binary-switch model works fine enough for AMD but totally breaks heterogeneous systems and makes it impossible for userspace to do the right thing.
Cheers, Daniel
Yes, we can't break anything, because we don't want to complicate things for ourselves. It's pretty much all NAK'd already. We are trying to gather more knowledge and then make better decisions.
The idea we are considering is that we'll expose memory-based sync objects to userspace for read only, and the kernel or hw will strictly control the memory writes to those sync objects. The hole in that idea is that userspace can decide not to signal a job, so even if userspace can't overwrite memory-based sync object states arbitrarily, it can still decide not to signal them, and then a future fence is born.
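A sketch of how the kernel could enforce that read-only mapping with standard mmap-handler techniques (syncpage_pfn is invented; the vm_flags manipulation and remap_pfn_range are existing kernel mechanisms):

/* Sketch: map the sync-object page to userspace read-only, so only
 * the kernel/hw can advance the value.  Clearing VM_MAYWRITE also
 * blocks a later mprotect(PROT_WRITE). */
static int syncpage_mmap(struct file *file, struct vm_area_struct *vma)
{
        vma->vm_flags &= ~(VM_WRITE | VM_MAYWRITE);
        return remap_pfn_range(vma, vma->vm_start, syncpage_pfn,
                               PAGE_SIZE, vma->vm_page_prot);
}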
Marek
On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák maraeo@gmail.com wrote:
[SNIP]
This would actually be treated as a GPU hang caused by that context, so it should be fine.
Marek
On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
[SNIP]
This would actually be treated as a GPU hang caused by that context, so it should be fine.
This is practically what I proposed already, except you're not doing it with dma_fence. And on the memory fence side this also doesn't actually give what you want for that compute model.
This seems like a bit of a worst-of-both-worlds approach to me? Tons of work in the kernel to hide these not-dma_fence-but-almost fences, and still pain to actually drive the hardware like it should be for compute or direct display.
Also maybe I've missed it, but I didn't see any replies to my suggestion of how to fake the entire dma_fence stuff on top of new hw. Would be interesting to know what doesn't work there instead of amd folks going off into internal again and then coming back with another rfc out of nowhere :-) -Daniel
Am 02.06.21 um 20:48 schrieb Daniel Vetter:
[SNIP]
Well to be honest I would just push back on our hardware/firmware guys that we need to keep kernel queues forever before going down that route.
That syncfile and all that Android stuff isn't working out of the box with the new shiny user queue submission model (which in turn is mostly because of Windows) already raised some eyebrows here.
Christian.
On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
[SNIP]
Well to be honest I would just push back on our hardware/firmware guys that we need to keep kernel queues forever before going down that route.
I looked again, and you said the model won't work because preemption is way too slow, even when the context is idle.
I guess at that point I got maybe too fed up and just figured "not my problem", but if preempt is too slow as the unload fence, you can do it with pte removal and tlb shootdown too (that is hopefully not too slow, otherwise your hw is just garbage and won't even be fast for direct submit compute workloads).
The only thing that you need to do when you use pte clearing + tlb shootdown instead of preemption as the unload fence for buffers that get moved is that if you get any gpu page fault, you don't serve that, but instead treat it as a tdr and shoot the context permanently.
So summarizing the model I proposed:
- you allow userspace to directly write into the ringbuffer, and also write the fences directly
- actual submit is done by the kernel, using drm/scheduler. The kernel blindly trusts userspace to set up everything else, and even just wraps dma_fences around the userspace memory fences.
- the only check is tdr. If a fence doesn't complete, a tdr fires, a) the kernel shoots the entire context and b) userspace recovers by setting up a new ringbuffer
- memory management is done using ttm only, you still need to supply the buffer list (ofc that list includes the always present ones, so CS will only get the list of special buffers like today). If your hw can't turn off gpu page faults and you ever get one, we pull up the same old solution: the kernel shoots the entire context.
The important thing is that from the gpu pov memory management works exactly like a compute workload with direct submit, except that you just terminate the context on _any_ page fault, instead of only those that go somewhere where there's really no mapping, and repairing the others.
Also I guess from reading the old thread this means you'd disable page fault retry because that is apparently also way too slow for anything.
- memory management uses an unload fence. That unload fence waits for all userspace memory fences (represented as dma_fence) to complete, with maybe some fudge to busy-spin until we've reached the actual end of the ringbuffer (maybe you have an IB tail there after the memory fence write, we have that on intel hw), and it waits for the memory to get "unloaded". This is either preemption, or pte clearing + tlb shootdown, or whatever else your hw provides which is a) used for dynamic memory management b) fast enough for actual memory management.
- any time a context dies we force-complete all its pending fences, in-order ofc
So from the hw pov this looks 99% like direct userspace submit, with the exact same mappings, command sequences and everything else. The only difference is that the ringbuffer head/tail updates happen from drm/scheduler, instead of directly from userspace.
None of this stuff needs funny tricks where the kernel controls the writes to memory fences, or where you need kernel ringbuffers, or anything like this. Userspace is allowed to do anything stupid, the rules are guaranteed with:
- we rely on the hw isolation features to work, but _exactly_ like compute direct submit would too
- dying on any page fault captures memory management issues
- dying (without kernel recover, this is up to userspace if it cares) on any tdr makes sure fences complete still
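To make the "kernel just wraps dma_fences around the userspace memory fences" step above concrete, here is a rough sketch against today's dma_fence_ops (the umem_* names and the CPU-mapped seqno are hypothetical; on a tdr the driver would dma_fence_signal() everything in order when it shoots the context):

    #include <linux/dma-fence.h>

    /* Sketch: a dma_fence whose state lives in userspace-written memory.
     * The kernel never writes the seqno itself; it only peeks at it. */
    struct umem_fence {
        struct dma_fence base;
        u64 *cpu_addr;    /* kernel mapping of the userspace fence value */
        u64 wait_value;
    };

    static const char *umem_get_driver_name(struct dma_fence *f)
    {
        return "umem-sketch";
    }

    static const char *umem_get_timeline_name(struct dma_fence *f)
    {
        return "umem-sketch-timeline";
    }

    /* The only "check" is reading the memory; no trust in userspace
     * needed. If this never becomes true, the tdr fires. */
    static bool umem_fence_signaled(struct dma_fence *f)
    {
        struct umem_fence *uf = container_of(f, struct umem_fence, base);

        return READ_ONCE(*uf->cpu_addr) >= uf->wait_value;
    }

    static const struct dma_fence_ops umem_fence_ops = {
        .get_driver_name   = umem_get_driver_name,
        .get_timeline_name = umem_get_timeline_name,
        .signaled          = umem_fence_signaled,
    };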
That syncfile and all that Android stuff isn't working out of the box with the new shiny user queue submission model (which in turn is mostly because of Windows) already raised some eyebrows here.
I think if you really want to make sure the current linux stack doesn't break the _only_ option you have is provide a ctx mode that allows dma_fence and drm/scheduler to be used like today.
For everything else it sounds like you're a few years too late, because even just huge kernel changes won't happen in time. Much less rewriting userspace protocols. -Daniel
Am 02.06.21 um 21:19 schrieb Daniel Vetter:
[SNIP]
Have you seen that one here: https://www.spinics.net/lists/amd-gfx/msg63101.html :)
I've rejected it because I think polling for 6 seconds on a TLB flush which can block interrupts as well is just madness.
[SNIP]
I think if you really want to make sure the current linux stack doesn't break the _only_ option you have is provide a ctx mode that allows dma_fence and drm/scheduler to be used like today.
Yeah, but I can still just tell our hw/fw guys that we really really need to keep kernel queues, or the whole Linux/Android infrastructure needs to get a compatibility layer like you describe above.
For everything else it sounds like you're a few years too late, because even just huge kernel changes won't happen in time. Much less rewriting userspace protocols.
Seconded. The question is rather whether we are going to start migrating at some point, or if we should keep pushing on our hw/fw guys.
Christian.
On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
[SNIP]
Have you seen that one here: https://www.spinics.net/lists/amd-gfx/msg63101.html :)
I've rejected it because I think polling for 6 seconds on a TLB flush which can block interrupts as well is just madness.
Hm but I thought you had like 2 tlb flush modes, the shitty one (with retrying page faults) and the not so shitty one?
But yeah at that point I think you just have to bite one of the bullets.
The thing is, with the hmm/userspace memory fence model this will be even worse, because you will _have_ to do this tlb flush deep down in core mm functions, so this is going to be userptr, but worse.
With the dma_resv/dma_fence bo memory management model you can at least wrap that tlb flush into a dma_fence and push the waiting/pinging onto a separate thread or something like that. If the hw really is that slow.
Somewhat aside: Personally I think that sriov needs to move over to the compute model, i.e. indefinite timeouts, no tdr, because everything takes too long. At least looking around sriov timeouts tend to be 10x bare metal, across the board.
But for stuff like cloud gaming that's serious amounts of heavy lifting since it brings us right back "the entire linux/android 3d stack is built on top of dma_fence right now".
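A sketch of the "wrap that tlb flush into a dma_fence and push the waiting/pinging onto a separate thread" idea, assuming the heavyweight flush really can take seconds (hw_start_heavyweight_flush() and hw_flush_done() are hypothetical driver hooks; the rest is existing kernel API):

    #include <linux/dma-fence.h>
    #include <linux/workqueue.h>
    #include <linux/delay.h>
    #include <linux/slab.h>

    struct tlb_flush_work {
        struct work_struct work;
        struct dma_fence *flush_fence;  /* signalled when the flush is done */
        void *hw;                       /* hypothetical hw handle */
    };

    /* Runs on a workqueue, so nobody busy-waits with interrupts blocked. */
    static void tlb_flush_worker(struct work_struct *w)
    {
        struct tlb_flush_work *tw = container_of(w, struct tlb_flush_work, work);

        hw_start_heavyweight_flush(tw->hw);   /* hypothetical */
        while (!hw_flush_done(tw->hw))        /* hypothetical */
            usleep_range(100, 200);           /* ping, don't spin hard */

        dma_fence_signal(tw->flush_fence);
        dma_fence_put(tw->flush_fence);
        kfree(tw);
    }

The memory-management code would then treat flush_fence like any other unload fence.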
[SNIP]
So from what I'm hearing other hw might gain the sw compat layer too. Plus I'm hoping that with the sw compat layer it'd be easier to smooth over userspace to the new model (because there will be a long time where we have to support both, maybe even with a runtime switch from userspace memory fences to dma_fence kernel stuff).
But in the end it's up to you what makes more sense between sw work and hw/fw work involved. -Daniel
Am 04.06.21 um 10:57 schrieb Daniel Vetter:
[SNIP]
Hm but I thought you had like 2 tlb flush modes, the shitty one (with retrying page faults) and the not so shitty one?
Yeah, we call this the lightweight and the heavyweight tlb flush.
The lightweight can be used when you are sure that you don't have any of the PTEs currently in flight in the 3D/DMA engine and you just need to invalidate the TLB.
The heavyweight must be used when you need to invalidate the TLB *AND* make sure that no concurrent operation moves new stuff into the TLB.
The problem is for this use case we have to use the heavyweight one.
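In code the rule is basically this (hypothetical helper, illustration only): the cheap invalidate is safe only when no engine can still hold a translation that was handed out earlier.

    enum tlb_flush_kind { TLB_FLUSH_LIGHTWEIGHT, TLB_FLUSH_HEAVYWEIGHT };

    /* Moving/unmapping memory behind a running queue always needs the
     * broadcast variant that waits until nothing uses the old entry. */
    static enum tlb_flush_kind pick_tlb_flush(_Bool ptes_maybe_in_flight)
    {
        return ptes_maybe_in_flight ? TLB_FLUSH_HEAVYWEIGHT
                                    : TLB_FLUSH_LIGHTWEIGHT;
    }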
But yeah at that point I think you just have to bite one of the bullets.
Yeah, completely agree. We can choose which way we want to die, but it's certainly not going to be nice whatever we do.
[SNIP]
So from what I'm hearing other hw might gain the sw compat layer too. Plus I'm hoping that with the sw compat layer it'd be easier to smooth over userspace to the new model (because there will be a long time where we have to support both, maybe even with a runtime switch from userspace memory fences to dma_fence kernel stuff).
But in the end it's up to you what makes more sense between sw work and hw/fw work involved.
I'm currently toying a bit with the idea of implementing the HW scheduling on the CPU.
The only obstacle that I can really see is that we might need to unmap a page in the CPU page table when a queue becomes idle.
But apart from that it would give us the required functionality, just without the hardware scheduler.
Christian.
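For the "unmap a page when a queue becomes idle" part, the kernel side could look roughly like this (struct user_queue and the parking policy are made up; unmap_mapping_range() is the existing primitive for zapping user PTEs of a mapping):

    #include <linux/mm.h>

    struct user_queue {
        struct address_space *doorbell_mapping;
        loff_t doorbell_offset;
        bool parked;
    };

    /* Zap the userspace PTE of the doorbell page; the next submission
     * faults into the kernel, which can resume CPU-side scheduling. */
    static void user_queue_park(struct user_queue *q)
    {
        unmap_mapping_range(q->doorbell_mapping, q->doorbell_offset,
                            PAGE_SIZE, 1);
        q->parked = true;
    }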
On Fri, Jun 04, 2021 at 01:27:15PM +0200, Christian König wrote:
[SNIP]
The problem is for this use case we have to use the heavyweight one.
Just for my own curiosity: So the lightweight flush is only for in-between CS when you know access is idle? Or does that also not work if userspace has a CS on a dma engine going at the same time because the TLBs aren't isolated enough between engines? -Daniel
[SNIP]
Am 09.06.21 um 15:19 schrieb Daniel Vetter:
[SNIP]
Just for my own curiosity: So the lightweight flush is only for in-between CS when you know access is idle? Or does that also not work if userspace has a CS on a dma engine going at the same time because the tlb aren't isolated enough between engines?
More or less correct, yes.
The problem is a lightweight flush only invalidates the TLB, but doesn't take care of entries which have been handed out to the different engines.
In other words what can happen is the following:
1. Shader asks TLB to resolve address X.
2. TLB looks into its cache and can't find address X so it asks the walker to resolve.
3. Walker comes back with result for address X and TLB puts that into its cache and gives it to Shader.
4. Shader starts doing some operation using result for address X.
5. You send lightweight TLB invalidate and TLB throws away cached values for address X.
6. Shader happily still uses whatever the TLB gave to it in step 3 to access address X.
See it like the shader has its own 1-entry L0 TLB cache which is not affected by the lightweight flush.
The heavyweight flush on the other hand sends out a broadcast signal to everybody and only comes back when we are sure that an address is not in use any more.
Christian.
On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
[SNIP]
The heavyweight flush on the other hand sends out a broadcast signal to everybody and only comes back when we are sure that an address is not in use any more.
Ah, makes sense. On intel the shaders only operate in VA, everything goes around as explicit async messages to IO blocks. So we don't have this; the only difference in tlb flushes is between a tlb flush in the IB and an mmio one which is independent of anything currently being executed on an engine. -Daniel
Hi Daniel,
We just talked about this whole topic internally and we came to the conclusion that the hardware needs to understand sync object handles and have high-level wait and signal operations in the command stream. Sync objects will be backed by memory, but they won't be readable or writable by processes directly. The hardware will log all accesses to sync objects and will send the log to the kernel periodically. The kernel will identify malicious behavior.
Example of a hardware command stream:

    ...
    ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is assigned by the kernel
    Draw();
    ImplicitSyncSignalWhenDone(syncObjHandle);
    ...
I'm afraid we have no other choice because of the TLB invalidation overhead.
Marek
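A sketch of what the kernel-side check on that access log could look like (plain C for readability; the entry layout, kill_context() and the "exactly last+1" policy are assumptions):

    #include <stdint.h>

    void kill_context(uint32_t vmid);   /* hypothetical */

    struct syncobj_state { uint64_t last_seq; };

    struct syncobj_log_entry {
        uint32_t handle;   /* index into the global syncobj table */
        uint32_t vmid;     /* which process did the access */
        uint64_t value;    /* value written by the signal command */
    };

    /* Every signal must advance the seqno by exactly one; anything else
     * (garbage, zero, going backwards) is treated as malicious. */
    static void check_syncobj_log(struct syncobj_state *objs,
                                  const struct syncobj_log_entry *log, int n)
    {
        for (int i = 0; i < n; i++) {
            struct syncobj_state *so = &objs[log[i].handle];

            if (log[i].value == so->last_seq + 1)
                so->last_seq = log[i].value;
            else
                kill_context(log[i].vmid);
        }
    }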
On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter daniel@ffwll.ch wrote:
[SNIP]
Hi guys,
maybe soften that a bit. Reading from the shared memory of the user fence is ok for everybody. What we need to take more care of is the writing side.
So my current thinking is that we allow read only access, but writing a new sequence value needs to go through the scheduler/kernel.
So when the CPU wants to signal a timeline fence it needs to call an IOCTL. When the GPU wants to signal the timeline fence it needs to hand that off to the hardware scheduler.
If we lock up, the kernel can check with the hardware who did the last write and what value was written.
That together with an IOCTL to give out sequence numbers for implicit sync to applications should be sufficient for the kernel to track who is responsible if something bad happens.
In other words when the hardware says that the shader wrote stuff like 0xdeadbeef 0x0 or 0xffffffff into memory we kill the process who did that.
If the hardware says that seq - 1 was written fine, but seq is missing then the kernel blames whoever was supposed to write seq.
Just piping the write through a privileged instance should be fine to make sure that we don't run into issues.
Christian.
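The blame rule from the last two paragraphs, written out (hypothetical names; the "expected writer" is whoever the kernel handed the sequence number to):

    #include <stdint.h>

    void kill_context(uint32_t vmid);   /* hypothetical */

    /* On a lockup, ask the hw who wrote last and what value was written. */
    static void assign_blame(uint64_t expected_seq,
                             uint32_t expected_writer_vmid,
                             uint64_t last_value, uint32_t last_writer_vmid)
    {
        if (last_value == expected_seq - 1)
            kill_context(expected_writer_vmid);  /* seq was never written */
        else if (last_value != expected_seq)
            kill_context(last_writer_vmid);      /* garbage was written */
        /* else: seq arrived fine, the lockup is somewhere else */
    }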
Am 10.06.21 um 17:59 schrieb Marek Olšák:
[SNIP]
The call to the hw scheduler has a limitation on the size of all parameters combined. I think we can only pass a 32-bit sequence number and a ~16-bit global (per-GPU) syncobj handle in one call and not much else.
The syncobj handle can be an element index in a global (per-GPU) syncobj table and it's read-only for all processes with the exception of the signal command. Syncobjs can either have per-VMID write access flags for the signal command (slow), or any process can write to any syncobj and only rely on the kernel checking the write log (fast).
In any case, we can execute the memory write in the queue engine and only use the hw scheduler for logging, which would be perfect.
Marek
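Purely illustrative: the ~48 bits mentioned above (a ~16-bit handle plus a 32-bit sequence number) fit in one 64-bit scheduler argument.

    #include <stdint.h>

    static inline uint64_t pack_signal_arg(uint16_t handle, uint32_t seqno)
    {
        return ((uint64_t)handle << 32) | seqno;
    }

    static inline uint16_t signal_arg_handle(uint64_t arg)
    {
        return (uint16_t)(arg >> 32);
    }

    static inline uint32_t signal_arg_seqno(uint64_t arg)
    {
        return (uint32_t)arg;
    }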
On Thu, Jun 10, 2021 at 12:33 PM Christian König < ckoenig.leichtzumerken@gmail.com> wrote:
Hi guys,
maybe soften that a bit. Reading from the shared memory of the user fence is ok for everybody. What we need to take more care of is the writing side.
So my current thinking is that we allow read only access, but writing a new sequence value needs to go through the scheduler/kernel.
So when the CPU wants to signal a timeline fence it needs to call an IOCTL. When the GPU wants to signal the timeline fence it needs to hand that of to the hardware scheduler.
If we lockup the kernel can check with the hardware who did the last write and what value was written.
That together with an IOCTL to give out sequence number for implicit sync to applications should be sufficient for the kernel to track who is responsible if something bad happens.
In other words when the hardware says that the shader wrote stuff like 0xdeadbeef 0x0 or 0xffffffff into memory we kill the process who did that.
If the hardware says that seq - 1 was written fine, but seq is missing then the kernel blames whoever was supposed to write seq.
Just pieping the write through a privileged instance should be fine to make sure that we don't run into issues.
Christian.
Am 10.06.21 um 17:59 schrieb Marek Olšák:
Hi Daniel,
We just talked about this whole topic internally and we came up to the conclusion that the hardware needs to understand sync object handles and have high-level wait and signal operations in the command stream. Sync objects will be backed by memory, but they won't be readable or writable by processes directly. The hardware will log all accesses to sync objects and will send the log to the kernel periodically. The kernel will identify malicious behavior.
Example of a hardware command stream: ... ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is assigned by the kernel Draw(); ImplicitSyncSignalWhenDone(syncObjHandle); ...
I'm afraid we have no other choice because of the TLB invalidation overhead.
Marek
On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter daniel@ffwll.ch wrote:
On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
Am 09.06.21 um 15:19 schrieb Daniel Vetter:
[SNIP]
Yeah, we call this the lightweight and the heavyweight TLB flush.
The lightweight one can be used when you are sure that you don't have any of the PTEs currently in flight in the 3D/DMA engine and you just need to invalidate the TLB.
The heavyweight one must be used when you need to invalidate the TLB *AND* make sure that no concurrent operation moves new stuff into the TLB.
The problem is that for this use case we have to use the heavyweight one.
Just for my own curiosity: So the lightweight flush is only for in-between CS when you know access is idle? Or does that also not work if userspace has a CS on a DMA engine going at the same time because the TLBs aren't isolated enough between engines?
More or less correct, yes.
The problem is that a lightweight flush only invalidates the TLB, but doesn't take care of entries which have been handed out to the different engines.
In other words, what can happen is the following:
1. Shader asks the TLB to resolve address X.
2. TLB looks into its cache, can't find address X, so it asks the walker to resolve it.
3. Walker comes back with the result for address X, and the TLB puts it into its cache and gives it to the shader.
4. Shader starts doing some operation using the result for address X.
5. You send a lightweight TLB invalidate, and the TLB throws away the cached values for address X.
6. Shader happily still uses whatever the TLB gave it in step 3 to access address X.
See it like this: the shader has its own 1-entry L0 TLB cache which is not affected by the lightweight flush.
The heavyweight flush, on the other hand, sends out a broadcast signal to everybody and only comes back when we are sure that an address is not in use any more.
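As a hedged illustration of the distinction (invented names, not a real driver interface), the decision rule boils down to this:

    /* Sketch of the flush-type choice described above. */
    #include <stdbool.h>

    enum tlb_flush_kind { TLB_FLUSH_LIGHT, TLB_FLUSH_HEAVY };

    static enum tlb_flush_kind pick_tlb_flush(bool ptes_possibly_in_flight)
    {
            /* Lightweight only invalidates the TLB cache itself; a result
             * already handed to a shader (the "1-entry L0") survives it. */
            if (!ptes_possibly_in_flight)
                    return TLB_FLUSH_LIGHT;

            /* Heavyweight broadcasts and waits until no engine still uses
             * the address, so it is required when unmapping memory that
             * could still be accessed concurrently. */
            return TLB_FLUSH_HEAVY;
    }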
Ah, makes sense. On Intel the shaders only operate in VA; everything goes around as explicit async messages to IO blocks. So we don't have this. The only difference in TLB flushes is between a TLB flush in the IB and an MMIO one, which is independent of anything currently being executed on an engine.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
As long as we can figure out who touched a certain sync object last, that would indeed work, yes.
Christian.
On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
As long as we can figure out who touched a certain sync object last, that would indeed work, yes.
Don't you need to know who will touch it next, i.e. who is holding up your fence? Or maybe I'm just again totally confused. -Daniel
The kernel will know who should touch the implicit-sync semaphore next, and at the same time, a copy of all write requests to the implicit-sync semaphore will be forwarded to the kernel for monitoring and bo_wait.
Syncobjs could either use the same monitored access as implicit sync or be completely unmonitored. We haven't decided yet.
Syncfiles could either use one of the above or wait for a syncobj to go idle before converting to a syncfile.
Marek
On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
As long as we can figure out who touched a certain sync object last, that would indeed work, yes.
Don't you need to know who will touch it next, i.e. who is holding up your fence? Or maybe I'm just again totally confused. -Daniel
On Thu, Jun 17, 2021 at 02:28:06PM -0400, Marek Olšák wrote:
The kernel will know who should touch the implicit-sync semaphore next, and at the same time, a copy of all write requests to the implicit-sync semaphore will be forwarded to the kernel for monitoring and bo_wait.
Syncobjs could either use the same monitored access as implicit sync or be completely unmonitored. We haven't decided yet.
Syncfiles could either use one of the above or wait for a syncobj to go idle before converting to a syncfile.
Hm, this all sounds like you're planning to completely rewrap everything ... I'm assuming the plan is still that this is going to be largely wrapped in dma_fence? Maybe with timeline objects being a bit more optimized, but I'm not sure how much you can optimize without breaking the interfaces. -Daniel
Timeline semaphore waits (polling on memory) will be unmonitored and as fast as the round trip to memory. Semaphore writes will be slower because a copy of those write requests will also be forwarded to the kernel. Arbitrary writes are not protected by the hw, but the kernel will take action against such behavior because it will receive them too.
I don't know if that would work with dma_fence.
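A minimal sketch of that split, with invented names and CPU atomics standing in for the actual hardware packets; the forwarding call is hypothetical:

    /* Sketch of the unmonitored-wait / monitored-signal split above. */
    #include <stdatomic.h>
    #include <stdint.h>

    /* Wait side: unmonitored, just polls the semaphore in memory. */
    static void timeline_wait(_Atomic uint64_t *sem, uint64_t value)
    {
            while (atomic_load_explicit(sem, memory_order_acquire) < value)
                    ;  /* in real hw this would be a wait packet, not a CPU spin */
    }

    /* Signal side: the write lands in memory, and a copy would also be
     * forwarded to the kernel so it can flag hostile/arbitrary values. */
    static void timeline_signal(_Atomic uint64_t *sem, uint64_t value)
    {
            atomic_store_explicit(sem, value, memory_order_release);
            /* hypothetical: forward_write_to_kernel(sem, value); */
    }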
Marek
On Thu, Jun 17, 2021 at 3:04 PM Daniel Vetter <daniel@ffwll.ch> wrote:
Hm, this all sounds like you're planning to completely rewrap everything ... I'm assuming the plan is still that this is going to be largely wrapped in dma_fence? Maybe with timeline objects being a bit more optimized, but I'm not sure how much you can optimize without breaking the interfaces. -Daniel
On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
Yes, we can't break anything because we don't want to complicate things for us. It's pretty much all NAK'd already. We are trying to gather more knowledge and then make better decisions.
The idea we are considering is that we'll expose memory-based sync objects to userspace for read only, and the kernel or hw will strictly control the memory writes to those sync objects. The hole in that idea is that userspace can decide not to signal a job, so even if userspace can't overwrite memory-based sync object states arbitrarily, it can still decide not to signal them, and then a future fence is born.
This would actually be treated as a GPU hang caused by that context, so it should be fine.
This is practically what I proposed already, except you're not doing it with dma_fence. And on the memory fence side this also doesn't actually give what you want for that compute model.
This seems like a bit of a worst-of-both-worlds approach to me? Tons of work in the kernel to hide these not-dma_fence-but-almost fences, and still a pain to actually drive the hardware like it should be for compute or direct display.
Also, maybe I've missed it, but I didn't see any replies to my suggestion of how to fake the entire dma_fence stuff on top of new hw. Would be interesting to know what doesn't work there instead of amd folks going off into internal again and then coming back with another rfc from out of nowhere :-)
Going internal again is probably a good idea to spare you the long discussions and not waste your time, but we haven't talked about the dma_fence stuff internally other than acknowledging that it can be solved.
The compute use case already uses the hw as-is with no inter-process sharing, which mostly keeps the kernel out of the picture. It uses glFinish to sync with GL.
The gfx use case needs new hardware logic to support implicit and explicit sync. When we propose a solution, we usually tear it apart ourselves the next day.
Since we are talking about next hw or next next hw, preemption should be better.
user queue = user-mapped ring buffer
For implicit sync, we will only let userspace lock access to a buffer via a user queue, which waits for the per-buffer sequence counter in memory to be >= the number assigned by the kernel, and later unlock the access with another command, which increments the per-buffer sequence counter in memory with atomic_inc regardless of the number assigned by the kernel. The kernel counter and the counter in memory can be out of sync, and I'll explain why that's OK. If a process increments the kernel counter but not the memory counter, that's its problem and it's the same as a GPU hang caused by that process. If a process increments the memory counter but not the kernel counter, the ">=" condition alongside atomic_inc guarantees that signaling n will signal n+1, so it will never deadlock, but it will also effectively disable synchronization. This method of disabling synchronization is similar to a process corrupting the buffer, which should be fine. Can you find any flaw in it? I can't find any.
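To make that concrete, a minimal sketch of the scheme, with invented names and CPU atomics standing in for user-queue commands; the real mechanism would be hardware wait/signal packets:

    /* Sketch of the >= wait + unconditional-increment unlock above:
     * a skipped or extra signal can only disable synchronization for
     * this buffer, it can never deadlock a waiter. */
    #include <stdatomic.h>
    #include <stdint.h>

    struct buffer_sync {
            _Atomic uint64_t mem_counter;  /* per-buffer sequence counter in memory */
    };

    /* "Lock": wait until the counter reaches the kernel-assigned number. */
    static void implicit_sync_wait(struct buffer_sync *b, uint64_t kernel_seq)
    {
            while (atomic_load_explicit(&b->mem_counter,
                                        memory_order_acquire) < kernel_seq)
                    ;  /* hw would use a wait command, not a CPU spin */
    }

    /* "Unlock": increment unconditionally, ignoring the kernel's number.
     * Because the wait uses >=, signaling n also satisfies waits for n-1. */
    static void implicit_sync_signal(struct buffer_sync *b)
    {
            atomic_fetch_add_explicit(&b->mem_counter, 1,
                                      memory_order_release);
    }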
The explicit submit can be done by userspace (if there is no synchronization), but we plan to use the kernel to do it for implicit sync. Essentially, the kernel will receive a buffer list and addresses of wait commands in the user queue. It will assign new sequence numbers to all buffers and write those numbers into the wait commands, and ring the hw doorbell to start execution of that queue.
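And a hedged sketch of the kernel-side submit path just described; every name and shape here is invented for illustration:

    /* Sketch: the kernel assigns fresh sequence numbers, patches them
     * into the wait commands inside the user-mapped queue, then rings
     * the hw doorbell. The helpers are assumed, not a real API. */
    #include <stddef.h>
    #include <stdint.h>

    struct submit_desc {
            uint32_t *wait_cmd_seq_slots[64]; /* addresses of wait commands in the user queue */
            uint64_t  buffer_handles[64];     /* buffer list to assign new numbers for */
            size_t    count;
    };

    uint32_t assign_next_seq(uint64_t buffer_handle);  /* assumed helper */
    void ring_doorbell(void);                          /* assumed helper */

    static void kernel_implicit_sync_submit(struct submit_desc *d)
    {
            for (size_t i = 0; i < d->count; i++) {
                    /* new per-buffer sequence number, tracked by the kernel */
                    uint32_t seq = assign_next_seq(d->buffer_handles[i]);
                    /* patch it into the wait command in the user queue */
                    *d->wait_cmd_seq_slots[i] = seq;
            }
            ring_doorbell();  /* start execution of the queue */
    }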
Marek
On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
Hm, maybe I misunderstood what exactly you wanted to do earlier. That kind of "we let userspace free-wheel whatever it wants, and the kernel ensures correctness of the resulting chain of dma_fence by resetting the entire context" is what I proposed too.
Like you say, userspace is allowed to render garbage already.
The explicit submit can be done by userspace (if there is no synchronization), but we plan to use the kernel to do it for implicit sync. Essentially, the kernel will receive a buffer list and addresses of wait commands in the user queue. It will assign new sequence numbers to all buffers and write those numbers into the wait commands, and ring the hw doorbell to start execution of that queue.
Yeah, for implicit sync I think using the kernel and drm/scheduler to sort out the dma_fence dependencies is probably best. Since you can filter out which dma_fence you hand to the scheduler for dependency tracking, you can filter out your own ones and let the hw handle those directly (depending on how much your hw can do and all that). On i915 we might do that to be able to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.
For buffer tracking with implicit sync I think cleanest is probably to still keep them wrapped as dma_fence and stuffed into dma_resv, but conceptually it's the same. If we let every driver reinvent their own buffer tracking just because the hw works a bit different it'll be a mess.
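As a hedged sketch of what wrapping such a memory fence in a dma_fence might look like (loosely following the kernel's dma_fence API; the mem_fence type and its fields are invented, and a real driver would need more, e.g. enable_signaling and proper dma_fence_init() on creation):

    #include <linux/dma-fence.h>
    #include <linux/spinlock.h>

    struct mem_fence {
            struct dma_fence base;
            spinlock_t lock;     /* would be passed to dma_fence_init() */
            u64 *counter;        /* memory timeline the hardware writes */
            u64 wait_value;      /* signalled once *counter >= wait_value */
    };

    static const char *mem_fence_driver_name(struct dma_fence *f)
    {
            return "mem-fence-sketch";
    }

    static const char *mem_fence_timeline_name(struct dma_fence *f)
    {
            return "mem-timeline-sketch";
    }

    static bool mem_fence_signaled(struct dma_fence *f)
    {
            struct mem_fence *mf = container_of(f, struct mem_fence, base);

            /* poll the memory timeline; the hw only ever moves it forward */
            return READ_ONCE(*mf->counter) >= mf->wait_value;
    }

    static const struct dma_fence_ops mem_fence_ops = {
            .get_driver_name   = mem_fence_driver_name,
            .get_timeline_name = mem_fence_timeline_name,
            .signaled          = mem_fence_signaled,
            /* a real implementation would also provide enable_signaling,
             * e.g. arming an interrupt or a timer-based poll */
    };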
Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets to keep the pieces if it gets it wrong. You do still need to handle external dma_fence though, hence the drm/scheduler frontend to sort these out. -Daniel
On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter <daniel@ffwll.ch> wrote:
Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets to keep the pieces if it gets it wrong. You do still need to handle external dma_fence though, hence the drm/scheduler frontend to sort these out.
The reason is to prevent a lower-privileged process from deadlocking/hanging a higher-privileged process where the kernel can't tell who did it. If the implicit-sync sequence counter is read-only to userspace and only incrementable by the unlock-signal command after the lock-wait command appeared in the same queue (both together forming a critical section), userspace can't manipulate it arbitrarily, and we get almost exactly the same behavior as implicit sync has today. That means any implicitly-sync'd buffer from any process can be fully trusted by a compositor to signal in a finite time, and possibly even trusted by the kernel. The only thing that's different is that a malicious process can disable implicit sync for a buffer in all processes/kernel, but it can't hang other processes/kernel (it can only hang itself, and the kernel will be notified). So I'm a happy panda now. :)
Marek
On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
Yeah, I think that's not going to work too well; it's too many piled-up hacks. Within a drm_file fd you can do whatever you feel like, since it's just one client.
But once implicit sync kicks in I think you need to go with dma_fence and drm/scheduler to handle the dependencies, and tdr kicking it. With the dma_fence you do know who's the offender - you might not know why, but that doesn't matter, you just shred the entire context and let that userspace figure out the details.
I think trying to make memory fences work as implicit sync directly, without wrapping them in a dma_fence and assorted guarantees, will just not work.
And once you do wrap them in dma_fence, then all the other problems go away: cross-driver sync, syncfiles, ... So I really don't see the benefit of this half-way approach.
Yes there's going to be a tad bit of overhead, but that's already there in the current model. And it can't hurt to have a bit of motivation for compositors to switch over to userspace memory fences properly. -Daniel
On Thu, Jun 3, 2021, 06:03 Daniel Vetter <daniel@ffwll.ch> wrote:
On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter daniel@ffwll.ch wrote:
On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter daniel@ffwll.ch
wrote:
On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák maraeo@gmail.com
wrote:
> Yes, we can't break anything because we don't want to
complicate
things
> for us. It's pretty much all NAK'd already. We are trying to
gather
more
> knowledge and then make better decisions. > > The idea we are considering is that we'll expose memory-based
sync
objects
> to userspace for read only, and the kernel or hw will strictly
control
the
> memory writes to those sync objects. The hole in that idea is
that
> userspace can decide not to signal a job, so even if userspace
can't
> overwrite memory-based sync object states arbitrarily, it can
still
decide
> not to signal them, and then a future fence is born. >
This would actually be treated as a GPU hang caused by that
context,
so
it
should be fine.
This is practically what I proposed already, except your not doing
it
with
dma_fence. And on the memory fence side this also doesn't actually
give
what you want for that compute model.
This seems like a bit a worst of both worlds approach to me? Tons
of
work
in the kernel to hide these not-dma_fence-but-almost, and still
pain to
actually drive the hardware like it should be for compute or direct display.
Also maybe I've missed it, but I didn't see any replies to my
suggestion
how to fake the entire dma_fence stuff on top of new hw. Would be interesting to know what doesn't work there instead of amd folks
going
of
into internal again and then coming back with another rfc from out
of
nowhere :-)
Going internal again is probably a good idea to spare you the long discussions and not waste your time, but we haven't talked about the dma_fence stuff internally other than acknowledging that it can be
solved.
The compute use case already uses the hw as-is with no inter-process sharing, which mostly keeps the kernel out of the picture. It uses
glFinish
to sync with GL.
The gfx use case needs new hardware logic to support implicit and
explicit
sync. When we propose a solution, it's usually torn apart the next
day by
ourselves.
Since we are talking about next hw or next next hw, preemption
should be
better.
user queue = user-mapped ring buffer
For implicit sync, we will only let userspace lock access to a buffer
via a
user queue, which waits for the per-buffer sequence counter in
memory to
be
= the number assigned by the kernel, and later unlock the access
with
another command, which increments the per-buffer sequence counter in
memory
with atomic_inc regardless of the number assigned by the kernel. The
kernel
counter and the counter in memory can be out-of-sync, and I'll
explain
why
it's OK. If a process increments the kernel counter but not the
memory
counter, that's its problem and it's the same as a GPU hang caused by
that
process. If a process increments the memory counter but not the
kernel
counter, the ">=" condition alongside atomic_inc guarantee that
signaling n
will signal n+1, so it will never deadlock but also it will
effectively
disable synchronization. This method of disabling synchronization is similar to a process corrupting the buffer, which should be fine.
Can you
find any flaw in it? I can't find any.
Hm maybe I misunderstood what exactly you wanted to do earlier. That
kind
of "we let userspace free-wheel whatever it wants, kernel ensures correctness of the resulting chain of dma_fence with reset the entire context" is what I proposed too.
Like you say, userspace is allowed to render garbage already.
The explicit submit can be done by userspace (if there is no synchronization), but we plan to use the kernel to do it for implicit sync. Essentially, the kernel will receive a buffer list and the addresses of wait commands in the user queue. It will assign new sequence numbers to all buffers, write those numbers into the wait commands, and ring the hw doorbell to start execution of that queue.
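A rough sketch of that kernel-side implicit-sync submission, with all names (user_queue, wait_cmd_patch, ring_doorbell) invented for illustration rather than taken from any existing driver API:

#include <stdint.h>

struct buffer {
	uint64_t kernel_counter;   /* kernel's view of the per-buffer sequence */
};

struct user_queue;                          /* user-mapped ring buffer (opaque) */
void ring_doorbell(struct user_queue *q);   /* assumed hw doorbell helper */

struct wait_cmd_patch {
	uint64_t *wait_cmd_seq;    /* points into a wait command in the user queue */
	struct buffer *buf;        /* buffer that wait command locks */
};

/* Assign fresh per-buffer sequence numbers, patch them into the queue's
 * wait commands, then kick the hw. */
static void submit_user_queue(struct user_queue *q,
			      struct wait_cmd_patch *patches, int n)
{
	for (int i = 0; i < n; i++) {
		uint64_t seq = ++patches[i].buf->kernel_counter;
		*patches[i].wait_cmd_seq = seq;
	}
	ring_doorbell(q);
}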
Yeah, for implicit sync I think the kernel and using drm/scheduler to sort out the dma_fence dependencies is probably best. Since you can filter which dma_fence you hand to the scheduler for dependency tracking, you can filter out your own ones and let the hw handle those directly (depending on how much your hw can do and all that). On i915 we might do that to be able to use the MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.

For buffer tracking with implicit sync I think the cleanest is probably to still keep them wrapped as dma_fence and stuffed into dma_resv, but conceptually it's the same. If we let every driver reinvent their own buffer tracking just because the hw works a bit differently, it'll be a mess.

Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets to keep the pieces if it gets it wrong. You do still need to handle external dma_fence though, hence the drm/scheduler frontend to sort these out.
The reason is to disallow a lower-privileged process deadlocking/hanging a higher-privileged process where the kernel can't tell who did it. If the implicit-sync sequence counter is read-only to userspace and only incrementable by the unlock-signal command after the lock-wait command appeared in the same queue (both together forming a critical section), userspace can't manipulate it arbitrarily and we get almost the exact same behavior as implicit sync has today. That means any implicitly-sync'd buffer from any process can be fully trusted by a compositor to signal in a finite time, and possibly even trusted by the kernel. The only thing that's different is that a malicious process can disable implicit sync for a buffer in all processes/kernel, but it can't hang other processes/kernel (it can only hang itself, and the kernel will be notified). So I'm a happy panda now. :)
Yeah I think that's not going to work too well, and is too many piled up hacks. Within a drm_file fd you can do whatever you feel like, since it's just one client.
But once implicit sync kicks in I think you need to go with dma_fence and drm/scheduler to handle the dependencies, with tdr kicking in. With the dma_fence you do know who's the offender - you might not know why, but that doesn't matter; you just shred the entire context and let that userspace figure out the details.
I think trying to make memory fences work as implicit sync directly, without wrapping them in a dma_fence and assorted guarantees, will just not work.
And once you do wrap them in dma_fence, then all the other problems go away: cross-driver sync, syncfiles, ... So I really don't see the benefit of this half-way approach.
Yes there's going to be a tad bit of overhead, but that's already there in the current model. And it can't hurt to have a bit of motivation for compositors to switch over to userspace memory fences properly.
Well, Christian thinks that we need a high level synchronization primitive in hw. I don't know myself and you may be right. A software scheduler with user queues might be one option. My part is only to find out how much of the scheduler logic can be moved to the hardware.
We plan to have memory timeline semaphores, or simply monotonic counters, and a fence will be represented by the counter address and a constant sequence number for the <= comparison. One counter can represent up to 2^64 different fences. Giving any process write access to a fence is the same as giving it the power to manipulate the signalled state of a sequence of up to 2^64 fences. That could mess up a lot of things. However, if the hardware had a high level synchronization primitive with access rights and a limited set of clearly defined operations such that we can formally prove whether it's safe for everybody, we could have a solution where we don't have to involve the software scheduler and just let the hardware do everything.
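As a minimal sketch of the fence representation described here - a counter address plus a constant sequence number, signalled once seq <= *counter - with all names illustrative rather than a proposed API:

#include <stdbool.h>
#include <stdint.h>

struct mem_fence {
	const volatile uint64_t *counter;  /* monotonic timeline semaphore in memory */
	uint64_t seq;                      /* constant for this fence */
};

static bool mem_fence_signalled(const struct mem_fence *f)
{
	/* a single 64-bit counter can stand in for up to 2^64 fences */
	return f->seq <= *f->counter;
}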
Marek
-Daniel
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák maraeo@gmail.com wrote:
[snip]
I don't think hw access rights control on memory fences makes sense. There are two cases:
- brave new world of native userspace memory fences. Currently that's compute, maybe direct display vk, hopefully/eventually compositors and desktops too. If you get an untrusted fence, you need to have fallback logic no matter what, and by design. vk is explicit in stating that if things hang, you get to keep all the pieces. So the compositor needs to _always_ treat userspace memory fences as hostile, wrap them in a timeout, and have a fallback frame/scene in its rendering path (see the sketch after this list). Probably same for the kernel on the display side, maybe down to the display hw picking the "right" frame depending upon the fence value right before scanout, as we discussed earlier. There's no point in hw access rights because, by design, even if no one tampers with your fence, it might never signal. So you have to cope with a hostile fence from untrusted sources anyway (and within an app it's trusted and you just die, as in stuck in an endless loop until someone sends a SIGKILL, when you deadlock or get it wrong some other way).
- old compat mode where we need to use dma_fence, otherwise we end up with another round of "amdgpu redefines implicit sync in incompatible ways", and Christian & me don't even know yet how to fix the current round without breaking use-cases badly. So it has to be dma_fence, and it has to be the same rules as on old hw, or it's just not going to work. This means you need to force in-order retiring of fences in the kernel, and you need to enforce timeouts. None of this needs hw access rights control, since once more it's just software constructs in the kernel. As a first approximation, drm/scheduler + the fence chain we already have in syncobj is probably good enough for this. E.g. if userspace rolls the fence backwards then the kernel just ignores that, because its internal dma_fence has signalled, and dma_fences never unsignal (it's a bit in struct dma_fence; once it's set we stop checking hw). And if it doesn't signal in time, then tdr jumps in and fixes the mess. Hw access control doesn't fix anything here, because you have to deal with timeout and ordering already anyway, or the dma_fence contract is broken.
So in both cases hw access control gains you nothing (at least I'm not seeing anything), it just makes the design more tricky. "Userspace can manipulate the fences" is _intentionally_ how these things work; we need a design that works with that hw design, not against it and somehow tries to get us back to the old world, but only halfway (i.e. not useful at all, since old userspace needs us to go all the way back to dma_fence, and new userspace wants to fully exploit userspace memory fences without artificial limitations for no reason).

-Daniel
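To make the first case above concrete, here is a rough sketch of a compositor wrapping an untrusted userspace memory fence in a timeout and falling back to the previous frame. The struct frame type, now_ns(), and the reuse of the hypothetical mem_fence helper from the earlier sketch are all assumptions, not any real compositor API:

#include <stdbool.h>
#include <stdint.h>

struct mem_fence {                         /* as in the earlier sketch */
	const volatile uint64_t *counter;
	uint64_t seq;
};

static bool mem_fence_signalled(const struct mem_fence *f)
{
	return f->seq <= *f->counter;
}

struct frame;                              /* opaque client frame */
uint64_t now_ns(void);                     /* assumed monotonic clock helper */

static struct frame *pick_frame(const struct mem_fence *f,
				struct frame *latest,
				struct frame *previous,
				uint64_t deadline_ns)
{
	while (now_ns() < deadline_ns) {
		if (mem_fence_signalled(f))
			return latest;     /* the client signalled in time */
	}
	/* hostile or stuck fence: show the last good frame and keep the
	 * desktop running at full rate */
	return previous;
}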
On Thu, Jun 3, 2021 at 7:53 PM Marek Olšák maraeo@gmail.com wrote:

Daniel, I think what you are suggesting is that we need to enable user queues with the drm scheduler and dma_fence first, and once that works, we can investigate how much of that kernel logic can be moved to the hw. Would that work? In theory it shouldn't matter whether the kernel does it or the hw does it. It's the same code, just in a different place.
Thanks, Marek
On Thu, Jun 3, 2021 at 7:53 PM Marek Olšák maraeo@gmail.com wrote:
[snip]
Yeah I guess that's another way to look at it. Maybe in practice you'll just move it from the kernel to userspace, which then programs the hw waits directly into its IB. That's at least how I'd do it on i915, assuming I'd have such hw. So these fences that userspace programs directly (to sync within itself) won't even show up as dependencies in the kernel.
And then yes, on the other side you can lift work from the drm/scheduler wrt dependencies you get in the kernel (whether explicit sync with sync_file, or implicit sync fished out of dma_resv) and program the hw directly that way. That would mean that userspace won't fill the ringbuffer directly, but the kernel would do that, so that you have space to stuff in the additional waits. Again assuming the i915 hw model; maybe it works differently on amd. IIRC we have some of that already in the i915 scheduler, but I'd need to recheck how much it really uses the hw semaphores.

-Daniel
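A sketch of the split described in this mail, with invented names throughout: userspace programs its internal waits straight into its IB, while the kernel writes the ring itself so it has room to prepend hw semaphore waits for the dependencies it collected (explicit sync_file or implicit sync from dma_resv):

#include <stdint.h>

struct ring;                               /* kernel-owned hw ring (opaque) */

struct dep {
	uint64_t sem_addr;                 /* hw semaphore / memory fence address */
	uint64_t value;                    /* value to wait for */
};

struct job {
	struct dep *deps;                  /* external deps resolved by the kernel */
	int num_deps;
	uint64_t ib_addr, ib_size;         /* userspace-built indirect buffer */
};

/* assumed ring-programming helpers */
void emit_sem_wait(struct ring *r, uint64_t addr, uint64_t value);
void emit_indirect_buffer(struct ring *r, uint64_t addr, uint64_t size);
void ring_doorbell(struct ring *r);

static void kernel_submit(struct ring *r, const struct job *job)
{
	/* waits for everything the kernel fished out of sync_file/dma_resv */
	for (int i = 0; i < job->num_deps; i++)
		emit_sem_wait(r, job->deps[i].sem_addr, job->deps[i].value);

	/* userspace-internal fences live inside the IB and never show up
	 * as kernel dependencies */
	emit_indirect_buffer(r, job->ib_addr, job->ib_size);

	ring_doorbell(r);
}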
On Thu., Jun. 3, 2021, 15:18 Daniel Vetter, daniel@ffwll.ch wrote:
[snip]
I was thinking we would pass per-process syncobj handles and buffer handles into commands in the user queue, or something equivalent. We do have a large degree of programmability in the hw, so we can do something like that. The question is whether this high-level user->hw interface would have any advantage over trivial polling on memory, etc. My impression is no, because the kernel would be robust enough that it wouldn't matter what userspace does, but I don't know. Anyway, all we need is user queues, and what you proposed seems totally sufficient.
Marek
On 02.06.21 at 10:57, Daniel Stone wrote:
Hi Christian,
On Tue, 1 Jun 2021 at 13:51, Christian König ckoenig.leichtzumerken@gmail.com wrote:
On 01.06.21 at 14:30, Daniel Vetter wrote:
If you want to enable this use-case with driver magic and without the compositor being aware of what's going on, the solution is EGLStreams. Not sure we want to go there, but it's definitely a lot more feasible than trying to stuff eglstreams semantics into dma-buf implicit fencing support in a desperate attempt to not change compositors.
Well not changing compositors is certainly not something I would try with this use case.
Not changing compositors is more like: ok, we have Ubuntu 20.04 and need to support that with the newest hardware generation.
Serious question, have you talked to Canonical?
I mean there's a hell of a lot of effort being expended here, but it seems to all be predicated on the assumption that Ubuntu's LTS HWE/backport policy is totally immutable, and that we might need to make the kernel do backflips to work around that. But ... is it? Has anyone actually asked them how they feel about this?
This was merely an example. What I wanted to say is that we need to support systems already deployed.
In other words our customers won't accept that they need to replace the compositor just because they switch to a new hardware generation.
I mean, my answer to the first email is 'no, absolutely not' from the technical perspective (the initial proposal totally breaks current and future userspace), from a design perspective (it breaks a lot of usecases which aren't single-vendor GPU+display+codec, or aren't just a simple desktop), and from a sustainability perspective (cutting Android adrift again isn't acceptable collateral damage to make it easier to backport things to last year's Ubuntu release).
But then again, I don't even know what I'm NAKing here ... ? The original email just lists a proposal to break a ton of things, with proposed replacements which aren't technically viable, and it's not clear why? Can we please have some more details and some reasoning behind them?
I don't mind that userspace (compositor, protocols, clients like Mesa as well as codec APIs) need to do a lot of work to support the new model. I do really care though that the hard-binary-switch model works fine enough for AMD but totally breaks heterogeneous systems and makes it impossible for userspace to do the right thing.
Well, how the handling for new Android, distributions etc. is going to look is a completely different story.
And I completely agree with both Daniel Vetter and you that we need to keep this in mind when designing the compatibility with older software.
For Android I'm really not sure what to do. In general Android is already trying to do the right thing by using explicit sync; the problem is that this is built around the idea that this explicit sync is syncfile kernel based.
Either we need to change Android and come up with something that works with user fences as well or we somehow invent a compatibility layer for syncfile as well.
Regards, Christian.
Cheers, Daniel
On Wed, Jun 2, 2021 at 5:44 AM Christian König <ckoenig.leichtzumerken@gmail.com> wrote:
[snip]
> Either we need to change Android and come up with something that works with user fences as well or we somehow invent a compatibility layer for syncfile as well.
What's the issue with syncfiles that syncobjs don't suffer from?
Marek
Am 02.06.21 um 11:58 schrieb Marek Olšák:
What's the issue with syncfiles that syncobjs don't suffer from?
Syncobjs were designed with future fences in mind. In other words, we already have the ability to wait for a future submission to appear, with all the nasty locking implications.
Syncfiles on the other hand are just containers for up to N kernel fences, and since we don't have kernel fences any more, that is rather tricky to keep working.
Going to look into the uAPI around syncfiles once more and see if we can somehow use that for user fences as well.
Christian.
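For reference, the "wait for a future submission to appear" semantic Christian describes is exposed today through the libdrm syncobj API via the WAIT_FOR_SUBMIT flag. A minimal sketch in C (the helper name is hypothetical):

#include <stddef.h>
#include <stdint.h>
#include <xf86drm.h>

/* Wait on a syncobj that may not have a fence attached yet:
 * WAIT_FOR_SUBMIT tells the kernel to first wait for a fence to be
 * submitted to the syncobj, then wait for that fence to signal. A
 * sync_file, by contrast, is a fixed container of already-created
 * fences and cannot express this. The timeout is an absolute
 * CLOCK_MONOTONIC value in nanoseconds. */
static int wait_future_fence(int drm_fd, uint32_t syncobj,
                             int64_t abs_timeout_nsec)
{
    return drmSyncobjWait(drm_fd, &syncobj, 1, abs_timeout_nsec,
                          DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                          NULL /* first_signaled, unused */);
}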
On 2021-06-01 2:10 p.m., Christian König wrote:
Am 01.06.21 um 12:49 schrieb Michel Dänzer:
On 2021-06-01 12:21 p.m., Christian König wrote:
Am 01.06.21 um 11:02 schrieb Michel Dänzer:
On 2021-05-27 11:51 p.m., Marek Olšák wrote:
- Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
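As an illustration of option a), here is a minimal sketch in C, assuming the standard dma-buf poll semantics (a dma-buf fd becomes readable once the writer's fences have signalled); the helper and struct names are hypothetical, not taken from any compositor:

#include <poll.h>
#include <stdbool.h>

/* Option a): a CPU wait with a small absolute timeout. POLLIN on a
 * dma-buf fd becomes ready once the exclusive/write fences on the
 * buffer have signalled. */
static bool buffer_idle_within(int dmabuf_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = dmabuf_fd, .events = POLLIN };
    return poll(&pfd, 1, timeout_ms) == 1 && (pfd.revents & POLLIN);
}

struct surface {
    int dmabuf_fd;
    void *latest;   /* most recently attached client buffer */
    void *previous; /* last buffer known to be fully rendered */
};

/* Called from the compositor's repaint cycle: if the client's fence
 * has not signalled within the timeout, display the previous content
 * instead of stalling the whole desktop. */
static void *pick_content(struct surface *s)
{
    return buffer_idle_within(s->dmabuf_fd, 2 /* ms */)
               ? s->latest : s->previous;
}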
FWIW, this is working with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880, even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers, and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).
Yeah, that is really nice to have.
One question is whether you wait on the CPU or the GPU for the new surface to become available.
It's based on polling dma-buf fds, i.e. CPU.
The former is a bit bad for latency and power management.
There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
Well let me describe it like this:
We have use cases which require a guaranteed refresh rate of 144 Hz. That essentially means the client application needs to be able to spit out one frame/window content every ~6.9 ms. That's tough, but doable.
When you now add 6 ms of latency in the compositor, the client application has only ~0.9 ms left for its frame, which is basically impossible.
You misunderstood me. 6 ms is the lowest possible end-to-end latency from client to scanout, but the client can start as early as it wants/needs to. It's a trade-off between latency and the risk of missing a scanout cycle.
Note that what I wrote above is about the case where the compositor needs to draw its own frame sampling from the client buffer. If your concern is about a fullscreen application for which the compositor can directly use the application buffers for scanout, it should be possible in theory to get the latency incurred by the compositor down to ~1 ms.
If that's too much[0], it could be improved further by adding atomic KMS API to replace a pending page flip with another one. Then the compositor could just directly submit a flip as soon as a new buffer becomes ready (or even as soon as the client submits it to the compositor, depending on how exactly the new KMS API works). Then the minimum latency should be mostly up to the kernel driver / HW.
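For context, today's nonblocking flip submission through the libdrm atomic API looks roughly like the sketch below (property-ID lookup omitted, names hypothetical). The piece the proposal above would add is a way to replace the request while an earlier commit on the same CRTC is still pending; currently a second nonblocking commit simply fails with EBUSY:

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Queue a flip to new_fb_id without blocking; the kernel delivers a
 * page-flip event on the DRM fd once the flip completes. */
static int submit_flip(int drm_fd, uint32_t plane_id,
                       uint32_t fb_id_prop, uint32_t new_fb_id,
                       void *user_data)
{
    drmModeAtomicReq *req = drmModeAtomicAlloc();
    if (!req)
        return -1;

    drmModeAtomicAddProperty(req, plane_id, fb_id_prop, new_fb_id);

    int ret = drmModeAtomicCommit(drm_fd, req,
                                  DRM_MODE_ATOMIC_NONBLOCK |
                                  DRM_MODE_PAGE_FLIP_EVENT,
                                  user_data);
    drmModeAtomicFree(req);
    return ret;
}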
Another possibility would be for the application to use KMS directly, e.g. via a DRM lease. That might still require the same new API to get the flip submission latency significantly below 1 ms though.
[0] Though I'm not sure how to reconcile that with "spitting out one frame every ~6.9ms is tough", as that means the theoretical minimum total client→scanout latency is ~7 ms (and missing a scanout cycle ~doubles the latency).
Hi,
On Tue, 1 Jun 2021 at 14:18, Michel Dänzer michel@daenzer.net wrote:
On 2021-06-01 2:10 p.m., Christian König wrote:
Am 01.06.21 um 12:49 schrieb Michel Dänzer:
There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
Yeah, this is pretty crucial.
Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
Well let me describe it like this:
We have use cases which require a guaranteed refresh rate of 144 Hz. That essentially means the client application needs to be able to spit out one frame/window content every ~6.9 ms. That's tough, but doable.
When you now add 6 ms of latency in the compositor, the client application has only ~0.9 ms left for its frame, which is basically impossible.
You misunderstood me. 6 ms is the lowest possible end-to-end latency from client to scanout, but the client can start as early as it wants/needs to. It's a trade-off between latency and the risk of missing a scanout cycle.
Not quite.
What weston-presentation-shm is reporting is a 6 ms delta between when it started its rendering loop and when the frame was presented to screen. How w-p-s was run matters a lot, because you can insert an arbitrary delay in there to simulate client rendering. It also matters a lot that the client is SHM, because that will result in Mutter doing glTexImage2D on whatever size the window is, then doing a full GL compositing pass. So even if it's run with zero delay, 6 ms isn't 'the amount of time it takes Mutter to get a frame to screen'; it's measuring the overhead of a texture upload and a full-screen composition pass as well.
I'm assuming the 'guaranteed 144Hz' target is a fullscreen GL client, for which you definitely avoid TexImage2D, and could hopefully (assuming the client picks a modifier which can be displayed) also avoid the composition pass in favour of direct scanout from the client buffer; that would give you a much lower number.
Each compositor has its own heuristics around timing. They all make their own tradeoff between low latency and fewest dropped frames. Obviously, the higher your latency, the lower the chance of missing a deadline. There's a lot of detail in the MR that Michel linked.
Cheers, Daniel
On 2021-06-01 12:49 p.m., Michel Dänzer wrote:
On 2021-06-01 12:21 p.m., Christian König wrote:
Another question is whether that is sufficient protection for the display server, or if we need further handling down the road. I mean, essentially we are moving the reliability problem into the display server.
Good question. This should generally protect the display server from freezing due to client fences never signalling, but there might still be corner cases.
E.g. a client might be able to sneak in a fence between when the compositor checks fences and when it submits its drawing to the kernel.
This is why implicit sync should be handled with explicit IPC: you pick the fence up once, and then you need to tell your GL stack to _not_ do implicit sync. That would need a new extension. Vulkan, AFAIU, does this automatically already. -Daniel
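A sketch of the "pick the fence up once" pattern, assuming the DMA_BUF_IOCTL_EXPORT_SYNC_FILE uAPI that was being proposed around this time (the helper name is hypothetical):

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Snapshot the implicit (write) fences of a dma-buf as an explicit
 * sync_file fd, exactly once. Any fence a client attaches after this
 * point is deliberately ignored, which closes the race where a fence
 * sneaks in between the compositor's check and its draw submission.
 * Returns the sync_file fd, or -1 on error. */
static int snapshot_implicit_fence(int dmabuf_fd)
{
    struct dma_buf_export_sync_file args = {
        .flags = DMA_BUF_SYNC_READ, /* we only read from the buffer */
    };

    if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
        return -1;

    return args.fd;
}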