Hi,
Since Christian believes that, with some changes, the kernel itself can be protected from deadlocks, we just need to make everything nice for userspace too. Instead of explaining how it will work, I will explain the cases where future hardware (and its kernel driver) will break existing userspace in order to protect everybody from deadlocks. Anything that uses implicit sync will be spared, so X and Wayland will be fine, assuming they don't import/export fences. Those use cases that do import/export fences might or might not work, depending on how the fences are used.
One of the necessities is that all fences will become future fences. The semantics of imported/exported fences will change completely and will have new restrictions on the usage. The restrictions are:
1) Android sync files will be impossible to support, so they won't be supported. (They don't allow future fences.)
2) Implicit sync and explicit sync will be mutually exclusive between processes. A process can use either one or the other, but not both. This is meant to prevent a deadlock condition with future fences where any process could malevolently deadlock execution of any other process, even execution of a higher-privileged process. The kernel will impose the following restrictions to protect against the deadlock:
a) a process with an implicitly-sync'd imported/exported buffer can't import/export a fence from/to another process
b) a process with an imported/exported fence can't import/export an implicitly-sync'd buffer from/to another process
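In kernel terms, the checks could look like this minimal sketch (every name here is invented for illustration; this is not an existing interface):

/* Sketch of the proposed kernel-side exclusivity check. */
static int check_sync_exclusivity(struct drm_process_state *proc,
                                  bool importing_fence,
                                  bool importing_implicit_bo)
{
        if (proc->has_implicit_synced_bo && importing_fence)
                return -EPERM;  /* restriction a) */
        if (proc->has_imported_fence && importing_implicit_bo)
                return -EPERM;  /* restriction b) */
        return 0;
}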
Alternative: A higher-privileged process could enforce both restrictions instead of the kernel to protect itself from the deadlock, but this would be a can of worms for existing userspace. It would be better if the kernel just broke unsafe userspace on future hw, just like sync files.
If both implicit and explicit sync are allowed to occur simultaneously, sending a future fence that will never signal to any process will deadlock that process after it acquires the implicit sync lock, which is a sequence number that the process is required to write to memory, followed by an interrupt from the GPU, within a finite time. This is how the deadlock can happen:
* The process gets sequence number N from the kernel for an implicitly-sync'd buffer.
* The process inserts (into the GPU user-mapped queue) a wait for sequence number N-1.
* The process inserts a wait for a fence, but it doesn't know that it will never signal ==> deadlock.
...
* The process inserts a command to write sequence number N to a predetermined memory location. (This will make the buffer idle and send an interrupt to the kernel.)
...
* The kernel will terminate the process because it has never received the interrupt. (I.e., a less-privileged process just killed a more-privileged process.)
The termination is caused by the implicit sync interrupt that never arrived, and the only way another process can cause that is by sending a fence that will never signal. Thus, importing/exporting fences from/to other processes can't be allowed simultaneously with implicit sync.
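As pseudocode, the fatal contents of the user-mapped queue would be roughly the following (all function names invented for illustration):

/* Pseudocode of the user-mapped queue in the deadlock above. */
queue_wait_seqno(q, N - 1);          /* wait for the previous buffer use */
queue_wait_fence(q, imported_fence); /* never signals => stuck here */
queue_emit_rendering(q);
queue_write_seqno(q, N);             /* never reached, so the buffer */
queue_emit_interrupt(q);             /* never idles, no IRQ is sent, and
                                        the kernel kills the process */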
3) Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
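For option a), a minimal compositor-side sketch, assuming the client buffer's GPU writes are covered by fences on its dma-buf fd (the buffer struct is invented; the POLLIN semantics of dma-buf fds are existing kernel behavior):

#include <poll.h>

/* Sketch: pick the buffer to present.  POLLIN on a dma-buf fd becomes
 * ready once the buffer is readable, i.e. pending GPU writes finished. */
struct buffer *pick_scanout_buffer(struct buffer *latest,
                                   struct buffer *previous,
                                   int timeout_ms)
{
        struct pollfd pfd = { .fd = latest->dmabuf_fd, .events = POLLIN };

        if (poll(&pfd, 1, timeout_ms) == 1 && (pfd.revents & POLLIN))
                return latest;   /* rendering finished in time */
        return previous;         /* timed out: keep the old content */
}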
*Redefining imported/exported fences and breaking some users/OSs is the only way to have userspace GPU command submission, and the deadlock example here is the counterexample proving that there is no other way.*
So, what are the chances this is going to fly with the ecosystem?
Thanks, Marek
Hi Marek,
well, I don't think that implicit and explicit synchronization need to be mutually exclusive.
What we should do is have the ability to transport a synchronization object with each BO.
Implicit and explicit synchronization then basically become the same; they just transport the synchronization object differently.
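For illustration, a hypothetical submission structure where each BO carries an optional synchronization object handle (invented names, not an existing UAPI):

/* Hypothetical UAPI sketch: each BO in a submission can carry a sync
 * object, so implicit sync just means the kernel attaches/fetches the
 * object for you, while explicit sync passes it around by handle. */
struct drm_submit_bo {
        __u32 bo_handle;
        __u32 syncobj_handle;  /* 0 = let the kernel handle it implicitly */
        __u64 flags;           /* e.g. read vs. write access */
};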
The biggest problem is the sync_files for Android, since they are really not easy to support at all. If Android wants to support user queues, we would probably have to do some changes there.
Regards, Christian.
If both implicit and explicit synchronization are handled the same, then the kernel won't be able to identify the process that caused an implicit sync deadlock. The process that is stuck waiting for a fence can be innocent, and the kernel can't punish it. Likewise, the GPU reset query that reports which process is guilty and which is innocent will only be able to report unknown. Is that OK?
Marek
My first email can be ignored except for the sync files. Oh well.
I think I see what you mean, Christian. If we assume that an imported fence is always read-only (the buffer with the sequence number is read-only), only the process that created and exported the fence can signal it. If the fence is not signaled, the exporting process is guilty. The only thing the importing process must do when it's about to use the fence as a dependency is to notify the kernel about it. Thus, the kernel will always know the dependency graph. Then, if the importing process times out, the kernel will blame any of the processes that passed it a fence that is still unsignaled. The kernel will blame the process that timed out only if all imported fences have been signaled. It seems pretty robust.
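In pseudocode, the blame decision could look like this sketch (invented names):

/* Sketch of the kernel's blame decision when a process times out. */
static struct process *find_guilty(struct process *waiter)
{
        struct imported_fence *f;

        list_for_each_entry(f, &waiter->imported_fences, node) {
                if (!fence_is_signaled(f))
                        return f->exporter;  /* blame the exporter */
        }
        /* All imported fences signaled: the waiter itself is guilty. */
        return waiter;
}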
It's the same with implicit sync except that the buffer with the sequence number is writable. Any process that has an implicitly-sync'd buffer can set the sequence number to 0 or UINT64_MAX. 0 will cause a timeout for the next job, while UINT64_MAX might cause a timeout a little later. The timeout can be mitigated by the kernel because the kernel knows the greatest number that should be there, but it's not possible to know which process is guilty (all processes holding the buffer handle would be suspects).
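The mitigation could be a simple clamp, assuming the kernel tracks the highest sequence number it handed out for the buffer (a sketch with invented names):

/* Sketch: bound the damage from a corrupted implicit-sync sequence
 * number.  max_issued is the highest value the kernel ever handed out. */
uint64_t seq = READ_ONCE(*buf->seqno_ptr);

if (seq > buf->max_issued)
        seq = buf->max_issued;  /* UINT64_MAX-style corruption: clamp */
/* A value rewound to 0 still stalls the next job until its timeout,
 * and any process holding the buffer handle could have written it. */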
Marek
Yes, exactly that's my thinking and also the reason why I'm pondering so hard on the requirement that the memory for shared user fences should not be modifiable by userspace directly.
Christian.
On 2021-05-27 11:51 p.m., Marek Olšák wrote:
- Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
FWIW, this is working with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).
On 01.06.21 at 11:02, Michel Dänzer wrote:
FWIW, this is working with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880 , even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).
Yeah, that is really nice to have.
One question is whether you wait on the CPU or the GPU for the new surface to become available? The former is a bit bad for latency and power management.
Another question is whether that is sufficient as security for the display server, or whether we need further handling down the road? I mean, essentially we are moving the reliability problem into the display server.
Regards, Christian.
On 2021-06-01 12:21 p.m., Christian König wrote:
Yeah, that is really nice to have.
One question is whether you wait on the CPU or the GPU for the new surface to become available?
It's based on polling dma-buf fds, i.e. CPU.
The former is a bit bad for latency and power management.
There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
Another question is whether that is sufficient as security for the display server, or whether we need further handling down the road? I mean, essentially we are moving the reliability problem into the display server.
Good question. This should generally protect the display server from freezing due to client fences never signalling, but there might still be corner cases.
On 01.06.21 at 12:49, Michel Dänzer wrote:
There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
Well, let me describe it like this:
We have use cases with a guaranteed 144 Hz refresh rate. That essentially means that the client application needs to be able to spit out one frame/window content every ~6.9 ms. That's tough, but doable.
When you now add 6 ms of latency in the compositor, the client application has only 0.9 ms left for its frame, which is basically impossible.
See, for the user fence handling, the display engine will learn to read sequence numbers from memory and decide on its own whether the old frame or the new one is scanned out, to get the latency there as low as possible.
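Roughly, in invented pseudocode:

/* Sketch of the flip decision described above: the display engine
 * reads the user-fence sequence number itself at scanout time. */
uint64_t seq = READ_ONCE(*fence_seqno_ptr);   /* advanced by the GPU */

if (seq >= new_frame->required_seqno)
        scan_out(new_frame);
else
        scan_out(old_frame);  /* new frame not ready yet */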
On Tue, Jun 1, 2021 at 2:10 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Well, let me describe it like this:
We have use cases with a guaranteed 144 Hz refresh rate. That essentially means that the client application needs to be able to spit out one frame/window content every ~6.9 ms. That's tough, but doable.
When you now add 6 ms of latency in the compositor, the client application has only 0.9 ms left for its frame, which is basically impossible.
See, for the user fence handling, the display engine will learn to read sequence numbers from memory and decide on its own whether the old frame or the new one is scanned out, to get the latency there as low as possible.
This won't work with implicit sync at all.
If you want to enable this use-case with driver magic and without the compositor being aware of what's going on, the solution is EGLStreams. Not sure we want to go there, but it's definitely a lot more feasible than trying to stuff eglstreams semantics into dma-buf implicit fencing support in a desperate attempt to not change compositors.
I still think the most reasonable approach here is that we wrap a dma_fence compat layer/mode over new hw for existing userspace/compositors, and then enable userspace memory fences and the fancy new features those allow with a new model that's built for them. Also, even with dma_fence we could implement your model of staying with the previous buffer (or an older buffer that's already rendered), but it needs explicit involvement of the compositor, at least without adding an eglstreams fd to the kernel and wiring up all the relevant extensions. -Daniel
On 01.06.21 at 14:30, Daniel Vetter wrote:
This won't work with implicit sync at all.
If you want to enable this use-case with driver magic and without the compositor being aware of what's going on, the solution is EGLStreams. Not sure we want to go there, but it's definitely a lot more feasible than trying to stuff eglstreams semantics into dma-buf implicit fencing support in a desperate attempt to not change compositors.
Well, not changing compositors is certainly not something I would try with this use case.
Not changing compositors is more like: OK, we have Ubuntu 20.04 and need to support that with the newest hardware generation.
I still think the most reasonable approach here is that we wrap a dma_fence compat layer/mode over new hw for existing userspace/compositors, and then enable userspace memory fences and the fancy new features those allow with a new model that's built for them.
Yeah, that's basically the same direction I'm heading. Question is how to fix all those details.
Also, even with dma_fence we could implement your model of staying with the previous buffer (or an older buffer that's already rendered), but it needs explicit involvement of the compositor, at least without adding an eglstreams fd to the kernel and wiring up all the relevant extensions.
Question is do we already have some extension which allows different textures to be selected on the fly depending on some state?
E.g. something like: use the new frame if it's available, and the old frame otherwise.
Whether you then apply this to the standard dma_fence-based hardware or the new user-fence-based one is pretty much irrelevant.
Regards, Christian.
On Tue., Jun. 1, 2021, 08:51 Christian König, < ckoenig.leichtzumerken@gmail.com> wrote:
Question is do we already have some extension which allows different textures to be selected on the fly depending on some state?
There is no such extension for sync objects, but it can be done with queries, like occlusion queries. There is also no timeout option, and it can only do "if" and "if not", but not "if .. else".
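For illustration, the two-pass "if" / "if not" pattern with OpenGL conditional rendering might look like this (this sketch assumes GL_ARB_conditional_render_inverted for the second pass; how the query result itself gets produced is a separate problem, and the draw functions are invented):

/* Sketch: emulate "if .. else" with two conditional rendering passes. */
glBeginConditionalRender(query, GL_QUERY_WAIT);
draw_latest_content();        /* runs only if the query passed */
glEndConditionalRender();

glBeginConditionalRender(query, GL_QUERY_WAIT_INVERTED);
draw_previous_content();      /* runs only if it did not */
glEndConditionalRender();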
Marek
On 2021-06-01 3:01 p.m., Marek Olšák wrote:
EGLStreams are a red herring here. Wayland has atomic state transactions, similar to the atomic KMS API. These semantics could be achieved even with EGLStreams, at least via additional EGL extensions.
Any fancy technique we're discussing here would have to be completely between the Wayland compositor and the kernel, transparent to anything else.
There is no such extension for sync objects, but it can be done with queries, like occlusion queries. There is also no timeout option, and it can only do "if" and "if not", but not "if .. else".
The Wayland compositor would also need to know which texture the shader ended up sampling from.
Hi Christian,
On Tue, 1 Jun 2021 at 13:51, Christian König ckoenig.leichtzumerken@gmail.com wrote:
Well, not changing compositors is certainly not something I would try with this use case.
Not changing compositors is more like: OK, we have Ubuntu 20.04 and need to support that with the newest hardware generation.
Serious question, have you talked to Canonical?
I mean there's a hell of a lot of effort being expended here, but it seems to all be predicated on the assumption that Ubuntu's LTS HWE/backport policy is totally immutable, and that we might need to make the kernel do backflips to work around that. But ... is it? Has anyone actually asked them how they feel about this?
I mean, my answer to the first email is 'no, absolutely not' from the technical perspective (the initial proposal totally breaks current and future userspace), from a design perspective (it breaks a lot of usecases which aren't single-vendor GPU+display+codec, or aren't just a simple desktop), and from a sustainability perspective (cutting Android adrift again isn't acceptable collateral damage to make it easier to backport things to last year's Ubuntu release).
But then again, I don't even know what I'm NAKing here ...? The original email just lists a proposal to break a ton of things, with proposed replacements which aren't technically viable, and it's not clear why. Can we please have some more details and some reasoning behind them?
I don't mind that userspace (compositor, protocols, clients like Mesa as well as codec APIs) need to do a lot of work to support the new model. I do really care though that the hard-binary-switch model works fine enough for AMD but totally breaks heterogeneous systems and makes it impossible for userspace to do the right thing.
Cheers, Daniel
Yes, we can't break anything, because we don't want to complicate things for ourselves. It's pretty much all NAK'd already. We are trying to gather more knowledge and then make better decisions.
The idea we are considering is that we'll expose memory-based sync objects to userspace for read only, and the kernel or hw will strictly control the memory writes to those sync objects. The hole in that idea is that userspace can decide not to signal a job, so even if userspace can't overwrite memory-based sync object states arbitrarily, it can still decide not to signal them, and then a future fence is born.
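A sketch of how the kernel could enforce that read-only mapping with standard mmap-handler techniques (syncpage_pfn is invented; the vm_flags manipulation and remap_pfn_range are existing kernel mechanisms):

/* Sketch: map the sync-object page to userspace read-only, so only
 * the kernel/hw can advance the value.  Clearing VM_MAYWRITE also
 * blocks a later mprotect(PROT_WRITE). */
static int syncpage_mmap(struct file *file, struct vm_area_struct *vma)
{
        vma->vm_flags &= ~(VM_WRITE | VM_MAYWRITE);
        return remap_pfn_range(vma, vma->vm_start, syncpage_pfn,
                               PAGE_SIZE, vma->vm_page_prot);
}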
Marek
On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák maraeo@gmail.com wrote:
[SNIP]
This would actually be treated as a GPU hang caused by that context, so it should be fine.
Marek
On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
[SNIP]
This would actually be treated as a GPU hang caused by that context, so it should be fine.
This is practically what I proposed already, except you're not doing it with dma_fence. And on the memory fence side this also doesn't actually give what you want for that compute model.
This seems like a bit of a worst-of-both-worlds approach to me? Tons of work in the kernel to hide these not-dma_fence-but-almost fences, and still pain to actually drive the hardware like it should be for compute or direct display.
Also maybe I've missed it, but I didn't see any replies to my suggestion of how to fake the entire dma_fence stuff on top of new hw. Would be interesting to know what doesn't work there instead of amd folks going off into internal again and then coming back with another rfc out of nowhere :-) -Daniel
Am 02.06.21 um 20:48 schrieb Daniel Vetter:
[SNIP]
Well to be honest I would just push back on our hardware/firmware guys that we need to keep kernel queues forever before going down that route.
That syncfile and all that Android stuff isn't working out of the box with the new shiny user queue submission model (which in turn is mostly because of Windows) already raised some eyebrows here.
Christian.
On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
[SNIP]
Well to be honest I would just push back on our hardware/firmware guys that we need to keep kernel queues forever before going down that route.
I looked again, and you said the model won't work because preemption is way too slow, even when the context is idle.
I guess at that point I got maybe too fed up and just figured "not my problem", but if preempt is too slow as the unload fence, you can do it with pte removal and tlb shootdown too (that is hopefully not too slow, otherwise your hw is just garbage and won't even be fast for direct submit compute workloads).
The only thing that you need to do when you use pte clearing + tlb shootdown instead of preemption as the unload fence for buffers that get moved is that if you get any gpu page fault, you don't serve that, but instead treat it as a tdr and shoot the context permanently.
So summarizing the model I proposed:
- you allow userspace to directly write into the ringbuffer, and also write the fences directly
- actual submit is done by the kernel, using drm/scheduler. The kernel blindly trusts userspace to set up everything else, and even just wraps dma_fences around the userspace memory fences.
- the only check is tdr. If a fence doesn't complete, a tdr fires, a) the kernel shoots the entire context and b) userspace recovers by setting up a new ringbuffer
- memory management is done using ttm only, you still need to supply the buffer list (ofc that list includes the always present ones, so CS will only get the list of special buffers like today). If your hw can't turn off gpu page faults and you ever get one, we pull up the same old solution: the kernel shoots the entire context.
The important thing is that from the gpu pov memory management works exactly like a compute workload with direct submit, except that you just terminate the context on _any_ page fault, instead of only those that go somewhere where there's really no mapping, and repairing the others.
Also I guess from reading the old thread this means you'd disable page fault retry because that is apparently also way too slow for anything.
- memory management uses an unload fence. That unload fence waits for all userspace memory fences (represented as dma_fence) to complete, with maybe some fudge to busy-spin until we've reached the actual end of the ringbuffer (maybe you have an IB tail there after the memory fence write, we have that on intel hw), and it waits for the memory to get "unloaded". This is either preemption, or pte clearing + tlb shootdown, or whatever else your hw provides which is a) used for dynamic memory management b) fast enough for actual memory management.
- any time a context dies we force-complete all its pending fences, in-order ofc
So from the hw pov this looks 99% like direct userspace submit, with the exact same mappings, command sequences and everything else. The only difference is that the ringbuffer head/tail updates happen from drm/scheduler, instead of directly from userspace.
None of this stuff needs funny tricks where the kernel controls the writes to memory fences, or where you need kernel ringbuffers, or anything like this. Userspace is allowed to do anything stupid, the rules are guaranteed with:
- we rely on the hw isolation features to work, but _exactly_ like compute direct submit would too
- dying on any page fault captures memory management issues
- dying (without kernel recover, this is up to userspace if it cares) on any tdr makes sure fences complete still
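To make the "kernel just wraps dma_fences around the userspace memory fences" step above concrete, here is a rough sketch against today's dma_fence_ops (the umem_* names and the CPU-mapped seqno are hypothetical; on a tdr the driver would dma_fence_signal() everything in order when it shoots the context):

    #include <linux/dma-fence.h>

    /* Sketch: a dma_fence whose state lives in userspace-written memory.
     * The kernel never writes the seqno itself; it only peeks at it. */
    struct umem_fence {
        struct dma_fence base;
        u64 *cpu_addr;    /* kernel mapping of the userspace fence value */
        u64 wait_value;
    };

    static const char *umem_get_driver_name(struct dma_fence *f)
    {
        return "umem-sketch";
    }

    static const char *umem_get_timeline_name(struct dma_fence *f)
    {
        return "umem-sketch-timeline";
    }

    /* The only "check" is reading the memory; no trust in userspace
     * needed. If this never becomes true, the tdr fires. */
    static bool umem_fence_signaled(struct dma_fence *f)
    {
        struct umem_fence *uf = container_of(f, struct umem_fence, base);

        return READ_ONCE(*uf->cpu_addr) >= uf->wait_value;
    }

    static const struct dma_fence_ops umem_fence_ops = {
        .get_driver_name   = umem_get_driver_name,
        .get_timeline_name = umem_get_timeline_name,
        .signaled          = umem_fence_signaled,
    };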
That syncfile and all that Android stuff isn't working out of the box with the new shiny user queue submission model (which in turn is mostly because of Windows) already raised some eyebrows here.
I think if you really want to make sure the current linux stack doesn't break the _only_ option you have is provide a ctx mode that allows dma_fence and drm/scheduler to be used like today.
For everything else it sounds like you're a few years too late, because even just huge kernel changes won't happen in time. Much less rewriting userspace protocols. -Daniel
Am 02.06.21 um 21:19 schrieb Daniel Vetter:
[SNIP]
Have you seen that one here: https://www.spinics.net/lists/amd-gfx/msg63101.html :)
I've rejected it because I think polling for 6 seconds on a TLB flush which can block interrupts as well is just madness.
[SNIP]
I think if you really want to make sure the current linux stack doesn't break the _only_ option you have is provide a ctx mode that allows dma_fence and drm/scheduler to be used like today.
Yeah, but I can still just tell our hw/fw guys that we really really need to keep kernel queues, or the whole Linux/Android infrastructure needs to get a compatibility layer like you describe above.
For everything else it sounds like you're a few years too late, because even just huge kernel changes won't happen in time. Much less rewriting userspace protocols.
Seconded. The question is rather whether we are going to start migrating at some point, or if we should keep pushing on our hw/fw guys.
Christian.
On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
[SNIP]
Have you seen that one here: https://www.spinics.net/lists/amd-gfx/msg63101.html :)
I've rejected it because I think polling for 6 seconds on a TLB flush which can block interrupts as well is just madness.
Hm but I thought you had like 2 tlb flush modes, the shitty one (with retrying page faults) and the not so shitty one?
But yeah at that point I think you just have to bite one of the bullets.
The thing is, with the hmm/userspace memory fence model this will be even worse, because you will _have_ to do this tlb flush deep down in core mm functions, so this is going to be userptr, but worse.
With the dma_resv/dma_fence bo memory management model you can at least wrap that tlb flush into a dma_fence and push the waiting/pinging onto a separate thread or something like that. If the hw really is that slow.
Somewhat aside: Personally I think that sriov needs to move over to the compute model, i.e. indefinite timeouts, no tdr, because everything takes too long. At least looking around sriov timeouts tend to be 10x bare metal, across the board.
But for stuff like cloud gaming that's serious amounts of heavy lifting since it brings us right back "the entire linux/android 3d stack is built on top of dma_fence right now".
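A sketch of the "wrap that tlb flush into a dma_fence and push the waiting/pinging onto a separate thread" idea, assuming the heavyweight flush really can take seconds (hw_start_heavyweight_flush() and hw_flush_done() are hypothetical driver hooks; the rest is existing kernel API):

    #include <linux/dma-fence.h>
    #include <linux/workqueue.h>
    #include <linux/delay.h>
    #include <linux/slab.h>

    struct tlb_flush_work {
        struct work_struct work;
        struct dma_fence *flush_fence;  /* signalled when the flush is done */
        void *hw;                       /* hypothetical hw handle */
    };

    /* Runs on a workqueue, so nobody busy-waits with interrupts blocked. */
    static void tlb_flush_worker(struct work_struct *w)
    {
        struct tlb_flush_work *tw = container_of(w, struct tlb_flush_work, work);

        hw_start_heavyweight_flush(tw->hw);   /* hypothetical */
        while (!hw_flush_done(tw->hw))        /* hypothetical */
            usleep_range(100, 200);           /* ping, don't spin hard */

        dma_fence_signal(tw->flush_fence);
        dma_fence_put(tw->flush_fence);
        kfree(tw);
    }

The memory-management code would then treat flush_fence like any other unload fence.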
[SNIP]
So from what I'm hearing other hw might gain the sw compat layer too. Plus I'm hoping that with the sw compat layer it'd be easier to smooth over userspace to the new model (because there will be a long time where we have to support both, maybe even with a runtime switch from userspace memory fences to dma_fence kernel stuff).
But in the end it's up to you what makes more sense between sw work and hw/fw work involved. -Daniel
Am 04.06.21 um 10:57 schrieb Daniel Vetter:
[SNIP]
Hm but I thought you had like 2 tlb flush modes, the shitty one (with retrying page faults) and the not so shitty one?
Yeah, we call this the lightweight and the heavyweight tlb flush.
The lightweight can be used when you are sure that you don't have any of the PTEs currently in flight in the 3D/DMA engine and you just need to invalidate the TLB.
The heavyweight must be used when you need to invalidate the TLB *AND* make sure that no concurrent operation moves new stuff into the TLB.
The problem is for this use case we have to use the heavyweight one.
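In code the rule is basically this (hypothetical helper, illustration only): the cheap invalidate is safe only when no engine can still hold a translation that was handed out earlier.

    enum tlb_flush_kind { TLB_FLUSH_LIGHTWEIGHT, TLB_FLUSH_HEAVYWEIGHT };

    /* Moving/unmapping memory behind a running queue always needs the
     * broadcast variant that waits until nothing uses the old entry. */
    static enum tlb_flush_kind pick_tlb_flush(_Bool ptes_maybe_in_flight)
    {
        return ptes_maybe_in_flight ? TLB_FLUSH_HEAVYWEIGHT
                                    : TLB_FLUSH_LIGHTWEIGHT;
    }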
But yeah at that point I think you just have to bite one of the bullets.
Yeah, completely agree. We can choose which way we want to die, but it's certainly not going to be nice whatever we do.
[SNIP]
So from what I'm hearing other hw might gain the sw compat layer too. Plus I'm hoping that with the sw compat layer it'd be easier to smooth over userspace to the new model (because there will be a long time where we have to support both, maybe even with a runtime switch from userspace memory fences to dma_fence kernel stuff).
But in the end it's up to you what makes more sense between sw work and hw/fw work involved.
I'm currently toying a bit with the idea of implementing the HW scheduling on the CPU.
The only obstacle that I can really see is that we might need to unmap a page in the CPU page table when a queue becomes idle.
But apart from that it would give us the required functionality, just without the hardware scheduler.
Christian.
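For the "unmap a page when a queue becomes idle" part, the kernel side could look roughly like this (struct user_queue and the parking policy are made up; unmap_mapping_range() is the existing primitive for zapping user PTEs of a mapping):

    #include <linux/mm.h>

    struct user_queue {
        struct address_space *doorbell_mapping;
        loff_t doorbell_offset;
        bool parked;
    };

    /* Zap the userspace PTE of the doorbell page; the next submission
     * faults into the kernel, which can resume CPU-side scheduling. */
    static void user_queue_park(struct user_queue *q)
    {
        unmap_mapping_range(q->doorbell_mapping, q->doorbell_offset,
                            PAGE_SIZE, 1);
        q->parked = true;
    }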
On Fri, Jun 04, 2021 at 01:27:15PM +0200, Christian König wrote:
[SNIP]
The problem is for this use case we have to use the heavyweight one.
Just for my own curiosity: So the lightweight flush is only for in-between CS when you know access is idle? Or does that also not work if userspace has a CS on a dma engine going at the same time because the TLBs aren't isolated enough between engines? -Daniel
[SNIP]
Am 09.06.21 um 15:19 schrieb Daniel Vetter:
[SNIP]
Just for my own curiosity: So the lightweight flush is only for in-between CS when you know access is idle? Or does that also not work if userspace has a CS on a dma engine going at the same time because the tlb aren't isolated enough between engines?
More or less correct, yes.
The problem is a lightweight flush only invalidates the TLB, but doesn't take care of entries which have been handed out to the different engines.
In other words what can happen is the following:
1. Shader asks TLB to resolve address X.
2. TLB looks into its cache and can't find address X so it asks the walker to resolve.
3. Walker comes back with result for address X and TLB puts that into its cache and gives it to Shader.
4. Shader starts doing some operation using result for address X.
5. You send lightweight TLB invalidate and TLB throws away cached values for address X.
6. Shader happily still uses whatever the TLB gave to it in step 3 to access address X.
See it like the shader has its own 1-entry L0 TLB cache which is not affected by the lightweight flush.
The heavyweight flush on the other hand sends out a broadcast signal to everybody and only comes back when we are sure that an address is not in use any more.
Christian.
On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
[SNIP]
The heavyweight flush on the other hand sends out a broadcast signal to everybody and only comes back when we are sure that an address is not in use any more.
Ah, makes sense. On intel the shaders only operate in VA, everything goes around as explicit async messages to IO blocks. So we don't have this; the only difference in tlb flushes is between a tlb flush in the IB and an mmio one which is independent of anything currently being executed on an engine. -Daniel
Hi Daniel,
We just talked about this whole topic internally and we came to the conclusion that the hardware needs to understand sync object handles and have high-level wait and signal operations in the command stream. Sync objects will be backed by memory, but they won't be readable or writable by processes directly. The hardware will log all accesses to sync objects and will send the log to the kernel periodically. The kernel will identify malicious behavior.
Example of a hardware command stream:

    ...
    ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is assigned by the kernel
    Draw();
    ImplicitSyncSignalWhenDone(syncObjHandle);
    ...
I'm afraid we have no other choice because of the TLB invalidation overhead.
Marek
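A sketch of what the kernel-side check on that access log could look like (plain C for readability; the entry layout, kill_context() and the "exactly last+1" policy are assumptions):

    #include <stdint.h>

    void kill_context(uint32_t vmid);   /* hypothetical */

    struct syncobj_state { uint64_t last_seq; };

    struct syncobj_log_entry {
        uint32_t handle;   /* index into the global syncobj table */
        uint32_t vmid;     /* which process did the access */
        uint64_t value;    /* value written by the signal command */
    };

    /* Every signal must advance the seqno by exactly one; anything else
     * (garbage, zero, going backwards) is treated as malicious. */
    static void check_syncobj_log(struct syncobj_state *objs,
                                  const struct syncobj_log_entry *log, int n)
    {
        for (int i = 0; i < n; i++) {
            struct syncobj_state *so = &objs[log[i].handle];

            if (log[i].value == so->last_seq + 1)
                so->last_seq = log[i].value;
            else
                kill_context(log[i].vmid);
        }
    }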
On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter daniel@ffwll.ch wrote:
[SNIP]
Hi guys,
maybe soften that a bit. Reading from the shared memory of the user fence is ok for everybody. What we need to take more care of is the writing side.
So my current thinking is that we allow read only access, but writing a new sequence value needs to go through the scheduler/kernel.
So when the CPU wants to signal a timeline fence it needs to call an IOCTL. When the GPU wants to signal the timeline fence it needs to hand that off to the hardware scheduler.
If we lock up, the kernel can check with the hardware who did the last write and what value was written.
That together with an IOCTL to give out sequence numbers for implicit sync to applications should be sufficient for the kernel to track who is responsible if something bad happens.
In other words when the hardware says that the shader wrote stuff like 0xdeadbeef 0x0 or 0xffffffff into memory we kill the process who did that.
If the hardware says that seq - 1 was written fine, but seq is missing then the kernel blames whoever was supposed to write seq.
Just piping the write through a privileged instance should be fine to make sure that we don't run into issues.
Christian.
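The blame rule from the last two paragraphs, written out (hypothetical names; the "expected writer" is whoever the kernel handed the sequence number to):

    #include <stdint.h>

    void kill_context(uint32_t vmid);   /* hypothetical */

    /* On a lockup, ask the hw who wrote last and what value was written. */
    static void assign_blame(uint64_t expected_seq,
                             uint32_t expected_writer_vmid,
                             uint64_t last_value, uint32_t last_writer_vmid)
    {
        if (last_value == expected_seq - 1)
            kill_context(expected_writer_vmid);  /* seq was never written */
        else if (last_value != expected_seq)
            kill_context(last_writer_vmid);      /* garbage was written */
        /* else: seq arrived fine, the lockup is somewhere else */
    }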
Am 10.06.21 um 17:59 schrieb Marek Olšák:
[SNIP]
The call to the hw scheduler has a limitation on the size of all parameters combined. I think we can only pass a 32-bit sequence number and a ~16-bit global (per-GPU) syncobj handle in one call and not much else.
The syncobj handle can be an element index in a global (per-GPU) syncobj table and it's read-only for all processes with the exception of the signal command. Syncobjs can either have per-VMID write access flags for the signal command (slow), or any process can write to any syncobj and only rely on the kernel checking the write log (fast).
In any case, we can execute the memory write in the queue engine and only use the hw scheduler for logging, which would be perfect.
Marek
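Purely illustrative: the ~48 bits mentioned above (a ~16-bit handle plus a 32-bit sequence number) fit in one 64-bit scheduler argument.

    #include <stdint.h>

    static inline uint64_t pack_signal_arg(uint16_t handle, uint32_t seqno)
    {
        return ((uint64_t)handle << 32) | seqno;
    }

    static inline uint16_t signal_arg_handle(uint64_t arg)
    {
        return (uint16_t)(arg >> 32);
    }

    static inline uint32_t signal_arg_seqno(uint64_t arg)
    {
        return (uint32_t)arg;
    }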
On Thu, Jun 10, 2021 at 12:33 PM Christian König < ckoenig.leichtzumerken@gmail.com> wrote:
Hi guys,
maybe soften that a bit. Reading from the shared memory of the user fence is ok for everybody. What we need to take more care of is the writing side.
So my current thinking is that we allow read only access, but writing a new sequence value needs to go through the scheduler/kernel.
So when the CPU wants to signal a timeline fence it needs to call an IOCTL. When the GPU wants to signal the timeline fence it needs to hand that of to the hardware scheduler.
If we lockup the kernel can check with the hardware who did the last write and what value was written.
That together with an IOCTL to give out sequence number for implicit sync to applications should be sufficient for the kernel to track who is responsible if something bad happens.
In other words when the hardware says that the shader wrote stuff like 0xdeadbeef 0x0 or 0xffffffff into memory we kill the process who did that.
If the hardware says that seq - 1 was written fine, but seq is missing then the kernel blames whoever was supposed to write seq.
Just pieping the write through a privileged instance should be fine to make sure that we don't run into issues.
Christian.
Am 10.06.21 um 17:59 schrieb Marek Olšák:
Hi Daniel,
We just talked about this whole topic internally and we came up to the conclusion that the hardware needs to understand sync object handles and have high-level wait and signal operations in the command stream. Sync objects will be backed by memory, but they won't be readable or writable by processes directly. The hardware will log all accesses to sync objects and will send the log to the kernel periodically. The kernel will identify malicious behavior.
Example of a hardware command stream: ... ImplicitSyncWait(syncObjHandle, sequenceNumber); // the sequence number is assigned by the kernel Draw(); ImplicitSyncSignalWhenDone(syncObjHandle); ...
I'm afraid we have no other choice because of the TLB invalidation overhead.
Marek
On Wed, Jun 9, 2021 at 2:31 PM Daniel Vetter daniel@ffwll.ch wrote:
On Wed, Jun 09, 2021 at 03:58:26PM +0200, Christian König wrote:
Am 09.06.21 um 15:19 schrieb Daniel Vetter:
[SNIP]
Yeah, we call this the lightweight and the heavyweight TLB flush.
The lightweight one can be used when you are sure that you don't have any of the PTEs currently in flight in the 3D/DMA engine and you just need to invalidate the TLB.
The heavyweight one must be used when you need to invalidate the TLB *AND* make sure that no concurrent operation moves new stuff into the TLB.
The problem is that for this use case we have to use the heavyweight one.
Just for my own curiosity: So the lightweight flush is only for in-between CS when you know access is idle? Or does that also not work if userspace has a CS on a DMA engine going at the same time because the TLBs aren't isolated enough between engines?
More or less correct, yes.
The problem is that a lightweight flush only invalidates the TLB, but doesn't take care of entries which have been handed out to the different engines.
In other words, what can happen is the following:
1. Shader asks the TLB to resolve address X.
2. TLB looks into its cache, can't find address X, so it asks the walker to resolve it.
3. Walker comes back with the result for address X, and the TLB puts it into its cache and gives it to the shader.
4. Shader starts doing some operation using the result for address X.
5. You send a lightweight TLB invalidate, and the TLB throws away the cached values for address X.
6. Shader happily still uses whatever the TLB gave it in step 3 to access address X.
See it like this: the shader has its own 1-entry L0 TLB cache which is not affected by the lightweight flush.
The heavyweight flush, on the other hand, sends out a broadcast signal to everybody and only comes back when we are sure that an address is not in use any more.
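As a hedged illustration of the distinction (invented names, not a real driver interface), the decision rule boils down to this:

    /* Sketch of the flush-type choice described above. */
    #include <stdbool.h>

    enum tlb_flush_kind { TLB_FLUSH_LIGHT, TLB_FLUSH_HEAVY };

    static enum tlb_flush_kind pick_tlb_flush(bool ptes_possibly_in_flight)
    {
            /* Lightweight only invalidates the TLB cache itself; a result
             * already handed to a shader (the "1-entry L0") survives it. */
            if (!ptes_possibly_in_flight)
                    return TLB_FLUSH_LIGHT;

            /* Heavyweight broadcasts and waits until no engine still uses
             * the address, so it is required when unmapping memory that
             * could still be accessed concurrently. */
            return TLB_FLUSH_HEAVY;
    }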
Ah, makes sense. On Intel the shaders only operate in VA; everything goes around as explicit async messages to IO blocks. So we don't have this. The only difference in TLB flushes is between a TLB flush in the IB and an MMIO one, which is independent of anything currently being executed on an engine.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
As long as we can figure out who touched a certain sync object last, that would indeed work, yes.
Christian.
On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
As long as we can figure out who touched a certain sync object last, that would indeed work, yes.
Don't you need to know who will touch it next, i.e. who is holding up your fence? Or maybe I'm just again totally confused. -Daniel
The kernel will know who should touch the implicit-sync semaphore next, and at the same time, a copy of all write requests to the implicit-sync semaphore will be forwarded to the kernel for monitoring and bo_wait.
Syncobjs could either use the same monitored access as implicit sync or be completely unmonitored. We haven't decided yet.
Syncfiles could either use one of the above or wait for a syncobj to go idle before converting to a syncfile.
Marek
On Thu, Jun 17, 2021 at 12:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
On Mon, Jun 14, 2021 at 07:13:00PM +0200, Christian König wrote:
As long as we can figure out who touched a certain sync object last, that would indeed work, yes.
Don't you need to know who will touch it next, i.e. who is holding up your fence? Or maybe I'm just again totally confused. -Daniel
On Thu, Jun 17, 2021 at 02:28:06PM -0400, Marek Olšák wrote:
The kernel will know who should touch the implicit-sync semaphore next, and at the same time, a copy of all write requests to the implicit-sync semaphore will be forwarded to the kernel for monitoring and bo_wait.
Syncobjs could either use the same monitored access as implicit sync or be completely unmonitored. We haven't decided yet.
Syncfiles could either use one of the above or wait for a syncobj to go idle before converting to a syncfile.
Hm, this all sounds like you're planning to completely rewrap everything ... I'm assuming the plan is still that this is going to be largely wrapped in dma_fence? Maybe with timeline objects being a bit more optimized, but I'm not sure how much you can optimize without breaking the interfaces. -Daniel
Timeline semaphore waits (polling on memory) will be unmonitored and as fast as the round trip to memory. Semaphore writes will be slower because a copy of those write requests will also be forwarded to the kernel. Arbitrary writes are not protected by the hw, but the kernel will take action against such behavior because it will receive them too.
I don't know if that would work with dma_fence.
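A minimal sketch of that split, with invented names and CPU atomics standing in for the actual hardware packets; the forwarding call is hypothetical:

    /* Sketch of the unmonitored-wait / monitored-signal split above. */
    #include <stdatomic.h>
    #include <stdint.h>

    /* Wait side: unmonitored, just polls the semaphore in memory. */
    static void timeline_wait(_Atomic uint64_t *sem, uint64_t value)
    {
            while (atomic_load_explicit(sem, memory_order_acquire) < value)
                    ;  /* in real hw this would be a wait packet, not a CPU spin */
    }

    /* Signal side: the write lands in memory, and a copy would also be
     * forwarded to the kernel so it can flag hostile/arbitrary values. */
    static void timeline_signal(_Atomic uint64_t *sem, uint64_t value)
    {
            atomic_store_explicit(sem, value, memory_order_release);
            /* hypothetical: forward_write_to_kernel(sem, value); */
    }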
Marek
On Thu, Jun 17, 2021 at 3:04 PM Daniel Vetter <daniel@ffwll.ch> wrote:
Hm, this all sounds like you're planning to completely rewrap everything ... I'm assuming the plan is still that this is going to be largely wrapped in dma_fence? Maybe with timeline objects being a bit more optimized, but I'm not sure how much you can optimize without breaking the interfaces. -Daniel
On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
Yes, we can't break anything because we don't want to complicate things for us. It's pretty much all NAK'd already. We are trying to gather more knowledge and then make better decisions.
The idea we are considering is that we'll expose memory-based sync objects to userspace for read only, and the kernel or hw will strictly control the memory writes to those sync objects. The hole in that idea is that userspace can decide not to signal a job, so even if userspace can't overwrite memory-based sync object states arbitrarily, it can still decide not to signal them, and then a future fence is born.
This would actually be treated as a GPU hang caused by that context, so it should be fine.
This is practically what I proposed already, except you're not doing it with dma_fence. And on the memory fence side this also doesn't actually give what you want for that compute model.
This seems like a bit of a worst-of-both-worlds approach to me? Tons of work in the kernel to hide these not-dma_fence-but-almost fences, and still a pain to actually drive the hardware like it should be for compute or direct display.
Also, maybe I've missed it, but I didn't see any replies to my suggestion of how to fake the entire dma_fence stuff on top of new hw. Would be interesting to know what doesn't work there instead of amd folks going off into internal again and then coming back with another rfc from out of nowhere :-)
Going internal again is probably a good idea to spare you the long discussions and not waste your time, but we haven't talked about the dma_fence stuff internally other than acknowledging that it can be solved.
The compute use case already uses the hw as-is with no inter-process sharing, which mostly keeps the kernel out of the picture. It uses glFinish to sync with GL.
The gfx use case needs new hardware logic to support implicit and explicit sync. When we propose a solution, we usually tear it apart ourselves the next day.
Since we are talking about next hw or next next hw, preemption should be better.
user queue = user-mapped ring buffer
For implicit sync, we will only let userspace lock access to a buffer via a user queue, which waits for the per-buffer sequence counter in memory to be >= the number assigned by the kernel, and later unlock the access with another command, which increments the per-buffer sequence counter in memory with atomic_inc regardless of the number assigned by the kernel. The kernel counter and the counter in memory can be out of sync, and I'll explain why that's OK. If a process increments the kernel counter but not the memory counter, that's its problem and it's the same as a GPU hang caused by that process. If a process increments the memory counter but not the kernel counter, the ">=" condition alongside atomic_inc guarantees that signaling n will signal n+1, so it will never deadlock, but it will also effectively disable synchronization. This method of disabling synchronization is similar to a process corrupting the buffer, which should be fine. Can you find any flaw in it? I can't find any.
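To make that concrete, a minimal sketch of the scheme, with invented names and CPU atomics standing in for user-queue commands; the real mechanism would be hardware wait/signal packets:

    /* Sketch of the >= wait + unconditional-increment unlock above:
     * a skipped or extra signal can only disable synchronization for
     * this buffer, it can never deadlock a waiter. */
    #include <stdatomic.h>
    #include <stdint.h>

    struct buffer_sync {
            _Atomic uint64_t mem_counter;  /* per-buffer sequence counter in memory */
    };

    /* "Lock": wait until the counter reaches the kernel-assigned number. */
    static void implicit_sync_wait(struct buffer_sync *b, uint64_t kernel_seq)
    {
            while (atomic_load_explicit(&b->mem_counter,
                                        memory_order_acquire) < kernel_seq)
                    ;  /* hw would use a wait command, not a CPU spin */
    }

    /* "Unlock": increment unconditionally, ignoring the kernel's number.
     * Because the wait uses >=, signaling n also satisfies waits for n-1. */
    static void implicit_sync_signal(struct buffer_sync *b)
    {
            atomic_fetch_add_explicit(&b->mem_counter, 1,
                                      memory_order_release);
    }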
The explicit submit can be done by userspace (if there is no synchronization), but we plan to use the kernel to do it for implicit sync. Essentially, the kernel will receive a buffer list and addresses of wait commands in the user queue. It will assign new sequence numbers to all buffers and write those numbers into the wait commands, and ring the hw doorbell to start execution of that queue.
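And a hedged sketch of the kernel-side submit path just described; every name and shape here is invented for illustration:

    /* Sketch: the kernel assigns fresh sequence numbers, patches them
     * into the wait commands inside the user-mapped queue, then rings
     * the hw doorbell. The helpers are assumed, not a real API. */
    #include <stddef.h>
    #include <stdint.h>

    struct submit_desc {
            uint32_t *wait_cmd_seq_slots[64]; /* addresses of wait commands in the user queue */
            uint64_t  buffer_handles[64];     /* buffer list to assign new numbers for */
            size_t    count;
    };

    uint32_t assign_next_seq(uint64_t buffer_handle);  /* assumed helper */
    void ring_doorbell(void);                          /* assumed helper */

    static void kernel_implicit_sync_submit(struct submit_desc *d)
    {
            for (size_t i = 0; i < d->count; i++) {
                    /* new per-buffer sequence number, tracked by the kernel */
                    uint32_t seq = assign_next_seq(d->buffer_handles[i]);
                    /* patch it into the wait command in the user queue */
                    *d->wait_cmd_seq_slots[i] = seq;
            }
            ring_doorbell();  /* start execution of the queue */
    }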
Marek
On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
Hm, maybe I misunderstood what exactly you wanted to do earlier. That kind of "we let userspace free-wheel whatever it wants, and the kernel ensures correctness of the resulting chain of dma_fence by resetting the entire context" is what I proposed too.
Like you say, userspace is allowed to render garbage already.
The explicit submit can be done by userspace (if there is no synchronization), but we plan to use the kernel to do it for implicit sync. Essentially, the kernel will receive a buffer list and addresses of wait commands in the user queue. It will assign new sequence numbers to all buffers and write those numbers into the wait commands, and ring the hw doorbell to start execution of that queue.
Yeah, for implicit sync I think using the kernel and drm/scheduler to sort out the dma_fence dependencies is probably best. Since you can filter out which dma_fence you hand to the scheduler for dependency tracking, you can filter out your own ones and let the hw handle those directly (depending on how much your hw can do and all that). On i915 we might do that to be able to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.
For buffer tracking with implicit sync I think cleanest is probably to still keep them wrapped as dma_fence and stuffed into dma_resv, but conceptually it's the same. If we let every driver reinvent their own buffer tracking just because the hw works a bit different it'll be a mess.
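As a hedged sketch of what wrapping such a memory fence in a dma_fence might look like (loosely following the kernel's dma_fence API; the mem_fence type and its fields are invented, and a real driver would need more, e.g. enable_signaling and proper dma_fence_init() on creation):

    #include <linux/dma-fence.h>
    #include <linux/spinlock.h>

    struct mem_fence {
            struct dma_fence base;
            spinlock_t lock;     /* would be passed to dma_fence_init() */
            u64 *counter;        /* memory timeline the hardware writes */
            u64 wait_value;      /* signalled once *counter >= wait_value */
    };

    static const char *mem_fence_driver_name(struct dma_fence *f)
    {
            return "mem-fence-sketch";
    }

    static const char *mem_fence_timeline_name(struct dma_fence *f)
    {
            return "mem-timeline-sketch";
    }

    static bool mem_fence_signaled(struct dma_fence *f)
    {
            struct mem_fence *mf = container_of(f, struct mem_fence, base);

            /* poll the memory timeline; the hw only ever moves it forward */
            return READ_ONCE(*mf->counter) >= mf->wait_value;
    }

    static const struct dma_fence_ops mem_fence_ops = {
            .get_driver_name   = mem_fence_driver_name,
            .get_timeline_name = mem_fence_timeline_name,
            .signaled          = mem_fence_signaled,
            /* a real implementation would also provide enable_signaling,
             * e.g. arming an interrupt or a timer-based poll */
    };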
Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets to keep the pieces if it gets it wrong. You do still need to handle external dma_fence though, hence the drm/scheduler frontend to sort these out. -Daniel
On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter <daniel@ffwll.ch> wrote:
Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets to keep the pieces if it gets it wrong. You do still need to handle external dma_fence though, hence the drm/scheduler frontend to sort these out.
The reason is to prevent a lower-privileged process from deadlocking/hanging a higher-privileged process where the kernel can't tell who did it. If the implicit-sync sequence counter is read-only to userspace and only incrementable by the unlock-signal command after the lock-wait command appeared in the same queue (both together forming a critical section), userspace can't manipulate it arbitrarily, and we get almost exactly the same behavior as implicit sync has today. That means any implicitly-sync'd buffer from any process can be fully trusted by a compositor to signal in a finite time, and possibly even trusted by the kernel. The only thing that's different is that a malicious process can disable implicit sync for a buffer in all processes/kernel, but it can't hang other processes/kernel (it can only hang itself, and the kernel will be notified). So I'm a happy panda now. :)
Marek
On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
Yeah, I think that's not going to work too well; it's too many piled-up hacks. Within a drm_file fd you can do whatever you feel like, since it's just one client.
But once implicit sync kicks in I think you need to go with dma_fence and drm/scheduler to handle the dependencies, and tdr kicking it. With the dma_fence you do know who's the offender - you might not know why, but that doesn't matter, you just shred the entire context and let that userspace figure out the details.
I think trying to make memory fences work as implicit sync directly, without wrapping them in a dma_fence and assorted guarantees, will just not work.
And once you do wrap them in dma_fence, then all the other problems go away: cross-driver sync, syncfiles, ... So I really don't see the benefit of this half-way approach.
Yes there's going to be a tad bit of overhead, but that's already there in the current model. And it can't hurt to have a bit of motivation for compositors to switch over to userspace memory fences properly. -Daniel
On Thu, Jun 3, 2021, 06:03 Daniel Vetter <daniel@ffwll.ch> wrote:
On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter daniel@ffwll.ch wrote:
On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter daniel@ffwll.ch
wrote:
On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák maraeo@gmail.com
wrote:
> Yes, we can't break anything because we don't want to
complicate
things
> for us. It's pretty much all NAK'd already. We are trying to
gather
more
> knowledge and then make better decisions. > > The idea we are considering is that we'll expose memory-based
sync
objects
> to userspace for read only, and the kernel or hw will strictly
control
the
> memory writes to those sync objects. The hole in that idea is
that
> userspace can decide not to signal a job, so even if userspace
can't
> overwrite memory-based sync object states arbitrarily, it can
still
decide
> not to signal them, and then a future fence is born. >
This would actually be treated as a GPU hang caused by that
context,
so
it
should be fine.
This is practically what I proposed already, except your not doing
it
with
dma_fence. And on the memory fence side this also doesn't actually
give
what you want for that compute model.
This seems like a bit a worst of both worlds approach to me? Tons
of
work
in the kernel to hide these not-dma_fence-but-almost, and still
pain to
actually drive the hardware like it should be for compute or direct display.
Also maybe I've missed it, but I didn't see any replies to my
suggestion
how to fake the entire dma_fence stuff on top of new hw. Would be interesting to know what doesn't work there instead of amd folks
going
of
into internal again and then coming back with another rfc from out
of
nowhere :-)
Going internal again is probably a good idea to spare you the long discussions and not waste your time, but we haven't talked about the dma_fence stuff internally other than acknowledging that it can be
solved.
The compute use case already uses the hw as-is with no inter-process sharing, which mostly keeps the kernel out of the picture. It uses
glFinish
to sync with GL.
The gfx use case needs new hardware logic to support implicit and
explicit
sync. When we propose a solution, it's usually torn apart the next
day by
ourselves.
Since we are talking about next hw or next next hw, preemption
should be
better.
user queue = user-mapped ring buffer
For implicit sync, we will only let userspace lock access to a buffer
via a
user queue, which waits for the per-buffer sequence counter in
memory to
be
= the number assigned by the kernel, and later unlock the access
with
another command, which increments the per-buffer sequence counter in
memory
with atomic_inc regardless of the number assigned by the kernel. The
kernel
counter and the counter in memory can be out-of-sync, and I'll
explain
why
it's OK. If a process increments the kernel counter but not the
memory
counter, that's its problem and it's the same as a GPU hang caused by
that
process. If a process increments the memory counter but not the
kernel
counter, the ">=" condition alongside atomic_inc guarantee that
signaling n
will signal n+1, so it will never deadlock but also it will
effectively
disable synchronization. This method of disabling synchronization is similar to a process corrupting the buffer, which should be fine.
Can you
find any flaw in it? I can't find any.
Hm maybe I misunderstood what exactly you wanted to do earlier. That
kind
of "we let userspace free-wheel whatever it wants, kernel ensures correctness of the resulting chain of dma_fence with reset the entire context" is what I proposed too.
Like you say, userspace is allowed to render garbage already.
The explicit submit can be done by userspace (if there is no synchronization), but we plan to use the kernel to do it for implicit sync. Essentially, the kernel will receive a buffer list and the addresses of wait commands in the user queue. It will assign new sequence numbers to all buffers, write those numbers into the wait commands, and ring the hw doorbell to start execution of that queue.
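A rough sketch of that kernel-side implicit-sync submission, with all names (user_queue, wait_cmd_patch, ring_doorbell) invented for illustration rather than taken from any existing driver API:

#include <stdint.h>

struct buffer {
	uint64_t kernel_counter;   /* kernel's view of the per-buffer sequence */
};

struct user_queue;                          /* user-mapped ring buffer (opaque) */
void ring_doorbell(struct user_queue *q);   /* assumed hw doorbell helper */

struct wait_cmd_patch {
	uint64_t *wait_cmd_seq;    /* points into a wait command in the user queue */
	struct buffer *buf;        /* buffer that wait command locks */
};

/* Assign fresh per-buffer sequence numbers, patch them into the queue's
 * wait commands, then kick the hw. */
static void submit_user_queue(struct user_queue *q,
			      struct wait_cmd_patch *patches, int n)
{
	for (int i = 0; i < n; i++) {
		uint64_t seq = ++patches[i].buf->kernel_counter;
		*patches[i].wait_cmd_seq = seq;
	}
	ring_doorbell(q);
}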
Yeah, for implicit sync I think the kernel and using drm/scheduler to sort out the dma_fence dependencies is probably best. Since you can filter which dma_fence you hand to the scheduler for dependency tracking, you can filter out your own ones and let the hw handle those directly (depending on how much your hw can do and all that). On i915 we might do that to be able to use the MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.

For buffer tracking with implicit sync I think the cleanest is probably to still keep them wrapped as dma_fence and stuffed into dma_resv, but conceptually it's the same. If we let every driver reinvent their own buffer tracking just because the hw works a bit differently, it'll be a mess.

Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets to keep the pieces if it gets it wrong. You do still need to handle external dma_fence though, hence the drm/scheduler frontend to sort these out.
The reason is to disallow a lower-privileged process deadlocking/hanging a higher-privileged process where the kernel can't tell who did it. If the implicit-sync sequence counter is read-only to userspace and only incrementable by the unlock-signal command after the lock-wait command appeared in the same queue (both together forming a critical section), userspace can't manipulate it arbitrarily and we get almost the exact same behavior as implicit sync has today. That means any implicitly-sync'd buffer from any process can be fully trusted by a compositor to signal in a finite time, and possibly even trusted by the kernel. The only thing that's different is that a malicious process can disable implicit sync for a buffer in all processes/kernel, but it can't hang other processes/kernel (it can only hang itself, and the kernel will be notified). So I'm a happy panda now. :)
Yeah I think that's not going to work too well, and is too many piled up hacks. Within a drm_file fd you can do whatever you feel like, since it's just one client.
But once implicit sync kicks in I think you need to go with dma_fence and drm/scheduler to handle the dependencies, with tdr kicking in. With the dma_fence you do know who's the offender - you might not know why, but that doesn't matter; you just shred the entire context and let that userspace figure out the details.
I think trying to make memory fences work as implicit sync directly, without wrapping them in a dma_fence and assorted guarantees, will just not work.
And once you do wrap them in dma_fence, then all the other problems go away: cross-driver sync, syncfiles, ... So I really don't see the benefit of this half-way approach.
Yes there's going to be a tad bit of overhead, but that's already there in the current model. And it can't hurt to have a bit of motivation for compositors to switch over to userspace memory fences properly.
Well, Christian thinks that we need a high level synchronization primitive in hw. I don't know myself and you may be right. A software scheduler with user queues might be one option. My part is only to find out how much of the scheduler logic can be moved to the hardware.
We plan to have memory timeline semaphores, or simply monotonic counters, and a fence will be represented by the counter address and a constant sequence number for the <= comparison. One counter can represent up to 2^64 different fences. Giving any process write access to a fence is the same as giving it the power to manipulate the signalled state of a sequence of up to 2^64 fences. That could mess up a lot of things. However, if the hardware had a high level synchronization primitive with access rights and a limited set of clearly defined operations such that we can formally prove whether it's safe for everybody, we could have a solution where we don't have to involve the software scheduler and just let the hardware do everything.
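As a minimal sketch of the fence representation described here - a counter address plus a constant sequence number, signalled once seq <= *counter - with all names illustrative rather than a proposed API:

#include <stdbool.h>
#include <stdint.h>

struct mem_fence {
	const volatile uint64_t *counter;  /* monotonic timeline semaphore in memory */
	uint64_t seq;                      /* constant for this fence */
};

static bool mem_fence_signalled(const struct mem_fence *f)
{
	/* a single 64-bit counter can stand in for up to 2^64 fences */
	return f->seq <= *f->counter;
}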
Marek
-Daniel
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Thu, Jun 3, 2021 at 12:55 PM Marek Olšák maraeo@gmail.com wrote:
[snip]
I don't think hw access rights control on memory fences makes sense. There are two cases:
- brave new world of native userspace memory fences. Currently that's compute, maybe direct display vk, hopefully/eventually compositors and desktops too. If you get an untrusted fence, you need to have fallback logic no matter what, and by design. vk is explicit in stating that if things hang, you get to keep all the pieces. So the compositor needs to _always_ treat userspace memory fences as hostile, wrap them in a timeout, and have a fallback frame/scene in its rendering path (see the sketch after this list). Probably same for the kernel on the display side, maybe down to the display hw picking the "right" frame depending upon the fence value right before scanout, as we discussed earlier. There's no point in hw access rights because, by design, even if no one tampers with your fence, it might never signal. So you have to cope with a hostile fence from untrusted sources anyway (and within an app it's trusted and you just die, as in stuck in an endless loop until someone sends a SIGKILL, when you deadlock or get it wrong some other way).
- old compat mode where we need to use dma_fence, otherwise we end up with another round of "amdgpu redefines implicit sync in incompatible ways", and Christian & me don't even know yet how to fix the current round without breaking use-cases badly. So it has to be dma_fence, and it has to be the same rules as on old hw, or it's just not going to work. This means you need to force in-order retiring of fences in the kernel, and you need to enforce timeouts. None of this needs hw access rights control, since once more it's just software constructs in the kernel. As a first approximation, drm/scheduler + the fence chain we already have in syncobj is probably good enough for this. E.g. if userspace rolls the fence backwards then the kernel just ignores that, because its internal dma_fence has signalled, and dma_fences never unsignal (it's a bit in struct dma_fence; once it's set we stop checking hw). And if it doesn't signal in time, then tdr jumps in and fixes the mess. Hw access control doesn't fix anything here, because you have to deal with timeout and ordering already anyway, or the dma_fence contract is broken.
So in both cases hw access control gains you nothing (at least I'm not seeing anything), it just makes the design more tricky. "Userspace can manipulate the fences" is _intentionally_ how these things work; we need a design that works with that hw design, not against it and somehow tries to get us back to the old world, but only halfway (i.e. not useful at all, since old userspace needs us to go all the way back to dma_fence, and new userspace wants to fully exploit userspace memory fences without artificial limitations for no reason).

-Daniel
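To make the first case above concrete, here is a rough sketch of a compositor wrapping an untrusted userspace memory fence in a timeout and falling back to the previous frame. The struct frame type, now_ns(), and the reuse of the hypothetical mem_fence helper from the earlier sketch are all assumptions, not any real compositor API:

#include <stdbool.h>
#include <stdint.h>

struct mem_fence {                         /* as in the earlier sketch */
	const volatile uint64_t *counter;
	uint64_t seq;
};

static bool mem_fence_signalled(const struct mem_fence *f)
{
	return f->seq <= *f->counter;
}

struct frame;                              /* opaque client frame */
uint64_t now_ns(void);                     /* assumed monotonic clock helper */

static struct frame *pick_frame(const struct mem_fence *f,
				struct frame *latest,
				struct frame *previous,
				uint64_t deadline_ns)
{
	while (now_ns() < deadline_ns) {
		if (mem_fence_signalled(f))
			return latest;     /* the client signalled in time */
	}
	/* hostile or stuck fence: show the last good frame and keep the
	 * desktop running at full rate */
	return previous;
}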
On Thu, Jun 3, 2021 at 7:53 PM Marek Olšák maraeo@gmail.com wrote:

Daniel, I think what you are suggesting is that we need to enable user queues with the drm scheduler and dma_fence first, and once that works, we can investigate how much of that kernel logic can be moved to the hw. Would that work? In theory it shouldn't matter whether the kernel does it or the hw does it. It's the same code, just in a different place.
Thanks, Marek
On Thu, Jun 3, 2021 at 7:53 PM Marek Olšák maraeo@gmail.com wrote:
[snip]
Yeah I guess that's another way to look at it. Maybe in practice you'll just move it from the kernel to userspace, which then programs the hw waits directly into its IB. That's at least how I'd do it on i915, assuming I'd have such hw. So these fences that userspace programs directly (to sync within itself) won't even show up as dependencies in the kernel.
And then yes, on the other side you can lift work from the drm/scheduler wrt dependencies you get in the kernel (whether explicit sync with sync_file, or implicit sync fished out of dma_resv) and program the hw directly that way. That would mean that userspace won't fill the ringbuffer directly, but the kernel would do that, so that you have space to stuff in the additional waits. Again assuming the i915 hw model; maybe it works differently on amd. IIRC we have some of that already in the i915 scheduler, but I'd need to recheck how much it really uses the hw semaphores.

-Daniel
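A sketch of the split described in this mail, with invented names throughout: userspace programs its internal waits straight into its IB, while the kernel writes the ring itself so it has room to prepend hw semaphore waits for the dependencies it collected (explicit sync_file or implicit sync from dma_resv):

#include <stdint.h>

struct ring;                               /* kernel-owned hw ring (opaque) */

struct dep {
	uint64_t sem_addr;                 /* hw semaphore / memory fence address */
	uint64_t value;                    /* value to wait for */
};

struct job {
	struct dep *deps;                  /* external deps resolved by the kernel */
	int num_deps;
	uint64_t ib_addr, ib_size;         /* userspace-built indirect buffer */
};

/* assumed ring-programming helpers */
void emit_sem_wait(struct ring *r, uint64_t addr, uint64_t value);
void emit_indirect_buffer(struct ring *r, uint64_t addr, uint64_t size);
void ring_doorbell(struct ring *r);

static void kernel_submit(struct ring *r, const struct job *job)
{
	/* waits for everything the kernel fished out of sync_file/dma_resv */
	for (int i = 0; i < job->num_deps; i++)
		emit_sem_wait(r, job->deps[i].sem_addr, job->deps[i].value);

	/* userspace-internal fences live inside the IB and never show up
	 * as kernel dependencies */
	emit_indirect_buffer(r, job->ib_addr, job->ib_size);

	ring_doorbell(r);
}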
On Thu., Jun. 3, 2021, 15:18 Daniel Vetter, daniel@ffwll.ch wrote:
[snip]
I was thinking we would pass per-process syncobj handles and buffer handles into commands in the user queue, or something equivalent. We do have a large degree of programmability in the hw, so we can do something like that. The question is whether this high-level user->hw interface would have any advantage over trivial polling on memory, etc. My impression is no, because the kernel would be robust enough that it wouldn't matter what userspace does, but I don't know. Anyway, all we need is user queues, and what you proposed seems totally sufficient.
Marek
On 02.06.21 at 10:57, Daniel Stone wrote:
Hi Christian,
On Tue, 1 Jun 2021 at 13:51, Christian König ckoenig.leichtzumerken@gmail.com wrote:
On 01.06.21 at 14:30, Daniel Vetter wrote:
If you want to enable this use-case with driver magic and without the compositor being aware of what's going on, the solution is EGLStreams. Not sure we want to go there, but it's definitely a lot more feasible than trying to stuff eglstreams semantics into dma-buf implicit fencing support in a desperate attempt to not change compositors.
Well not changing compositors is certainly not something I would try with this use case.
Not changing compositors is more like: ok, we have Ubuntu 20.04 and need to support that with the newest hardware generation.
Serious question, have you talked to Canonical?
I mean there's a hell of a lot of effort being expended here, but it seems to all be predicated on the assumption that Ubuntu's LTS HWE/backport policy is totally immutable, and that we might need to make the kernel do backflips to work around that. But ... is it? Has anyone actually asked them how they feel about this?
This was merely an example. What I wanted to say is that we need to support systems already deployed.
In other words our customers won't accept that they need to replace the compositor just because they switch to a new hardware generation.
I mean, my answer to the first email is 'no, absolutely not' from the technical perspective (the initial proposal totally breaks current and future userspace), from a design perspective (it breaks a lot of usecases which aren't single-vendor GPU+display+codec, or aren't just a simple desktop), and from a sustainability perspective (cutting Android adrift again isn't acceptable collateral damage to make it easier to backport things to last year's Ubuntu release).
But then again, I don't even know what I'm NAKing here ... ? The original email just lists a proposal to break a ton of things, with proposed replacements which aren't technically viable, and it's not clear why? Can we please have some more details and some reasoning behind them?
I don't mind that userspace (compositor, protocols, clients like Mesa as well as codec APIs) need to do a lot of work to support the new model. I do really care though that the hard-binary-switch model works fine enough for AMD but totally breaks heterogeneous systems and makes it impossible for userspace to do the right thing.
Well, how the handling for new Android, distributions etc. is going to look is a completely different story.
And I completely agree with both Daniel Vetter and you that we need to keep this in mind when designing the compatibility with older software.
For Android I'm really not sure what to do. In general Android is already trying to do the right thing by using explicit sync; the problem is that this is built around the idea that this explicit sync is syncfile kernel based.
Either we need to change Android and come up with something that works with user fences as well or we somehow invent a compatibility layer for syncfile as well.
Regards, Christian.
Cheers, Daniel
On Wed, Jun 2, 2021 at 5:44 AM Christian König <ckoenig.leichtzumerken@gmail.com> wrote:
[snip]
> Either we need to change Android and come up with something that works with user fences as well or we somehow invent a compatibility layer for syncfile as well.
What's the issue with syncfiles that syncobjs don't suffer from?
Marek
Am 02.06.21 um 11:58 schrieb Marek Olšák:
What's the issue with syncfiles that syncobjs don't suffer from?
Syncobjs were designed with future fences in mind. In other words, we already have the ability to wait for a future submission to appear, with all the nasty locking implications.
Syncfiles on the other hand are just containers for up to N kernel fences, and since we don't have kernel fences any more, that is rather tricky to keep working.
Going to look into the uAPI around syncfiles once more and see if we can somehow use that for user fences as well.
Christian.
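For reference, the "wait for a future submission to appear" semantic Christian describes is exposed today through the libdrm syncobj API via the WAIT_FOR_SUBMIT flag. A minimal sketch in C (the helper name is hypothetical):

#include <stddef.h>
#include <stdint.h>
#include <xf86drm.h>

/* Wait on a syncobj that may not have a fence attached yet:
 * WAIT_FOR_SUBMIT tells the kernel to first wait for a fence to be
 * submitted to the syncobj, then wait for that fence to signal. A
 * sync_file, by contrast, is a fixed container of already-created
 * fences and cannot express this. The timeout is an absolute
 * CLOCK_MONOTONIC value in nanoseconds. */
static int wait_future_fence(int drm_fd, uint32_t syncobj,
                             int64_t abs_timeout_nsec)
{
    return drmSyncobjWait(drm_fd, &syncobj, 1, abs_timeout_nsec,
                          DRM_SYNCOBJ_WAIT_FLAGS_WAIT_FOR_SUBMIT,
                          NULL /* first_signaled, unused */);
}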
On 2021-06-01 2:10 p.m., Christian König wrote:
Am 01.06.21 um 12:49 schrieb Michel Dänzer:
On 2021-06-01 12:21 p.m., Christian König wrote:
Am 01.06.21 um 11:02 schrieb Michel Dänzer:
On 2021-05-27 11:51 p.m., Marek Olšák wrote:
- Compositors (and other privileged processes, and display flipping) can't trust imported/exported fences. They need a timeout recovery mechanism from the beginning, and the following are some possible solutions to timeouts:
a) use a CPU wait with a small absolute timeout, and display the previous content on timeout
b) use a GPU wait with a small absolute timeout, and conditional rendering will choose between the latest content (if signalled) and previous content (if timed out)
The result would be that the desktop can run close to 60 fps even if an app runs at 1 fps.
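As an illustration of option a), here is a minimal sketch in C, assuming the standard dma-buf poll semantics (a dma-buf fd becomes readable once the writer's fences have signalled); the helper and struct names are hypothetical, not taken from any compositor:

#include <poll.h>
#include <stdbool.h>

/* Option a): a CPU wait with a small absolute timeout. POLLIN on a
 * dma-buf fd becomes ready once the exclusive/write fences on the
 * buffer have signalled. */
static bool buffer_idle_within(int dmabuf_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = dmabuf_fd, .events = POLLIN };
    return poll(&pfd, 1, timeout_ms) == 1 && (pfd.revents & POLLIN);
}

struct surface {
    int dmabuf_fd;
    void *latest;   /* most recently attached client buffer */
    void *previous; /* last buffer known to be fully rendered */
};

/* Called from the compositor's repaint cycle: if the client's fence
 * has not signalled within the timeout, display the previous content
 * instead of stalling the whole desktop. */
static void *pick_content(struct surface *s)
{
    return buffer_idle_within(s->dmabuf_fd, 2 /* ms */)
               ? s->latest : s->previous;
}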
FWIW, this is working with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1880, even with implicit sync (on current Intel GPUs; amdgpu/radeonsi would need to provide the same dma-buf poll semantics as other drivers, and high priority GFX contexts via EGL_IMG_context_priority which can preempt lower priority ones).
Yeah, that is really nice to have.
One question is whether you wait on the CPU or the GPU for the new surface to become available.
It's based on polling dma-buf fds, i.e. CPU.
The former is a bit bad for latency and power management.
There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
Well let me describe it like this:
We have use cases which require a guaranteed refresh rate of 144 Hz. That essentially means the client application needs to be able to spit out one frame/window content every ~6.9 ms. That's tough, but doable.
When you now add 6 ms of latency in the compositor, the client application has only ~0.9 ms left for its frame, which is basically impossible.
You misunderstood me. 6 ms is the lowest possible end-to-end latency from client to scanout, but the client can start as early as it wants/needs to. It's a trade-off between latency and the risk of missing a scanout cycle.
Note that what I wrote above is about the case where the compositor needs to draw its own frame sampling from the client buffer. If your concern is about a fullscreen application for which the compositor can directly use the application buffers for scanout, it should be possible in theory to get the latency incurred by the compositor down to ~1 ms.
If that's too much[0], it could be improved further by adding atomic KMS API to replace a pending page flip with another one. Then the compositor could just directly submit a flip as soon as a new buffer becomes ready (or even as soon as the client submits it to the compositor, depending on how exactly the new KMS API works). Then the minimum latency should be mostly up to the kernel driver / HW.
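For context, today's nonblocking flip submission through the libdrm atomic API looks roughly like the sketch below (property-ID lookup omitted, names hypothetical). The piece the proposal above would add is a way to replace the request while an earlier commit on the same CRTC is still pending; currently a second nonblocking commit simply fails with EBUSY:

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Queue a flip to new_fb_id without blocking; the kernel delivers a
 * page-flip event on the DRM fd once the flip completes. */
static int submit_flip(int drm_fd, uint32_t plane_id,
                       uint32_t fb_id_prop, uint32_t new_fb_id,
                       void *user_data)
{
    drmModeAtomicReq *req = drmModeAtomicAlloc();
    if (!req)
        return -1;

    drmModeAtomicAddProperty(req, plane_id, fb_id_prop, new_fb_id);

    int ret = drmModeAtomicCommit(drm_fd, req,
                                  DRM_MODE_ATOMIC_NONBLOCK |
                                  DRM_MODE_PAGE_FLIP_EVENT,
                                  user_data);
    drmModeAtomicFree(req);
    return ret;
}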
Another possibility would be for the application to use KMS directly, e.g. via a DRM lease. That might still require the same new API to get the flip submission latency significantly below 1 ms though.
[0] Though I'm not sure how to reconcile that with "spitting out one frame every ~6.9ms is tough", as that means the theoretical minimum total client→scanout latency is ~7 ms (and missing a scanout cycle ~doubles the latency).
Hi,
On Tue, 1 Jun 2021 at 14:18, Michel Dänzer michel@daenzer.net wrote:
On 2021-06-01 2:10 p.m., Christian König wrote:
Am 01.06.21 um 12:49 schrieb Michel Dänzer:
There isn't a choice for Wayland compositors in general, since there can be arbitrary other state which needs to be applied atomically together with the new buffer. (Though in theory, a compositor might get fancy and special-case surface commits which can be handled by waiting on the GPU)
Yeah, this is pretty crucial.
Latency is largely a matter of scheduling in the compositor. The latency incurred by the compositor shouldn't have to be more than single-digit milliseconds. (I've seen total latency from when the client starts processing a (static) frame to when it starts being scanned out as low as ~6 ms with https://gitlab.gnome.org/GNOME/mutter/-/merge_requests/1620, lower than typical with Xorg)
Well let me describe it like this:
We have use cases which require a guaranteed refresh rate of 144 Hz. That essentially means the client application needs to be able to spit out one frame/window content every ~6.9 ms. That's tough, but doable.
When you now add 6 ms of latency in the compositor, the client application has only ~0.9 ms left for its frame, which is basically impossible.
You misunderstood me. 6 ms is the lowest possible end-to-end latency from client to scanout, but the client can start as early as it wants/needs to. It's a trade-off between latency and the risk of missing a scanout cycle.
Not quite.
What weston-presentation-shm is reporting is a 6 ms delta between when it started its rendering loop and when the frame was presented to screen. How w-p-s was run matters a lot, because you can insert an arbitrary delay in there to simulate client rendering. It also matters a lot that the client is SHM, because that will result in Mutter doing glTexImage2D on whatever size the window is, then doing a full GL compositing pass. So even if it's run with zero delay, 6 ms isn't 'the amount of time it takes Mutter to get a frame to screen'; it's measuring the overhead of a texture upload and a full-screen composition pass as well.
I'm assuming the 'guaranteed 144Hz' target is a fullscreen GL client, for which you definitely avoid TexImage2D, and could hopefully (assuming the client picks a modifier which can be displayed) also avoid the composition pass in favour of direct scanout from the client buffer; that would give you a much lower number.
Each compositor has its own heuristics around timing. They all make their own tradeoff between low latency and fewest dropped frames. Obviously, the higher your latency, the lower the chance of missing a deadline. There's a lot of detail in the MR that Michel linked.
Cheers, Daniel
On 2021-06-01 12:49 p.m., Michel Dänzer wrote:
On 2021-06-01 12:21 p.m., Christian König wrote:
Another question is whether that is sufficient protection for the display server, or if we need further handling down the road. I mean, essentially we are moving the reliability problem into the display server.
Good question. This should generally protect the display server from freezing due to client fences never signalling, but there might still be corner cases.
E.g. a client might be able to sneak in a fence between when the compositor checks fences and when it submits its drawing to the kernel.
This is why implicit sync should be handled with explicit IPC: you pick the fence up once, and then you need to tell your GL stack to _not_ do implicit sync. That would need a new extension. Vulkan, AFAIU, does this automatically already. -Daniel
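A sketch of the "pick the fence up once" pattern, assuming the DMA_BUF_IOCTL_EXPORT_SYNC_FILE uAPI that was being proposed around this time (the helper name is hypothetical):

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Snapshot the implicit (write) fences of a dma-buf as an explicit
 * sync_file fd, exactly once. Any fence a client attaches after this
 * point is deliberately ignored, which closes the race where a fence
 * sneaks in between the compositor's check and its draw submission.
 * Returns the sync_file fd, or -1 on error. */
static int snapshot_implicit_fence(int dmabuf_fd)
{
    struct dma_buf_export_sync_file args = {
        .flags = DMA_BUF_SYNC_READ, /* we only read from the buffer */
    };

    if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &args) < 0)
        return -1;

    return args.fd;
}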