Hi,
On #dri-devel Daniel invited me to chime in on the topic of clearing GPU memory handed to userspace, so here I go.
I was asking how information leak from giving userspace dirty memory previously used by another process is not seen as a security issue. I was pointed to a recent thread, which offers a little perspective: https://lists.freedesktop.org/archives/dri-devel/2020-November/287144.html
I think the main argument shown there is weak:
> And for the legacy node model with authentication of clients against the X server, leaking that all around was ok.
seeing how there's the XCSECURITY extension that is supposed to limit what clients can retrieve, and how there could be two X servers running for different users.
My other concern is how easy it is to cause system instability or hangs via out-of-bounds writes from the GPU (from compute shaders or copy commands). In my several years of experience doing GPU computing with NVIDIA tech, I don't recall ever needing to lose time rebooting my PC after running a buggy CUDA "kernel". Heck, I could run the GCC C testsuite on the GPU without worrying about locking myself and others out of the server. But now that I develop on a laptop with AMD's latest mobile SoC, a mistake in my GLSL code more often than not forces a reboot. I hope you understand what a huge pain that is.
What are the existing GPU hardware capabilities for memory protection (both in terms of preventing random accesses to system memory, as with an IOMMU, and in terms of isolating different process contexts from each other), and to what extent are Linux DRM drivers taking advantage of them?
Would you consider producing a document with answers to the above so users know what to expect?
Thank you.

Alexander
On Mon, Nov 30, 2020 at 05:07:00PM +0300, Alexander Monakov wrote:
> Hi,
>
> On #dri-devel Daniel invited me to chime in on the topic of clearing GPU memory handed to userspace, so here I go.
>
> I was asking how information leak from giving userspace dirty memory previously used by another process is not seen as a security issue. I was pointed to a recent thread, which offers a little perspective: https://lists.freedesktop.org/archives/dri-devel/2020-November/287144.html
>
> I think the main argument shown there is weak:
>
> > And for the legacy node model with authentication of clients against the X server, leaking that all around was ok.
>
> seeing how there's the XCSECURITY extension that is supposed to limit what clients can retrieve, and how there could be two X servers running for different users.
>
> My other concern is how easy it is to cause system instability or hangs via out-of-bounds writes from the GPU (from compute shaders or copy commands). In my several years of experience doing GPU computing with NVIDIA tech, I don't recall ever needing to lose time rebooting my PC after running a buggy CUDA "kernel". Heck, I could run the GCC C testsuite on the GPU without worrying about locking myself and others out of the server. But now that I develop on a laptop with AMD's latest mobile SoC, a mistake in my GLSL code more often than not forces a reboot. I hope you understand what a huge pain that is.
That sounds like amdgpu GPU reset not being great, which (from what I've seen looking at it) it indeed isn't. There shouldn't be any information leaks after a reset, though.
> What are the existing GPU hardware capabilities for memory protection (both in terms of preventing random accesses to system memory, as with an IOMMU, and in terms of isolating different process contexts from each other), and to what extent are Linux DRM drivers taking advantage of them?
>
> Would you consider producing a document with answers to the above so users know what to expect?
Atm not documented anywhere, unfortunately. There's some documentation about render nodes, but it's not very explicit about what they guarantee wrt security:
https://dri.freedesktop.org/docs/drm/gpu/drm-uapi.html#render-nodes
Currently render nodes should guarantee that you never see anything from another gpu client (including some hw exploits, where that's doable). Without render nodes we only try to protect the system overall from gpu workloads, but not gpu workloads against each other. That's mostly a thing on older hardware though.
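Roughly, from the client side "using render nodes" just means opening the render minor instead of the primary node; there's no DRM-master authentication step at all, and only the ioctls meant to be safe for mutually untrusted clients are reachable through that fd. Untested sketch, node name hardcoded for brevity:

/* untested sketch: open a render node; no DRM-master auth needed */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* first render minor; real code should enumerate /dev/dri/ or use
     * drmGetDevices2() from libdrm instead of hardcoding the name */
    int fd = open("/dev/dri/renderD128", O_RDWR);
    if (fd < 0) {
        perror("open /dev/dri/renderD128");
        return 1;
    }
    printf("render node fd %d, only render-allowed ioctls reachable\n", fd);
    close(fd);
    return 0;
}

Compare with the primary node (/dev/dri/card0), where a client either has to be DRM master or be authenticated against the current master.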
Note that the reality is slightly more disappointing, e.g. amdgpu not force-clearing VRAM when render nodes are used. And there's more like that, I think.
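If you want to check what your particular kernel/driver combo actually does, something along these lines (rough sketch using libdrm's amdgpu wrappers, buffer size and node name picked arbitrarily) allocates a fresh VRAM buffer through a render node and counts how many bytes come back non-zero, i.e. data the process never wrote itself:

/* rough sketch: does a freshly allocated VRAM BO come back zeroed?
 * build: gcc leak.c $(pkg-config --cflags --libs libdrm_amdgpu) */
#include <amdgpu.h>
#include <amdgpu_drm.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/dri/renderD128", O_RDWR);
    if (fd < 0) {
        perror("open /dev/dri/renderD128");
        return 1;
    }

    uint32_t major, minor;
    amdgpu_device_handle dev;
    if (amdgpu_device_initialize(fd, &major, &minor, &dev)) {
        fprintf(stderr, "amdgpu_device_initialize failed\n");
        return 1;
    }

    struct amdgpu_bo_alloc_request req = {
        .alloc_size = 16 << 20,                  /* 16 MiB, arbitrary */
        .phys_alignment = 4096,
        .preferred_heap = AMDGPU_GEM_DOMAIN_VRAM,
        /* ask for CPU-visible VRAM so we can map it directly */
        .flags = AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED,
    };
    amdgpu_bo_handle bo;
    if (amdgpu_bo_alloc(dev, &req, &bo)) {
        fprintf(stderr, "amdgpu_bo_alloc failed\n");
        return 1;
    }

    void *cpu;
    if (amdgpu_bo_cpu_map(bo, &cpu) == 0) {
        /* anything non-zero here is data this process never wrote */
        const unsigned char *p = cpu;
        uint64_t dirty = 0;
        for (uint64_t i = 0; i < req.alloc_size; i++)
            dirty += p[i] != 0;
        printf("%llu of %llu bytes non-zero\n",
               (unsigned long long)dirty,
               (unsigned long long)req.alloc_size);
        amdgpu_bo_cpu_unmap(bo);
    }

    amdgpu_bo_free(bo);
    amdgpu_device_deinitialize(dev);
    close(fd);
    return 0;
}

Don't read too much into a single run, of course; what (if anything) shows up depends a lot on what ran before and on the kernel version.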
Cheers,
Daniel
> Thank you.
>
> Alexander
On 2020-11-30 3:07 p.m., Alexander Monakov wrote:
> My other concern is how easy it is to cause system instability or hangs via out-of-bounds writes from the GPU (from compute shaders or copy commands). In my several years of experience doing GPU computing with NVIDIA tech, I don't recall ever needing to lose time rebooting my PC after running a buggy CUDA "kernel". Heck, I could run the GCC C testsuite on the GPU without worrying about locking myself and others out of the server. But now that I develop on a laptop with AMD's latest mobile SoC, a mistake in my GLSL code more often than not forces a reboot. I hope you understand what a huge pain that is.
>
> What are the existing GPU hardware capabilities for memory protection (both in terms of preventing random accesses to system memory, as with an IOMMU, and in terms of isolating different process contexts from each other), and to what extent are Linux DRM drivers taking advantage of them?
Modern (or rather non-ancient at this point: basically anything released within the last decade) AMD GPUs have mostly perfect protection between different execution contexts (i.e. normally different processes, though it's not always a 1:1 mapping). Each context has its own virtual GPU address space and cannot access any memory which isn't mapped into that address space, and the kernel driver only creates such mappings for memory belonging to a buffer object which the context has permission to access and has explicitly asked to be mapped.
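To make the "explicitly asked to be mapped" part concrete, this is roughly what the map step looks like through libdrm's amdgpu wrappers (sketch only; error handling, unmap and VA-range cleanup omitted). Until something like this has been done for a buffer object, the context's GPU page tables simply contain no entry for it:

/* sketch: map a BO into this context's private GPU virtual address space */
#include <amdgpu.h>
#include <amdgpu_drm.h>
#include <stdint.h>

/* Returns the GPU virtual address the BO was mapped at, or 0 on failure.
 * dev/bo are assumed to come from amdgpu_device_initialize() /
 * amdgpu_bo_alloc(). */
uint64_t map_bo_for_this_context(amdgpu_device_handle dev, amdgpu_bo_handle bo,
                                 uint64_t size, amdgpu_va_handle *va_handle)
{
    uint64_t gpu_va = 0;

    /* reserve a range in this process's GPU virtual address space */
    if (amdgpu_va_range_alloc(dev, amdgpu_gpu_va_range_general, size,
                              4096, 0, &gpu_va, va_handle, 0))
        return 0;

    /* ask the kernel to create page table entries for this context only;
     * it will refuse for BOs the context isn't allowed to access */
    if (amdgpu_bo_va_op(bo, 0, size, gpu_va,
                        AMDGPU_VM_PAGE_READABLE | AMDGPU_VM_PAGE_WRITEABLE,
                        AMDGPU_VA_OP_MAP))
        return 0;

    return gpu_va;
}

Other contexts' address spaces are completely unaffected by this; a GPU virtual address is only meaningful within the context that created the mapping.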
The instability you're seeing likely isn't due to lack of memory protection but due to any of a large number of other ways a GPU can end up in a hanging state, and the drivers and wider ecosystem not being very good at recovering from that yet.