This is the i915 driver VM_BIND feature design RFC patch series along with the required uapi definition and description of intended use cases.
v2: Updated design and uapi, more documentation.
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>

Niranjana Vishwanathapura (2):
  drm/doc/rfc: VM_BIND feature design document
  drm/doc/rfc: VM_BIND uapi definition

 Documentation/gpu/rfc/i915_vm_bind.h   | 176 +++++++++++++++++++++
 Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 3 files changed, 390 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
VM_BIND design document with description of intended use cases.
Signed-off-by: Niranjana Vishwanathapura <niranjana.vishwanathapura@intel.com>
---
 Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++
 Documentation/gpu/rfc/index.rst        |   4 +
 2 files changed, 214 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst
new file mode 100644
index 000000000000..cdc6bb25b942
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.rst
@@ -0,0 +1,210 @@
+==========================================
+I915 VM_BIND feature design and use cases
+==========================================
+
+VM_BIND feature
+================
+DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMD to bind/unbind GEM buffer
+objects (BOs) or sections of a BO at specified GPU virtual addresses on
+a specified address space (VM).
+
+These mappings (also referred to as persistent mappings) will be persistent
+across multiple GPU submissions (execbuff) issued by the UMD, without user
+having to provide a list of all required mappings during each submission
+(as required by older execbuff mode).
+
+VM_BIND ioctl defers binding the mappings until next execbuff submission
+where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE
+flag is set (useful if mapping is required for an active context).
+
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND.
+User has to opt-in for VM_BIND mode of binding for an address space (VM)
+during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension.
+A VM in VM_BIND mode will not support older execbuff mode of binding.
+
+UMDs can still send BOs of these persistent mappings in execlist of execbuff
+for specifying BO dependencies (implicit fencing) and to use BO as a batch,
+but those BOs should be mapped ahead via vm_bind ioctl.
+
+VM_BIND features include,
+- Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+- VA mapping can map to a partial section of the BO (partial binding).
+- Support capture of persistent mappings in the dump upon GPU error.
+- TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  usecases will be helpful.
+- Asynchronous vm_bind and vm_unbind support.
+- VM_BIND uses user/memory fence mechanism for signaling bind completion
+  and for signaling batch completion in long running contexts (explained
+  below).
+
+VM_PRIVATE objects
+------------------
+By default, BOs can be mapped on multiple VMs and can also be dma-buf
+exported. Hence these BOs are referred to as Shared BOs.
+During each execbuff submission, the request fence must be added to the
+dma-resv fence list of all shared BOs mapped on the VM.
+
+VM_BIND feature introduces an optimization where user can create BO which
+is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during
+BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on
+the VM they are private to and can't be dma-buf exported.
+All private BOs of a VM share the dma-resv object. Hence during each execbuff
+submission, they need only one dma-resv fence list updated. Thus the fast
+path (where required mappings are already bound) submission latency is O(1)
+w.r.t the number of VM private BOs.
+
+VM_BIND locking hierarchy
+-------------------------
+VM_BIND locking order is as below.
+
+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
+   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
+
+   In future, when GPU page faults are supported, we can potentially use a
+   rwsem instead, so that multiple pagefault handlers can take the read side
+   lock to lookup the mapping and hence can run in parallel.
+
+2) The BO's dma-resv lock will protect i915_vma state and needs to be held
+   while binding a vma and while updating dma-resv fence list of a BO.
+   The private BOs of a VM will all share a dma-resv object.
+
+   This lock is held in vm_bind call for immediate binding, during vm_unbind
+   call for unbinding and during execbuff path for binding the mapping and
+   updating the dma-resv fence list of the BO.
+
+3) Spinlock/s to protect some of the VM's lists.
+
+We will also need support for bulk LRU movement of persistent mapping to
+avoid additional latencies in execbuff path.
+
+GPU page faults
+----------------
+Both older execbuff mode and the newer VM_BIND mode of binding will require
+using dma-fence to ensure residency.
+In future when GPU page faults are supported, no dma-fence usage is required
+as residency is purely managed by installing and removing/invalidating ptes.
+
+
+User/Memory Fence
+==================
+The idea is to take a user specified virtual address and install an interrupt
+handler to wake up the current task when the memory location passes the user
+supplied filter.
+
+User/Memory fence is a <address, value> pair. To signal the user fence,
+specified value will be written at the specified virtual address and
+wakeup the waiting process. User can wait on a user fence with the
+gem_wait_user_fence ioctl.
+
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify
+interrupt within their batches after updating the value to have sub-batch
+precision on the wakeup. Each batch can signal a user fence to indicate
+the completion of next level batch. The completion of very first level batch
+needs to be signaled by the command streamer. The user must provide the
+user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE
+extension of execbuff ioctl, so that KMD can setup the command streamer to
+signal it.
+
+User/Memory fence can also be supplied to the kernel driver to signal/wake up
+the user process after completion of an asynchronous operation.
+
+When VM_BIND ioctl was provided with a user/memory fence via the
+I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion
+of binding of that mapping. All async binds/unbinds are serialized, hence
+signaling of user/memory fence also indicates the completion of all previous
+binds/unbinds.
+
+This feature will be derived from the below original work:
+https://patchwork.freedesktop.org/patch/349417/
+
+
+VM_BIND use cases
+==================
+
+Long running Compute contexts
+------------------------------
+Usage of dma-fence expects that they complete in reasonable amount of time.
+Compute on the other hand can be long running. Hence it is appropriate for
+compute to use user/memory fence and dma-fence usage will be limited to
+in-kernel consumption only. This requires an execbuff uapi extension to pass
+in user fence. Compute must opt-in for this mechanism with
+I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
+
+The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence
+and implicit dependency setting are not allowed on long running contexts.
+
+Where GPU page faults are not available, kernel driver upon buffer invalidation
+will initiate a suspend (preemption) of long running context with a dma-fence
+attached to it. And upon completion of that suspend fence, finish the
+invalidation, revalidate the BO and then resume the compute context. This is
+done by having a per-context fence (called suspend fence) proxying as
+i915_request fence. This suspend fence is enabled when there is a wait on it,
+which triggers the context preemption.
+
+This is much easier to support with VM_BIND compared to the current heavier
+execbuff path resource attachment.
+
+Low Latency Submission
+-----------------------
+Allows compute UMD to directly submit GPU jobs instead of through execbuff
+ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
+
+Debugger
+---------
+With debug event interface user space process (debugger) is able to keep track
+of and act upon resources created by another process (debuggee) and attached
+to GPU via vm_bind interface.
+
+Mesa/Valkun
+------------
+VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving
+performance. For Vulkan it should be straightforward to use VM_BIND.
+For Iris implicit buffer tracking must be implemented before we can harness
+VM_BIND benefits. With increasing GPU hardware performance reducing CPU
+overhead becomes more important.
+
+Page level hints settings
+--------------------------
+VM_BIND allows any hints setting per mapping instead of per BO.
+Possible hints include read-only, placement and atomicity.
+Sub-BO level placement hint will be even more relevant with
+upcoming GPU on-demand page fault support.
+
+Page level Cache/CLOS settings
+-------------------------------
+VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+
+Shared Virtual Memory (SVM) support
+------------------------------------
+VM_BIND interface can be used to map system memory directly (without gem BO
+abstraction) using the HMM interface.
+
+
+Broader i915 cleanups
+=====================
+Supporting this whole new vm_bind mode of binding which comes with its own
+usecases to support and the locking requirements requires proper integration
+with the existing i915 driver. This calls for some broader i915 driver
+cleanups/simplifications for maintainability of the driver going forward.
+Here are a few things identified and being looked into.
+
+- Make pagetable allocations evictable and manage them similar to VM_BIND
+  mapped objects. Page table pages are similar to persistent mappings of a
+  VM (the difference here is that the page table pages will not
+  have an i915_vma structure and after swapping pages back in, parent page
+  link needs to be updated).
+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
+  does not use it and complexity it brings in is probably more than the
+  performance advantage we get in legacy execbuff case.
+- Remove vma->open_count counting
+- Remove i915_vma active reference tracking. Instead use underlying BO's
+  dma-resv fence list to determine if an i915_vma is active or not.
+
+These can be worked upon after initial vm_bind support is added.
+
+
+UAPI
+=====
+Uapi definition can be found here:
+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..7d10c36b268d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
 
     i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst
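
As an illustration of the O(1) claim in the "VM_PRIVATE objects" section of the patch above, here is a minimal userspace sketch in C. It is not i915 code: `struct resv`, `struct bo`, `struct vm`, `vm_add_bo()`, `vm_submit()` and `count_updates()` are hypothetical names invented for this example, and `struct resv` only counts fence additions instead of modelling a real dma-resv object. It assumes that all VM private BOs point at one reservation object owned by the VM, while each shared BO carries its own.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BOS 16

/* Hypothetical stand-in for a dma-resv object: it only counts fence additions. */
struct resv {
	unsigned int fences_added;
};

/* Hypothetical BO: resv points at whichever reservation object tracks it. */
struct bo {
	struct resv *resv;
	struct resv own_resv;	/* used only by shared BOs */
};

/* Hypothetical VM: private BOs all share private_resv. */
struct vm {
	struct resv private_resv;
	struct bo private_bos[MAX_BOS];
	size_t nr_private;
	struct bo shared_bos[MAX_BOS];
	size_t nr_shared;
};

static void vm_add_bo(struct vm *vm, int is_private)
{
	if (is_private) {
		struct bo *bo = &vm->private_bos[vm->nr_private++];

		bo->resv = &vm->private_resv;	/* shared with all private BOs */
	} else {
		struct bo *bo = &vm->shared_bos[vm->nr_shared++];

		bo->resv = &bo->own_resv;	/* one resv per shared BO */
	}
}

/* Model one submission; returns how many fence lists had to be updated. */
static size_t vm_submit(struct vm *vm)
{
	size_t updates = 1;

	/* A single update covers every private BO of the VM. */
	vm->private_resv.fences_added++;

	/* Each shared BO needs its own fence list updated. */
	for (size_t i = 0; i < vm->nr_shared; i++) {
		vm->shared_bos[i].resv->fences_added++;
		updates++;
	}
	return updates;
}

/* Build a VM with the given BO counts and count updates for one submission. */
static size_t count_updates(size_t nr_private, size_t nr_shared)
{
	struct vm vm = {0};

	assert(nr_private <= MAX_BOS && nr_shared <= MAX_BOS);
	for (size_t i = 0; i < nr_private; i++)
		vm_add_bo(&vm, 1);
	for (size_t i = 0; i < nr_shared; i++)
		vm_add_bo(&vm, 0);
	return vm_submit(&vm);
}
```

In this model count_updates(1, 3) and count_updates(16, 3) both return 4: the cost grows with the number of shared BOs only, which is the behaviour the design text describes for the fast path.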
On Mon, Mar 7, 2022 at 3:30 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
+Mesa/Valkun
s/Valkun/Vulkan/
Alex
-- 2.21.0.rc0.32.g243a4c7e27
On Wed, Mar 09, 2022 at 10:58:09AM -0500, Alex Deucher wrote:
On Mon, Mar 7, 2022 at 3:30 PM Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com wrote:
+Mesa/Valkun
s/Valkun/Vulkan/
Thanks Alex, Will fix.
Niranjana
Alex
+------------ +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving +performance. For Vulkan it should be straightforward to use VM_BIND. +For Iris implicit buffer tracking must be implemented before we can harness +VM_BIND benefits. With increasing GPU hardware performance reducing CPU +overhead becomes more important.
+Page level hints settings +-------------------------- +VM_BIND allows any hints setting per mapping instead of per BO. +Possible hints include read-only, placement and atomicity. +Sub-BO level placement hint will be even more relevant with +upcoming GPU on-demand page fault support.
+Page level Cache/CLOS settings +------------------------------- +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+Shared Virtual Memory (SVM) support +------------------------------------ +VM_BIND interface can be used to map system memory directly (without gem BO +abstraction) using the HMM interface.
+Broder i915 cleanups +===================== +Supporting this whole new vm_bind mode of binding which comes with its own +usecases to support and the locking requirements requires proper integration +with the existing i915 driver. This calls for some broader i915 driver +cleanups/simplifications for maintainability of the driver going forward. +Here are few things identified and are being looked into.
+- Make pagetable allocations evictable and manage them similar to VM_BIND
+  mapped objects. Page table pages are similar to persistent mappings of a
+  VM (difference here are that the page table pages will not
+  have an i915_vma structure and after swapping pages back in, parent page
+  link needs to be updated).
+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
+  do not use it and complexity it brings in is probably more than the
+  performance advantage we get in legacy execbuff case.
+- Remove vma->open_count counting
+- Remove i915_vma active reference tracking. Instead use underlying BO's
+  dma-resv fence list to determine if a i915_vma is active or not.
+These can be worked upon after intitial vm_bind support is added.
+UAPI +===== +Uapi definiton can be found here: +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst index 91e93a705230..7d10c36b268d 100644 --- a/Documentation/gpu/rfc/index.rst +++ b/Documentation/gpu/rfc/index.rst @@ -23,3 +23,7 @@ host such documentation: .. toctree::
    i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst
-- 2.21.0.rc0.32.g243a4c7e27
Adding a pile of people who've expressed interest in vm_bind for their drivers.
Also note to the intel folks: This is largely written with me having my subsystem co-maintainer hat on, i.e. what I think is the right thing to do here for the subsystem at large. There is substantial rework involved here, but it's not any different from i915 adopting ttm or i915 adopting drm/sched, and I do think this stuff needs to happen in one form or another.
On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
VM_BIND design document with description of intended use cases.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++ Documentation/gpu/rfc/index.rst | 4 + 2 files changed, 214 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..cdc6bb25b942 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,210 @@ +========================================== +I915 VM_BIND feature design and use cases +==========================================
+VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BOs at specified GPU virtual addresses on +a specified address space (VM).
+These mappings (also referred to as persistent mappings) will be persistent +across multiple GPU submissions (execbuff) issued by the UMD, without user +having to provide a list of all required mappings during each submission +(as required by older execbuff mode).
+VM_BIND ioctl deferes binding the mappings until next execbuff submission +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE +flag is set (useful if mapping is required for an active context).
So this is a screw-up I've done, and for upstream I think we need to fix it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and I was wrong suggesting we should do this a few years back when we kicked this off internally :-(
What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few things on top:
- in and out fences, like with execbuf, to allow userspace to sync with execbuf as needed
- for compute-mode context this means userspace memory fences
- for legacy context this means a timeline syncobj in drm_syncobj
No sync_file or anything else like this at all. This means a bunch of work, but also it'll have benefits because it means we should be able to use exactly the same code paths and logic for both compute and for legacy context, because drm_syncobj supports future fence semantics.
Also on the implementation side we still need to install dma_fence to the various dma_resv, and for this we need the new dma_resv_usage series from Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING flag to make sure they never result in an oversync issue with execbuf. I don't think trying to land vm_bind without that prep work in dma_resv_usage makes sense.
Also as soon as dma_resv_usage has landed there's a few cleanups we should do in i915:
- ttm bo moving code should probably simplify a bit (and maybe more of the code should be pushed as helpers into ttm)
- clflush code should be moved over to using USAGE_KERNEL and the various hacks and special cases should be ditched. See df94fd05e69e ("drm/i915: expand on the kernel-doc for cache_dirty") for a bit more context
This is still not yet enough, since if a vm_bind races with an eviction we might stall on the new buffers being readied first before the context can continue. This needs some care to make sure that vma which aren't fully bound yet are on a separate list, and vma which are marked for unbinding are removed from the main working set list as soon as possible.
All of these things are relevant for the uapi semantics, which means
- they need to be documented in the uapi kerneldoc, ideally with example flows
- umd need to ack this
The other thing here is the async/nonblocking path. I think we still need that one, but again it should not sync with anything going on in execbuf, but simply execute the ioctl code in a kernel thread. The idea here is that this works like a special gpu engine, so that compute and vk can schedule bindings interleaved with rendering. This should be enough to get a performant vk sparse binding/textures implementation.
But I'm not entirely sure on this one, so this definitely needs acks from umds.
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension. +A VM in VM_BIND mode will not support older execbuff mode of binding.
+UMDs can still send BOs of these persistent mappings in execlist of execbuff +for specifying BO dependencies (implicit fencing) and to use BO as a batch, +but those BOs should be mapped ahead via vm_bind ioctl.
should or must?
Also I'm not really sure that's a great interface. The batchbuffer really only needs to be an address, so maybe all we need is an extension to supply an u64 batchbuffer address instead of trying to retrofit this into an unfitting current uapi.
And for implicit sync there's two things:
- for vk I think the right uapi is the dma-buf fence import/export ioctls from Jason Ekstrand. I think we should land that first instead of hacking funny concepts together
- for gl the dma-buf import/export might not be fast enough, since gl needs to do a _lot_ of implicit sync. There we might need to use the execbuffer buffer list, but then we should have extremely clear uapi rules which disallow _everything_ except setting the explicit sync uapi
Again all this stuff needs to be documented in detail in the kerneldoc uapi spec.
+VM_BIND features include,
+- Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+- VA mapping can map to a partial section of the BO (partial binding).
+- Support capture of persistent mappings in the dump upon GPU error.
+- TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  usecases will be helpful.
+- Asynchronous vm_bind and vm_unbind support.
+- VM_BIND uses user/memory fence mechanism for signaling bind completion
+  and for signaling batch completion in long running contexts (explained
+  below).
This should all be in the kerneldoc.
+VM_PRIVATE objects +------------------ +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
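To make the O(1) claim concrete, here is a minimal C sketch under stated assumptions: plain structs stand in for dma-resv objects, and every name (`sketch_resv`, `sketch_priv_vm`, `sketch_obj`, `sketch_submit`) is hypothetical rather than i915 code. Private BOs point at the VM's single reservation object, so one submission updates it once no matter how many private BOs exist, while each shared BO still needs its own update.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dma-resv object: just counts added fences. */
struct sketch_resv {
	unsigned int nr_fences;
};

struct sketch_priv_vm {
	struct sketch_resv shared_resv; /* shared by all VM private BOs */
};

struct sketch_obj {
	struct sketch_resv own_resv;    /* used only by shared BOs */
	struct sketch_resv *resv;       /* reservation object actually in use */
};

/* Shared BO: keeps its own reservation object. */
static void sketch_obj_init_shared(struct sketch_obj *obj)
{
	obj->own_resv.nr_fences = 0;
	obj->resv = &obj->own_resv;
}

/* VM private BO: borrows the VM's reservation object. */
static void sketch_obj_init_private(struct sketch_obj *obj,
				    struct sketch_priv_vm *vm)
{
	assert(vm != NULL);
	obj->own_resv.nr_fences = 0;
	obj->resv = &vm->shared_resv;
}

/*
 * Model one submission: add the request fence once for all private BOs,
 * then once per shared BO. Returns the number of reservation objects updated.
 */
static size_t sketch_submit(struct sketch_priv_vm *vm,
			    struct sketch_obj *shared, size_t nr_shared)
{
	size_t updated = 0;
	size_t i;

	vm->shared_resv.nr_fences++;
	updated++;

	for (i = 0; i < nr_shared; i++) {
		shared[i].resv->nr_fences++;
		updated++;
	}
	return updated;
}
```

In this model the cost of a submission grows with the number of shared BOs only; the private BO count never appears in `sketch_submit()`.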
Two things:
- I think the above is required for initial vm_bind for vk, it kinda doesn't make much sense without that, and will allow us to match amdgpu and radeonsi
- Christian König just landed ttm bulk lru helpers, and I think we need to use those. This means vm_bind will only work with the ttm backend, but that's what we have for the big dgpu where vm_bind helps more in terms of performance, and the igfx conversion to ttm is already going on.
Furthermore the i915 shrinker lru has stopped being an lru, so I think that should also be moved over to the ttm lru in some fashion to make sure we once again have a reasonable and consistent memory aging and reclaim architecture. The current code is just too much of a complete mess.
And since this is all fairly integral to how the code arch works I don't think merging a different version which isn't based on ttm bulk lru helpers makes sense.
Also I do think the page table lru handling needs to be included here, because that's another complete hand-rolled separate world for not much good reasons. I guess that can happen in parallel with the initial vm_bind bring-up, but it needs to be completed by the time we add the features beyond the initial support needed for vk.
+VM_BIND locking hirarchy +------------------------- +VM_BIND locking order is as below.
+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
+   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
+   In future, when GPU page faults are supported, we can potentially use a
+   rwsem instead, so that multiple pagefault handlers can take the read side
+   lock to lookup the mapping and hence can run in parallel.
+2) The BO's dma-resv lock will protect i915_vma state and needs to be held
+   while binding a vma and while updating dma-resv fence list of a BO.
+   The private BOs of a VM will all share a dma-resv object.
+   This lock is held in vm_bind call for immediate binding, during vm_unbind
+   call for unbinding and during execbuff path for binding the mapping and
+   updating the dma-resv fence list of the BO.
+3) Spinlock/s to protect some of the VM's lists.
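The locking order listed above can be sketched in userspace terms as follows. This is only an illustration: pthread mutexes stand in for the kernel mutex, the dma-resv lock and the spinlock, and `sketch_lock_vm`, `sketch_lock_bo` and `sketch_bind()` are hypothetical names, not i915 code.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in i915. */
struct sketch_lock_bo {
	pthread_mutex_t resv_lock;    /* level 2: stands in for the dma-resv lock */
	int bound;
};

struct sketch_lock_vm {
	pthread_mutex_t vm_bind_lock; /* level 1: protects the vm_bind list */
	pthread_mutex_t list_lock;    /* level 3: stands in for a spinlock */
	size_t nr_bound;              /* length of the (elided) bound list */
};

static void sketch_lock_vm_init(struct sketch_lock_vm *vm)
{
	pthread_mutex_init(&vm->vm_bind_lock, NULL);
	pthread_mutex_init(&vm->list_lock, NULL);
	vm->nr_bound = 0;
}

static void sketch_lock_bo_init(struct sketch_lock_bo *bo)
{
	pthread_mutex_init(&bo->resv_lock, NULL);
	bo->bound = 0;
}

/*
 * Immediate bind: take the locks strictly in the order 1 -> 2 -> 3 and
 * release them in reverse. Returns 0 on success, -1 if already bound.
 */
static int sketch_bind(struct sketch_lock_vm *vm, struct sketch_lock_bo *bo)
{
	int ret = -1;

	assert(vm != NULL && bo != NULL);

	pthread_mutex_lock(&vm->vm_bind_lock);       /* 1) vm_bind mutex */
	pthread_mutex_lock(&bo->resv_lock);          /* 2) dma-resv lock */

	if (!bo->bound) {
		bo->bound = 1;
		pthread_mutex_lock(&vm->list_lock);  /* 3) list lock */
		vm->nr_bound++;
		pthread_mutex_unlock(&vm->list_lock);
		ret = 0;
	}

	pthread_mutex_unlock(&bo->resv_lock);
	pthread_mutex_unlock(&vm->vm_bind_lock);
	return ret;
}
```

Every path that needs more than one of these locks would have to follow the same nesting, otherwise two threads could deadlock by acquiring them in opposite order.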
+We will also need support for bluk LRU movement of persistent mapping to +avoid additional latencies in execbuff path.
This needs more detail and explanation of how each level is required. Also the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
Like "some of the VM's lists" explains pretty much nothing.
+GPU page faults +---------------- +Both older execbuff mode and the newer VM_BIND mode of binding will require +using dma-fence to ensure residency. +In future when GPU page faults are supported, no dma-fence usage is required +as residency is purely managed by installing and removing/invalidating ptes.
This is a bit confusing. I think one part of this should be moved into the section with future vm_bind use-cases (we're not going to support page faults with legacy softpin or even worse, relocations). The locking discussion should be part of the much longer list of use cases that motivate the locking design.
+User/Memory Fence +================== +The idea is to take a user specified virtual address and install an interrupt +handler to wake up the current task when the memory location passes the user +supplied filter.
+User/Memory fence is a <address, value> pair. To signal the user fence, +specified value will be written at the specified virtual address and +wakeup the waiting process. User can wait on an user fence with the +gem_wait_user_fence ioctl.
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify +interrupt within their batches after updating the value to have sub-batch +precision on the wakeup. Each batch can signal an user fence to indicate +the completion of next level batch. The completion of very first level batch +needs to be signaled by the command streamer. The user must provide the +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE +extension of execbuff ioctl, so that KMD can setup the command streamer to +signal it.
+User/Memory fence can also be supplied to the kernel driver to signal/wake up +the user process after completion of an asynchronous operation.
+When VM_BIND ioctl was provided with a user/memory fence via the +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion +of binding of that mapping. All async binds/unbinds are serialized, hence +signaling of user/memory fence also indicate the completion of all previous +binds/unbinds.
+This feature will be derived from the below original work: +https://patchwork.freedesktop.org/patch/349417/
This is 1:1 tied to long running compute mode contexts (which in the uapi doc must reference the endless amounts of bikeshed summary we have in the docs about indefinite fences).
I'd put this into a new section about compute and userspace memory fences support, with this and the next chapter ...
+VM_BIND use cases +==================
... and then make this section here focus entirely on additional vm_bind use-cases that we'll be adding later on. Which doesn't need to go into any details, it's just justification for why we want to build the world on top of vm_bind.
+Long running Compute contexts +------------------------------ +Usage of dma-fence expects that they complete in reasonable amount of time. +Compute on the other hand can be long running. Hence it is appropriate for +compute to use user/memory fence and dma-fence usage will be limited to +in-kernel consumption only. This requires an execbuff uapi extension to pass +in user fence. Compute must opt-in for this mechanism with +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
+The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence +and implicit dependency setting is not allowed on long running contexts.
+Where GPU page faults are not available, kernel driver upon buffer invalidation +will initiate a suspend (preemption) of long running context with a dma-fence +attached to it. And upon completion of that suspend fence, finish the +invalidation, revalidate the BO and then resume the compute context. This is +done by having a per-context fence (called suspend fence) proxying as +i915_request fence. This suspend fence is enabled when there is a wait on it, +which triggers the context preemption.
+This is much easier to support with VM_BIND compared to the current heavier +execbuff path resource attachment.
There's a bunch of tricky code around compute mode context support, like the preempt ctx fence (or suspend fence or whatever you want to call it), and the resume work. And I think that code should be shared across drivers.
I think the right place to put this is into drm/sched, somewhere attached to the drm_sched_entity structure. I expect i915 folks to collaborate with amd and ideally also get amdkfd to adopt the same thing if possible. At least Christian has mentioned in the past that he's a bit unhappy about how this works.
Also drm/sched has dependency tracking, which will be needed to pipeline context resume operations. That needs to be used instead of i915-gem inventing yet another dependency tracking data structure (it already has 3 and that's roughly 3 too many).
This means compute mode support and userspace memory fences are blocked on the drm/sched conversion, but *eh* add it to the list of reasons for why drm/sched needs to happen.
Also since we only have support for compute mode ctx in our internal tree with the guc scheduler backend anyway, and the first conversion target is the guc backend, I don't think this actually holds up a lot of the code.
+Low Latency Submission +----------------------- +Allows compute UMD to directly submit GPU jobs instead of through execbuff +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
This is really just a special case of compute mode contexts, I think I'd include that in there, but explain better what it requires (i.e. vm_bind not being synchronized against execbuf).
+Debugger +--------- +With debug event interface user space process (debugger) is able to keep track +of and act upon resources created by another process (debuggee) and attached +to GPU via vm_bind interface.
+Mesa/Valkun +------------ +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving +performance. For Vulkan it should be straightforward to use VM_BIND. +For Iris implicit buffer tracking must be implemented before we can harness +VM_BIND benefits. With increasing GPU hardware performance reducing CPU +overhead becomes more important.
Just to clarify, I don't think we can land vm_bind into upstream if it doesn't work 100% for vk. There's a bit much "can" instead of "will" in this section.
+Page level hints settings +-------------------------- +VM_BIND allows any hints setting per mapping instead of per BO. +Possible hints include read-only, placement and atomicity. +Sub-BO level placement hint will be even more relevant with +upcoming GPU on-demand page fault support.
+Page level Cache/CLOS settings +------------------------------- +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+Shared Virtual Memory (SVM) support +------------------------------------ +VM_BIND interface can be used to map system memory directly (without gem BO +abstraction) using the HMM interface.
Userptr is absent here (and it's not the same as svm, at least on discrete), and this is needed for the initial version since otherwise vk can't use it because we're not at feature parity.
IRC discussions by Maarten and Dave came up with the idea that maybe userptr for vm_bind should work _without_ any gem bo as backing storage, since that guarantees that people don't come up with funny ideas like trying to share such bo across processes or mmap it and other nonsense which just doesn't work.
+Broder i915 cleanups +===================== +Supporting this whole new vm_bind mode of binding which comes with its own +usecases to support and the locking requirements requires proper integration +with the existing i915 driver. This calls for some broader i915 driver +cleanups/simplifications for maintainability of the driver going forward. +Here are few things identified and are being looked into.
+- Make pagetable allocations evictable and manage them similar to VM_BIND
+  mapped objects. Page table pages are similar to persistent mappings of a
+  VM (difference here are that the page table pages will not
+  have an i915_vma structure and after swapping pages back in, parent page
+  link needs to be updated).
See above, but I think this should be included as part of the initial vm_bind push.
+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
+  do not use it and complexity it brings in is probably more than the
+  performance advantage we get in legacy execbuff case.
+- Remove vma->open_count counting
+- Remove i915_vma active reference tracking. Instead use underlying BO's
+  dma-resv fence list to determine if a i915_vma is active or not.
So this is a complete mess, and really should not exist. I think it needs to be removed before we try to make i915_vma even more complex by adding vm_bind.
The other thing I've been pondering here is that vm_bind is really completely different from legacy vm structures for a lot of reasons:
- no relocation or softpin handling, which means vm_bind has no reason to ever look at the i915_vma structure in execbuf code. Unfortunately execbuf has been rewritten to be vma instead of obj centric, so it's a 100% mismatch
- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain that because the kernel manages the virtual address space fully. Again ideally that entire vma_move_to_active code and everything related to it would simply not exist.
- similar on the eviction side, the rules are quite different: For vm_bind we never tear down the vma, instead it's just moved to the list of evicted vma. Legacy vm have no need for all these additional lists, so another huge confusion.
- if the refcount is done correctly for vm_bind we wouldn't need the tricky code in the bo close paths. Unfortunately legacy vm with relocations and softpin require that vma are only a weak reference, so that cannot be removed.
- there's also a ton of special cases for ggtt handling, like the different views (for display, partial views for mmap), but also the gen2/3 alignment and padding requirements which vm_bind never needs.
I think the right thing here is to massively split the implementation behind some solid vm/vma abstraction, with a base class for vm and vma which _only_ has the pieces which both vm_bind and the legacy vm stuff needs. But it's a bit tricky to get there. I think a workable path would be:
- Add a new base class to both i915_address_space and i915_vma, which starts out empty.
- As vm_bind code lands, move things that vm_bind code needs into these base classes
- The goal should be that these base classes are a stand-alone library that other drivers could reuse. Like we've done with the buddy allocator, which first moved from i915-gem to i915-ttm, and which amd now moved to drm/ttm for reuse by amdgpu. Ideally other drivers interested in adding something like vm_bind should be involved from the start (or maybe the entire thing reused in amdgpu, they're looking at vk sparse binding support too or at least have perf issues I think).
- Locking must be the same across all implementations, otherwise it's really not an abstraction. i915 screwed this up terribly by having different locking rules for ppgtt and ggtt, which is just nonsense.
- The legacy specific code needs to be extracted as much as possible and shoved into separate files. In execbuf this means we need to get back to object centric flow, and the slowpaths need to become a lot simpler again (Maarten has cleaned up some of this, but there's still a silly amount of hacks in there with funny layering).
- I think if stuff like the vma eviction details (list movement and locking and refcounting of the underlying object)
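The base-class idea from the list above could look roughly like this in C. It is a sketch under assumptions, not a proposal for the real layout: the struct names, the fields and the `sketch_container_of` macro are all hypothetical, and the kernel would use its own `container_of()`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical common base: only what both flavours would need. */
struct sketch_vma_base {
	unsigned long long start; /* GPU virtual address of the mapping */
	unsigned long long size;  /* length of the mapping in bytes */
};

/* Legacy-only state lives in the derived struct, not in the base. */
struct sketch_legacy_vma {
	struct sketch_vma_base base;
	int pin_count;
};

/* vm_bind-only state likewise stays out of the base. */
struct sketch_bind_vma {
	struct sketch_vma_base base;
	int evicted;
};

/* Recover the derived struct from a pointer to its embedded base. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Shared helper: works on the base, so both flavours can call it. */
static unsigned long long sketch_vma_end(const struct sketch_vma_base *vma)
{
	assert(vma != NULL);
	return vma->start + vma->size;
}
```

Code written against `struct sketch_vma_base` never sees the legacy or vm_bind specific fields, which is the property that would let such helpers move into a shared library.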
+These can be worked upon after intitial vm_bind support is added.
I don't think that works, given how badly i915-gem team screwed up in other places. And those places had to be fixed by adopting shared code like ttm. Plus there's already a huge unfulfilled promise pending with the drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
Cheers, Daniel
+UAPI +===== +Uapi definiton can be found here: +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst index 91e93a705230..7d10c36b268d 100644 --- a/Documentation/gpu/rfc/index.rst +++ b/Documentation/gpu/rfc/index.rst @@ -23,3 +23,7 @@ host such documentation: .. toctree::
    i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst
-- 2.21.0.rc0.32.g243a4c7e27
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
One thing I've forgotten, since it's only hinted at here: If/when we switch tlb flushing from the current dumb&synchronous implementation we now have in i915 in upstream to one with batching using dma_fence, then I think that should be something which is done with a small helper library of shared code too. The batching is somewhat tricky, and you need to make sure you put the fence into the right dma_resv_usage slot, and the trick of replacing the vm fence with a tlb flush fence is also a good reason to share the code so we only have it once.
Christian's recent work also has some prep work for this already with the fence replacing trick. -Daniel
On Thu, 31 Mar 2022 at 10:28, Daniel Vetter daniel@ffwll.ch wrote:
Adding a pile of people who've expressed interest in vm_bind for their drivers.
Also note to the intel folks: This is largely written with me having my subsystem co-maintainer hat on, i.e. what I think is the right thing to do here for the subsystem at large. There is substantial rework involved here, but it's not any different from i915 adopting ttm or i915 adopting drm/sched, and I do think this stuff needs to happen in one form or another.
On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
VM_BIND design document with description of intended use cases.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++ Documentation/gpu/rfc/index.rst | 4 + 2 files changed, 214 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..cdc6bb25b942 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,210 @@ +========================================== +I915 VM_BIND feature design and use cases +==========================================
+VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BOs at specified GPU virtual addresses on +a specified address space (VM).
+These mappings (also referred to as persistent mappings) will be persistent +across multiple GPU submissions (execbuff) issued by the UMD, without user +having to provide a list of all required mappings during each submission +(as required by older execbuff mode).
+VM_BIND ioctl deferes binding the mappings until next execbuff submission +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE +flag is set (useful if mapping is required for an active context).
So this is a screw-up I've done, and for upstream I think we need to fix it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and I was wrong suggesting we should do this a few years back when we kicked this off internally :-(
What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few things on top:
- in and out fences, like with execbuf, to allow userspace to sync with execbuf as needed
- for compute-mode context this means userspace memory fences
- for legacy context this means a timeline syncobj in drm_syncobj
No sync_file or anything else like this at all. This means a bunch of work, but also it'll have benefits because it means we should be able to use exactly the same code paths and logic for both compute and for legacy context, because drm_syncobj supports future fence semantics.
Also on the implementation side we still need to install dma_fence to the various dma_resv, and for this we need the new dma_resv_usage series from Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING flag to make sure they never result in an oversync issue with execbuf. I don't think trying to land vm_bind without that prep work in dma_resv_usage makes sense.
Also as soon as dma_resv_usage has landed there's a few cleanups we should do in i915:
- ttm bo moving code should probably simplify a bit (and maybe more of the code should be pushed as helpers into ttm)
- clflush code should be moved over to using USAGE_KERNEL and the various hacks and special cases should be ditched. See df94fd05e69e ("drm/i915: expand on the kernel-doc for cache_dirty") for a bit more context
This is still not yet enough, since if a vm_bind races with an eviction we might stall on the new buffers being readied first before the context can continue. This needs some care to make sure that vma which aren't fully bound yet are on a separate list, and vma which are marked for unbinding are removed from the main working set list as soon as possible.
All of these things are relevant for the uapi semantics, which means
- they need to be documented in the uapi kerneldoc, ideally with example flows
- umd need to ack this
The other thing here is the async/nonblocking path. I think we still need that one, but again it should not sync with anything going on in execbuf, but simply execute the ioctl code in a kernel thread. The idea here is that this works like a special gpu engine, so that compute and vk can schedule bindings interleaved with rendering. This should be enough to get a performant vk sparse binding/textures implementation.
But I'm not entirely sure on this one, so this definitely needs acks from umds.
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension. +A VM in VM_BIND mode will not support older execbuff mode of binding.
+UMDs can still send BOs of these persistent mappings in execlist of execbuff +for specifying BO dependencies (implicit fencing) and to use BO as a batch, +but those BOs should be mapped ahead via vm_bind ioctl.
should or must?
Also I'm not really sure that's a great interface. The batchbuffer really only needs to be an address, so maybe all we need is an extension to supply an u64 batchbuffer address instead of trying to retrofit this into an unfitting current uapi.
And for implicit sync there's two things:
- for vk I think the right uapi is the dma-buf fence import/export ioctls from Jason Ekstrand. I think we should land that first instead of hacking funny concepts together
- for gl the dma-buf import/export might not be fast enough, since gl needs to do a _lot_ of implicit sync. There we might need to use the execbuffer buffer list, but then we should have extremely clear uapi rules which disallow _everything_ except setting the explicit sync uapi
Again all this stuff needs to be documented in detail in the kerneldoc uapi spec.
+VM_BIND features include,
+- Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+- VA mapping can map to a partial section of the BO (partial binding).
+- Support capture of persistent mappings in the dump upon GPU error.
+- TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  usecases will be helpful.
+- Asynchronous vm_bind and vm_unbind support.
+- VM_BIND uses user/memory fence mechanism for signaling bind completion
+  and for signaling batch completion in long running contexts (explained
+  below).
This should all be in the kerneldoc.
+VM_PRIVATE objects +------------------ +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
Two things:
I think the above is required for initial vm_bind for vk, it kinda doesn't make much sense without that, and will allow us to match amdgpu and radeonsi
Christian König just landed ttm bulk lru helpers, and I think we need to use those. This means vm_bind will only work with the ttm backend, but that's what we have for the big dgpu where vm_bind helps more in terms of performance, and the igfx conversion to ttm is already going on.
Furthermore the i915 shrinker lru has stopped being an lru, so I think that should also be moved over to the ttm lru in some fashion to make sure we once again have a reasonable and consistent memory aging and reclaim architecture. The current code is just too much of a complete mess.
And since this is all fairly integral to how the code arch works I don't think merging a different version which isn't based on ttm bulk lru helpers makes sense.
Also I do think the page table lru handling needs to be included here, because that's another complete hand-rolled separate world for not much good reasons. I guess that can happen in parallel with the initial vm_bind bring-up, but it needs to be completed by the time we add the features beyond the initial support needed for vk.
+VM_BIND locking hierarchy +------------------------- +VM_BIND locking order is as below.
+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
+   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
+   In future, when GPU page faults are supported, we can potentially use a
+   rwsem instead, so that multiple pagefault handlers can take the read side
+   lock to look up the mapping and hence can run in parallel.
+2) The BO's dma-resv lock will protect i915_vma state and needs to be held
+   while binding a vma and while updating dma-resv fence list of a BO.
+   The private BOs of a VM will all share a dma-resv object.
+   This lock is held in vm_bind call for immediate binding, during vm_unbind
+   call for unbinding and during execbuff path for binding the mapping and
+   updating the dma-resv fence list of the BO.
+3) Spinlock/s to protect some of the VM's lists.
+We will also need support for bulk LRU movement of persistent mapping to +avoid additional latencies in execbuff path.
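The three-level order in the hunk above can be written down as a small rank checker, loosely in the spirit of lockdep. This is only a sketch under assumptions: the enum names and the `toy_lock_*` helpers are hypothetical, no real locks are taken, and the toy refuses to nest two locks of the same rank (real code locking several dma-resv objects at once would need ww_mutex-style handling, which is not modelled here).

```c
#include <assert.h>
#include <stdbool.h>

/* Ranks follow the order listed in the patch: mutex, then resv, then spinlock. */
enum toy_lock_rank {
	TOY_LOCK_NONE = 0,
	TOY_LOCK_VM_BIND_MUTEX = 1,
	TOY_LOCK_DMA_RESV = 2,
	TOY_LOCK_VM_LIST_SPINLOCK = 3,
};

#define TOY_MAX_HELD 8

/* Per-task stack of currently held lock ranks. */
struct toy_lock_ctx {
	enum toy_lock_rank held[TOY_MAX_HELD];
	int depth;
};

/* Succeeds only if the new lock ranks strictly after the innermost held one. */
bool toy_lock_acquire(struct toy_lock_ctx *ctx, enum toy_lock_rank rank)
{
	enum toy_lock_rank top =
		ctx->depth ? ctx->held[ctx->depth - 1] : TOY_LOCK_NONE;

	if (ctx->depth == TOY_MAX_HELD || rank <= top)
		return false; /* would invert (or repeat) the documented order */
	ctx->held[ctx->depth++] = rank;
	return true;
}

/* Locks must be dropped innermost first. */
bool toy_lock_release(struct toy_lock_ctx *ctx, enum toy_lock_rank rank)
{
	if (!ctx->depth || ctx->held[ctx->depth - 1] != rank)
		return false;
	ctx->depth--;
	return true;
}
```

Taking the dma-resv rank first and then asking for the vm_bind mutex is rejected, which is the inversion the documented order is meant to rule out.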
This needs more detail and explanation of how each level is required. Also the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
Like "some of the VM's lists" explains pretty much nothing.
+GPU page faults +---------------- +Both older execbuff mode and the newer VM_BIND mode of binding will require +using dma-fence to ensure residency. +In future when GPU page faults are supported, no dma-fence usage is required +as residency is purely managed by installing and removing/invalidating ptes.
This is a bit confusing. I think one part of this should be moved into the section with future vm_bind use-cases (we're not going to support page faults with legacy softpin or even worse, relocations). The locking discussion should be part of the much longer list of use cases that motivate the locking design.
+User/Memory Fence +================== +The idea is to take a user specified virtual address and install an interrupt +handler to wake up the current task when the memory location passes the user +supplied filter.
+User/Memory fence is a <address, value> pair. To signal the user fence, +the specified value will be written at the specified virtual address and +the waiting process will be woken up. User can wait on a user fence with the +gem_wait_user_fence ioctl.
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify +interrupt within their batches after updating the value to have sub-batch +precision on the wakeup. Each batch can signal a user fence to indicate +the completion of next level batch. The completion of the very first level batch +needs to be signaled by the command streamer. The user must provide the +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE +extension of execbuff ioctl, so that KMD can set up the command streamer to +signal it.
+User/Memory fence can also be supplied to the kernel driver to signal/wake up +the user process after completion of an asynchronous operation.
+When the VM_BIND ioctl is provided with a user/memory fence via the +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion +of binding of that mapping. All async binds/unbinds are serialized, hence +signaling of user/memory fence also indicates the completion of all previous +binds/unbinds.
+This feature will be derived from the below original work: +https://patchwork.freedesktop.org/patch/349417/
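To make the <address, value> pair from the hunk above concrete, here is a minimal userspace sketch. It is not the proposed uapi: `toy_user_fence` and both helpers are hypothetical names, the "filter" is assumed to be a plain equality test, and the wakeup of waiters (interrupt handler, gem_wait_user_fence) is not modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A user/memory fence: a memory location plus the value that signals it. */
struct toy_user_fence {
	_Atomic uint64_t *addr;
	uint64_t value;
};

/* Signaling writes the agreed value to the agreed address. */
void toy_user_fence_signal(const struct toy_user_fence *f)
{
	atomic_store_explicit(f->addr, f->value, memory_order_release);
	/* A real implementation would also wake up any waiters here. */
}

/* Assumed filter: the fence has passed once the location equals the value. */
bool toy_user_fence_signaled(const struct toy_user_fence *f)
{
	return atomic_load_explicit(f->addr, memory_order_acquire) == f->value;
}
```

A waiter would re-check `toy_user_fence_signaled()` after each wakeup; since the patch says async binds/unbinds are serialized, seeing one bind's fence signaled would also imply that earlier binds have completed.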
This is 1:1 tied to long running compute mode contexts (which in the uapi doc must reference the endless amounts of bikeshed summary we have in the docs about indefinite fences).
I'd put this into a new section about compute and userspace memory fences support, with this and the next chapter ...
+VM_BIND use cases +==================
... and then make this section here focus entirely on additional vm_bind use-cases that we'll be adding later on. Which doesn't need to go into any details, it's just justification for why we want to build the world on top of vm_bind.
+Long running Compute contexts +------------------------------ +Usage of dma-fence expects that they complete in a reasonable amount of time. +Compute on the other hand can be long running. Hence it is appropriate for +compute to use user/memory fence and dma-fence usage will be limited to +in-kernel consumption only. This requires an execbuff uapi extension to pass +in user fence. Compute must opt-in for this mechanism with +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
+The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence +and implicit dependency setting are not allowed on long running contexts.
+Where GPU page faults are not available, kernel driver upon buffer invalidation +will initiate a suspend (preemption) of long running context with a dma-fence +attached to it. And upon completion of that suspend fence, finish the +invalidation, revalidate the BO and then resume the compute context. This is +done by having a per-context fence (called suspend fence) proxying as +i915_request fence. This suspend fence is enabled when there is a wait on it, +which triggers the context preemption.
+This is much easier to support with VM_BIND compared to the current heavier +execbuff path resource attachment.
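The suspend/resume sequence described in the hunk above reads as a small state machine. The sketch below is a toy model under assumptions: all names (`toy_ctx`, `toy_suspend_fence_enable_signaling`, and so on) are hypothetical, and a fresh suspend fence is assumed to be armed on resume, which the patch text does not spell out.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_ctx_state {
	TOY_CTX_RUNNING,
	TOY_CTX_PREEMPTING,
	TOY_CTX_SUSPENDED,
};

struct toy_ctx {
	enum toy_ctx_state state;
	bool suspend_fence_signaled;
	bool bo_resident;
};

/* A waiter appearing on the suspend fence is what triggers preemption. */
void toy_suspend_fence_enable_signaling(struct toy_ctx *ctx)
{
	if (ctx->state == TOY_CTX_RUNNING)
		ctx->state = TOY_CTX_PREEMPTING;
}

/* Preemption finished: the context is off the hardware, the fence signals. */
void toy_preempt_complete(struct toy_ctx *ctx)
{
	if (ctx->state == TOY_CTX_PREEMPTING) {
		ctx->state = TOY_CTX_SUSPENDED;
		ctx->suspend_fence_signaled = true;
	}
}

/* The invalidation may only finish once the suspend fence has signaled. */
bool toy_invalidate_bo(struct toy_ctx *ctx)
{
	if (!ctx->suspend_fence_signaled)
		return false;
	ctx->bo_resident = false;
	return true;
}

/* Revalidate the BO, arm a fresh suspend fence, resume the context. */
bool toy_revalidate_and_resume(struct toy_ctx *ctx)
{
	if (ctx->state != TOY_CTX_SUSPENDED)
		return false;
	ctx->bo_resident = true;
	ctx->suspend_fence_signaled = false;
	ctx->state = TOY_CTX_RUNNING;
	return true;
}
```

The point of the model is the ordering: the BO cannot be invalidated while the context may still be running, and the context cannot resume before the BO is resident again.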
There's a bunch of tricky code around compute mode context support, like the preempt ctx fence (or suspend fence or whatever you want to call it), and the resume work. And I think that code should be shared across drivers.
I think the right place to put this is into drm/sched, somewhere attached to the drm_sched_entity structure. I expect i915 folks to collaborate with amd and ideally also get amdkfd to adopt the same thing if possible. At least Christian has mentioned in the past that he's a bit unhappy about how this works.
Also drm/sched has dependency tracking, which will be needed to pipeline context resume operations. That needs to be used instead of i915-gem inventing yet another dependency tracking data structure (it already has 3 and that's roughly 3 too many).
This means compute mode support and userspace memory fences are blocked on the drm/sched conversion, but *eh* add it to the list of reasons for why drm/sched needs to happen.
Also since we only have support for compute mode ctx in our internal tree with the guc scheduler backend anyway, and the first conversion target is the guc backend, I don't think this actually holds up a lot of the code.
+Low Latency Submission +----------------------- +Allows compute UMD to directly submit GPU jobs instead of through execbuff +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
This is really just a special case of compute mode contexts, I think I'd include that in there, but explain better what it requires (i.e. vm_bind not being synchronized against execbuf).
+Debugger +--------- +With the debug event interface, a user space process (debugger) is able to keep track +of and act upon resources created by another process (debuggee) and attached +to GPU via vm_bind interface.
+Mesa/Vulkan +------------ +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving +performance. For Vulkan it should be straightforward to use VM_BIND. +For Iris implicit buffer tracking must be implemented before we can harness +VM_BIND benefits. With increasing GPU hardware performance reducing CPU +overhead becomes more important.
Just to clarify, I don't think we can land vm_bind into upstream if it doesn't work 100% for vk. There's a bit much "can" instead of "will" in this section.
+Page level hints settings +-------------------------- +VM_BIND allows any hints setting per mapping instead of per BO. +Possible hints include read-only, placement and atomicity. +Sub-BO level placement hint will be even more relevant with +upcoming GPU on-demand page fault support.
+Page level Cache/CLOS settings +------------------------------- +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+Shared Virtual Memory (SVM) support +------------------------------------ +VM_BIND interface can be used to map system memory directly (without gem BO +abstraction) using the HMM interface.
Userptr is absent here (and it's not the same as svm, at least on discrete), and this is needed for the initial version since otherwise vk can't use it because we're not at feature parity.
IRC discussions by Maarten and Dave came up with the idea that maybe userptr for vm_bind should work _without_ any gem bo as backing storage, since that guarantees that people don't come up with funny ideas like trying to share such a bo across processes or mmap it and other nonsense which just doesn't work.
+Broader i915 cleanups +===================== +Supporting this whole new vm_bind mode of binding which comes with its own +usecases to support and the locking requirements requires proper integration +with the existing i915 driver. This calls for some broader i915 driver +cleanups/simplifications for maintainability of the driver going forward. +Here are a few things that have been identified and are being looked into.
+- Make pagetable allocations evictable and manage them similar to VM_BIND
+  mapped objects. Page table pages are similar to persistent mappings of a
+  VM (the difference here is that the page table pages will not
+  have an i915_vma structure and after swapping pages back in, parent page
+  link needs to be updated).
See above, but I think this should be included as part of the initial vm_bind push.
+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
+  does not use it and the complexity it brings in is probably more than the
+  performance advantage we get in legacy execbuff case.
+- Remove vma->open_count counting
+- Remove i915_vma active reference tracking. Instead use underlying BO's
+  dma-resv fence list to determine if an i915_vma is active or not.
So this is a complete mess, and really should not exist. I think it needs to be removed before we try to make i915_vma even more complex by adding vm_bind.
The other thing I've been pondering here is that vm_bind is really completely different from legacy vm structures for a lot of reasons:
- no relocation or softpin handling, which means vm_bind has no reason to ever look at the i915_vma structure in execbuf code. Unfortunately execbuf has been rewritten to be vma instead of obj centric, so it's a 100% mismatch
- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain that because the kernel manages the virtual address space fully. Again ideally that entire vma_move_to_active code and everything related to it would simply not exist.
- similar on the eviction side, the rules are quite different: For vm_bind we never tear down the vma, instead it's just moved to the list of evicted vma. Legacy vm have no need for all these additional lists, so another huge confusion.
- if the refcount is done correctly for vm_bind we wouldn't need the tricky code in the bo close paths. Unfortunately legacy vm with relocations and softpin require that vma are only a weak reference, so that cannot be removed.
- there's also a ton of special cases for ggtt handling, like the different views (for display, partial views for mmap), but also the gen2/3 alignment and padding requirements which vm_bind never needs.
I think the right thing here is to massively split the implementation behind some solid vm/vma abstraction, with a base class for vm and vma which _only_ has the pieces which both vm_bind and the legacy vm stuff needs. But it's a bit tricky to get there. I think a workable path would be:
- Add a new base class to both i915_address_space and i915_vma, which starts out empty.
- As vm_bind code lands, move things that vm_bind code needs into these base classes
- The goal should be that these base classes are a stand-alone library that other drivers could reuse. Like we've done with the buddy allocator, which first moved from i915-gem to i915-ttm, and which amd now moved to drm/ttm for reuse by amdgpu. Ideally other drivers interested in adding something like vm_bind should be involved from the start (or maybe the entire thing reused in amdgpu, they're looking at vk sparse binding support too or at least have perf issues I think).
- Locking must be the same across all implementations, otherwise it's really not an abstraction. i915 screwed this up terribly by having different locking rules for ppgtt and ggtt, which is just nonsense.
- The legacy specific code needs to be extracted as much as possible and shoved into separate files. In execbuf this means we need to get back to object centric flow, and the slowpaths need to become a lot simpler again (Maarten has cleaned up some of this, but there's still a silly amount of hacks in there with funny layering).
- I think if stuff like the vma eviction details (list movement and locking and refcounting of the underlying object)
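For readers less used to the C idiom, the "base class" in the list above would normally mean struct embedding plus a container_of-style downcast. The sketch below shows only that idiom, assuming nothing about the eventual i915 layout: `toy_vma_base`, `toy_legacy_vma`, `toy_bind_vma` and their fields are all hypothetical, and `toy_container_of` is a simplified userspace stand-in for the kernel macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define toy_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Base "class": only state that both legacy and vm_bind paths need. */
struct toy_vma_base {
	uint64_t start;
	uint64_t size;
};

/* Legacy vma keeps its legacy-only bookkeeping outside the base. */
struct toy_legacy_vma {
	struct toy_vma_base base;
	unsigned int open_count;
};

/* vm_bind vma adds only what persistent mappings need. */
struct toy_bind_vma {
	struct toy_vma_base base;
	int evicted;
};

/* Shared helper: written once against the base, usable by both kinds. */
uint64_t toy_vma_end(const struct toy_vma_base *vma)
{
	return vma->start + vma->size;
}

/* Downcast for code that knows it is on the legacy path. */
struct toy_legacy_vma *to_toy_legacy_vma(struct toy_vma_base *base)
{
	return toy_container_of(base, struct toy_legacy_vma, base);
}
```

Code written against `toy_vma_base` never sees `open_count` or `evicted`, which is the separation the list above asks for.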
+These can be worked upon after initial vm_bind support is added.
I don't think that works, given how badly i915-gem team screwed up in other places. And those places had to be fixed by adopting shared code like ttm. Plus there's already a huge unfulfilled promise pending with the drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
Cheers, Daniel
+UAPI +===== +Uapi definition can be found here: +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst index 91e93a705230..7d10c36b268d 100644 --- a/Documentation/gpu/rfc/index.rst +++ b/Documentation/gpu/rfc/index.rst @@ -23,3 +23,7 @@ host such documentation: .. toctree::
i915_scheduler.rst
+.. toctree::
+    i915_vm_bind.rst
-- 2.21.0.rc0.32.g243a4c7e27
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Thu, Mar 31, 2022 at 01:37:08PM +0200, Daniel Vetter wrote:
One thing I've forgotten, since it's only hinted at here: If/when we switch tlb flushing from the current dumb&synchronous implementation we now have in i915 in upstream to one with batching using dma_fence, then I think that should be something which is done with a small helper library of shared code too. The batching is somewhat tricky, and you need to make sure you put the fence into the right dma_resv_usage slot, and the trick of replacing the vm fence with a tlb flush fence is also a good reason to share the code so we only have it once.
Christian's recent work also has some prep work for this already with the fence replacing trick.
Sure, but this optimization is not required for initial vm_bind support to land right? We can look at it soon after that. Is that ok? I have made a reference to this TLB flush batching work in the rst file.
Niranjana
-Daniel
On Thu, 31 Mar 2022 at 10:28, Daniel Vetter daniel@ffwll.ch wrote:
Adding a pile of people who've expressed interest in vm_bind for their drivers.
Also note to the intel folks: This is largely written with me having my subsystem co-maintainer hat on, i.e. what I think is the right thing to do here for the subsystem at large. There is substantial rework involved here, but it's not any different from i915 adopting ttm or i915 adopting drm/sched, and I do think this stuff needs to happen in one form or another.
On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
VM_BIND design document with description of intended use cases.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++ Documentation/gpu/rfc/index.rst | 4 + 2 files changed, 214 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..cdc6bb25b942 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,210 @@ +========================================== +I915 VM_BIND feature design and use cases +==========================================
+VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BO at specified GPU virtual addresses on +a specified address space (VM).
+These mappings (also referred to as persistent mappings) will be persistent +across multiple GPU submissions (execbuff) issued by the UMD, without user +having to provide a list of all required mappings during each submission +(as required by older execbuff mode).
+VM_BIND ioctl defers binding the mappings until next execbuff submission +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE +flag is set (useful if mapping is required for an active context).
So this is a screw-up I've done, and for upstream I think we need to fix it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and I was wrong suggesting we should do this a few years back when we kicked this off internally :-(
What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few things on top:
- in and out fences, like with execbuf, to allow userspace to sync with execbuf as needed
- for compute-mode context this means userspace memory fences
- for legacy context this means a timeline syncobj in drm_syncobj
No sync_file or anything else like this at all. This means a bunch of work, but also it'll have benefits because it means we should be able to use exactly the same code paths and logic for both compute and for legacy context, because drm_syncobj supports future fence semantics.
Also on the implementation side we still need to install dma_fence to the various dma_resv, and for this we need the new dma_resv_usage series from Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING flag to make sure they never result in an oversync issue with execbuf. I don't think trying to land vm_bind without that prep work in dma_resv_usage makes sense.
Also as soon as dma_resv_usage has landed there's a few cleanups we should do in i915:
- ttm bo moving code should probably simplify a bit (and maybe more of the code should be pushed as helpers into ttm)
- clflush code should be moved over to using USAGE_KERNEL and the various hacks and special cases should be ditched. See df94fd05e69e ("drm/i915: expand on the kernel-doc for cache_dirty") for a bit more context
This is still not yet enough, since if a vm_bind races with an eviction we might stall on the new buffers being readied first before the context can continue. This needs some care to make sure that vma which aren't fully bound yet are on a separate list, and vma which are marked for unbinding are removed from the main working set list as soon as possible.
All of these things are relevant for the uapi semantics, which means
- they need to be documented in the uapi kerneldoc, ideally with example flows
- umd need to ack this
The other thing here is the async/nonblocking path. I think we still need that one, but again it should not sync with anything going on in execbuf, but simply execute the ioctl code in a kernel thread. The idea here is that this works like a special gpu engine, so that compute and vk can schedule bindings interleaved with rendering. This should be enough to get a performant vk sparse binding/textures implementation.
But I'm not entirely sure on this one, so this definitely needs acks from umds.
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension. +A VM in VM_BIND mode will not support older execbuff mode of binding.
+UMDs can still send BOs of these persistent mappings in execlist of execbuff +for specifying BO dependencies (implicit fencing) and to use BO as a batch, +but those BOs should be mapped ahead via vm_bind ioctl.
should or must?
Also I'm not really sure that's a great interface. The batchbuffer really only needs to be an address, so maybe all we need is an extension to supply an u64 batchbuffer address instead of trying to retrofit this into an unfitting current uapi.
And for implicit sync there's two things:
- for vk I think the right uapi is the dma-buf fence import/export ioctls from Jason Ekstrand. I think we should land that first instead of hacking funny concepts together
- for gl the dma-buf import/export might not be fast enough, since gl needs to do a _lot_ of implicit sync. There we might need to use the execbuffer buffer list, but then we should have extremely clear uapi rules which disallow _everything_ except setting the explicit sync uapi
Again all this stuff needs to be documented in detail in the kerneldoc uapi spec.
+VM_BIND features include,
+- Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+- VA mapping can map to a partial section of the BO (partial binding).
+- Support capture of persistent mappings in the dump upon GPU error.
+- TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  usecases will be helpful.
+- Asynchronous vm_bind and vm_unbind support.
+- VM_BIND uses user/memory fence mechanism for signaling bind completion
+  and for signaling batch completion in long running contexts (explained
+  below).
This should all be in the kerneldoc.
+VM_PRIVATE objects +------------------ +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
Two things:
I think the above is required for initial vm_bind for vk, it kinda doesn't make much sense without that, and will allow us to match amdgpu and radeonsi
Christian König just landed ttm bulk lru helpers, and I think we need to use those. This means vm_bind will only work with the ttm backend, but that's what we have for the big dgpu where vm_bind helps more in terms of performance, and the igfx conversion to ttm is already going on.
Furthermore the i915 shrinker lru has stopped being an lru, so I think that should also be moved over to the ttm lru in some fashion to make sure we once again have a reasonable and consistent memory aging and reclaim architecture. The current code is just too much of a complete mess.
And since this is all fairly integral to how the code arch works I don't think merging a different version which isn't based on ttm bulk lru helpers makes sense.
Also I do think the page table lru handling needs to be included here, because that's another complete hand-rolled separate world for no good reason. I guess that can happen in parallel with the initial vm_bind bring-up, but it needs to be completed by the time we add the features beyond the initial support needed for vk.
+VM_BIND locking hierarchy +------------------------- +VM_BIND locking order is as below.
+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
+   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
+   In future, when GPU page faults are supported, we can potentially use a
+   rwsem instead, so that multiple pagefault handlers can take the read side
+   lock to look up the mapping and hence can run in parallel.
+2) The BO's dma-resv lock will protect i915_vma state and needs to be held
+   while binding a vma and while updating dma-resv fence list of a BO.
+   The private BOs of a VM will all share a dma-resv object.
+   This lock is held in vm_bind call for immediate binding, during vm_unbind
+   call for unbinding and during execbuff path for binding the mapping and
+   updating the dma-resv fence list of the BO.
+3) Spinlock/s to protect some of the VM's lists.
+We will also need support for bulk LRU movement of persistent mapping to +avoid additional latencies in execbuff path.
This needs more detail and explanation of how each level is required. Also the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
Like "some of the VM's lists" explains pretty much nothing.
+GPU page faults +---------------- +Both older execbuff mode and the newer VM_BIND mode of binding will require +using dma-fence to ensure residency. +In future when GPU page faults are supported, no dma-fence usage is required +as residency is purely managed by installing and removing/invalidating ptes.
This is a bit confusing. I think one part of this should be moved into the section with future vm_bind use-cases (we're not going to support page faults with legacy softpin or even worse, relocations). The locking discussion should be part of the much longer list of use cases that motivate the locking design.
+User/Memory Fence +================== +The idea is to take a user specified virtual address and install an interrupt +handler to wake up the current task when the memory location passes the user +supplied filter.
+User/Memory fence is a <address, value> pair. To signal the user fence, +the specified value will be written at the specified virtual address and +the waiting process will be woken up. User can wait on a user fence with the +gem_wait_user_fence ioctl.
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify +interrupt within their batches after updating the value to have sub-batch +precision on the wakeup. Each batch can signal a user fence to indicate +the completion of next level batch. The completion of the very first level batch +needs to be signaled by the command streamer. The user must provide the +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE +extension of execbuff ioctl, so that KMD can set up the command streamer to +signal it.
+User/Memory fence can also be supplied to the kernel driver to signal/wake up +the user process after completion of an asynchronous operation.
+When the VM_BIND ioctl is provided with a user/memory fence via the +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion +of binding of that mapping. All async binds/unbinds are serialized, hence +signaling of user/memory fence also indicates the completion of all previous +binds/unbinds.
+This feature will be derived from the below original work: +https://patchwork.freedesktop.org/patch/349417/
This is 1:1 tied to long running compute mode contexts (which in the uapi doc must reference the endless amounts of bikeshed summary we have in the docs about indefinite fences).
I'd put this into a new section about compute and userspace memory fences support, with this and the next chapter ...
+VM_BIND use cases +==================
... and then make this section here focus entirely on additional vm_bind use-cases that we'll be adding later on. Which doesn't need to go into any details, it's just justification for why we want to build the world on top of vm_bind.
+Long running Compute contexts +------------------------------ +Usage of dma-fence expects that they complete in a reasonable amount of time. +Compute on the other hand can be long running. Hence it is appropriate for +compute to use user/memory fence and dma-fence usage will be limited to +in-kernel consumption only. This requires an execbuff uapi extension to pass +in user fence. Compute must opt-in for this mechanism with +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
+The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence +and implicit dependency setting are not allowed on long running contexts.
+Where GPU page faults are not available, kernel driver upon buffer invalidation +will initiate a suspend (preemption) of long running context with a dma-fence +attached to it. And upon completion of that suspend fence, finish the +invalidation, revalidate the BO and then resume the compute context. This is +done by having a per-context fence (called suspend fence) proxying as +i915_request fence. This suspend fence is enabled when there is a wait on it, +which triggers the context preemption.
+This is much easier to support with VM_BIND compared to the current heavier +execbuff path resource attachment.
There's a bunch of tricky code around compute mode context support, like the preempt ctx fence (or suspend fence or whatever you want to call it), and the resume work. And I think that code should be shared across drivers.
I think the right place to put this is into drm/sched, somewhere attached to the drm_sched_entity structure. I expect i915 folks to collaborate with amd and ideally also get amdkfd to adopt the same thing if possible. At least Christian has mentioned in the past that he's a bit unhappy about how this works.
Also drm/sched has dependency tracking, which will be needed to pipeline context resume operations. That needs to be used instead of i915-gem inventing yet another dependency tracking data structure (it already has 3 and that's roughly 3 too many).
This means compute mode support and userspace memory fences are blocked on the drm/sched conversion, but *eh* add it to the list of reasons for why drm/sched needs to happen.
Also since we only have support for compute mode ctx in our internal tree with the guc scheduler backend anyway, and the first conversion target is the guc backend, I don't think this actually holds up a lot of the code.
+Low Latency Submission +----------------------- +Allows compute UMD to directly submit GPU jobs instead of through execbuff +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
This is really just a special case of compute mode contexts, I think I'd include that in there, but explain better what it requires (i.e. vm_bind not being synchronized against execbuf).
+Debugger +--------- +With the debug event interface, a user space process (debugger) is able to keep track +of and act upon resources created by another process (debuggee) and attached +to GPU via vm_bind interface.
+Mesa/Vulkan +------------ +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving +performance. For Vulkan it should be straightforward to use VM_BIND. +For Iris implicit buffer tracking must be implemented before we can harness +VM_BIND benefits. With increasing GPU hardware performance reducing CPU +overhead becomes more important.
Just to clarify, I don't think we can land vm_bind into upstream if it doesn't work 100% for vk. There's a bit much "can" instead of "will" in this section.
+Page level hints settings +-------------------------- +VM_BIND allows any hints setting per mapping instead of per BO. +Possible hints include read-only, placement and atomicity. +Sub-BO level placement hint will be even more relevant with +upcoming GPU on-demand page fault support.
+Page level Cache/CLOS settings +------------------------------- +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+Shared Virtual Memory (SVM) support +------------------------------------ +VM_BIND interface can be used to map system memory directly (without gem BO +abstraction) using the HMM interface.
Userptr is absent here (and it's not the same as svm, at least on discrete), and this is needed for the initial version since otherwise vk can't use it because we're not at feature parity.
IRC discussions by Maarten and Dave came up with the idea that maybe userptr for vm_bind should work _without_ any gem bo as backing storage, since that guarantees that people don't come up with funny ideas like trying to share such a bo across processes or mmap it and other nonsense which just doesn't work.
+Broder i915 cleanups +===================== +Supporting this whole new vm_bind mode of binding which comes with its own +usecases to support and the locking requirements requires proper integration +with the existing i915 driver. This calls for some broader i915 driver +cleanups/simplifications for maintainability of the driver going forward. +Here are few things identified and are being looked into.
+- Make pagetable allocations evictable and manage them similar to VM_BIND + mapped objects. Page table pages are similar to persistent mappings of a + VM (the difference here is that the page table pages will not + have an i915_vma structure and after swapping pages back in, parent page + link needs to be updated).
See above, but I think this should be included as part of the initial vm_bind push.
+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature + does not use it and the complexity it brings in is probably more than the + performance advantage we get in legacy execbuff case.
+- Remove vma->open_count counting +- Remove i915_vma active reference tracking. Instead use underlying BO's + dma-resv fence list to determine if an i915_vma is active or not.
So this is a complete mess, and really should not exist. I think it needs to be removed before we try to make i915_vma even more complex by adding vm_bind.
The other thing I've been pondering here is that vm_bind is really completely different from legacy vm structures for a lot of reasons:
no relocation or softpin handling, which means vm_bind has no reason to ever look at the i915_vma structure in execbuf code. Unfortunately execbuf has been rewritten to be vma instead of obj centric, so it's a 100% mismatch
vm_bind never has to manage any vm lru. Legacy execbuf has to maintain that because the kernel manages the virtual address space fully. Again ideally that entire vma_move_to_active code and everything related to it would simply not exist.
similar on the eviction side, the rules are quite different: For vm_bind we never tear down the vma, instead it's just moved to the list of evicted vma. Legacy vm have no need for all these additional lists, so another huge confusion.
if the refcount is done correctly for vm_bind we wouldn't need the tricky code in the bo close paths. Unfortunately legacy vm with relocations and softpin require that vma are only a weak reference, so that cannot be removed.
there's also a ton of special cases for ggtt handling, like the different views (for display, partial views for mmap), but also the gen2/3 alignment and padding requirements which vm_bind never needs.
I think the right thing here is to massively split the implementation behind some solid vm/vma abstraction, with a base class for vm and vma which _only_ has the pieces which both vm_bind and the legacy vm stuff needs. But it's a bit tricky to get there. I think a workable path would be:
Add a new base class to both i915_address_space and i915_vma, which starts out empty.
As vm_bind code lands, move things that vm_bind code needs into these base classes
The goal should be that these base classes are a stand-alone library that other drivers could reuse. Like we've done with the buddy allocator, which first moved from i915-gem to i915-ttm, and which amd now moved to drm/ttm for reuse by amdgpu. Ideally other drivers interested in adding something like vm_bind should be involved from the start (or maybe the entire thing reused in amdgpu, they're looking at vk sparse binding support too or at least have perf issues I think).
Locking must be the same across all implementations, otherwise it's really not an abstraction. i915 screwed this up terribly by having different locking rules for ppgtt and ggtt, which is just nonsense.
The legacy specific code needs to be extracted as much as possible and shoved into separate files. In execbuf this means we need to get back to object centric flow, and the slowpaths need to become a lot simpler again (Maarten has cleaned up some of this, but there's still a silly amount of hacks in there with funny layering).
I think if stuff like the vma eviction details (list movement and locking and refcounting of the underlying object)
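The "empty base class that grows as vm_bind lands" idea above can be sketched in plain C with structure embedding. This is only an illustration under assumptions: vm_base, vma_base, legacy_vma and the helpers are hypothetical names, not existing i915 structures, and the real split would involve locking and refcounting that is left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Recover the containing structure from a pointer to an embedded member. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical shared base: only state that both vm_bind and legacy need. */
struct vm_base {
	uint64_t total;		/* size of the GPU virtual address space */
};

struct vma_base {
	struct vm_base *vm;	/* address space this mapping lives in */
	uint64_t start;		/* GPU virtual address of the mapping */
	uint64_t size;		/* length of the mapping in bytes */
};

/* Legacy-only state stays in the derived structure, not in the base. */
struct legacy_vma {
	struct vma_base base;
	unsigned int open_count;
};

/* Shared helper: only touches base fields, so any derived type can use it. */
static int vma_base_in_range(const struct vma_base *vma)
{
	return vma->size != 0 &&
	       vma->start <= vma->vm->total &&
	       vma->size <= vma->vm->total - vma->start;
}

/* Legacy code downcasts explicitly when it needs its private state. */
static struct legacy_vma *to_legacy_vma(struct vma_base *vma)
{
	return container_of(vma, struct legacy_vma, base);
}
```

The point of the layout is that vm_bind code would never call to_legacy_vma(), which keeps legacy-only fields out of the shared paths.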
+These can be worked upon after initial vm_bind support is added.
I don't think that works, given how badly i915-gem team screwed up in other places. And those places had to be fixed by adopting shared code like ttm. Plus there's already a huge unfulfilled promise pending with the drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
Cheers, Daniel
+UAPI +===== +Uapi definition can be found here: +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst index 91e93a705230..7d10c36b268d 100644 --- a/Documentation/gpu/rfc/index.rst +++ b/Documentation/gpu/rfc/index.rst @@ -23,3 +23,7 @@ host such documentation: .. toctree::
i915_scheduler.rst
+.. toctree::
- i915_vm_bind.rst
-- 2.21.0.rc0.32.g243a4c7e27
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Apr 20, 2022 at 03:50:00PM -0700, Niranjana Vishwanathapura wrote:
On Thu, Mar 31, 2022 at 01:37:08PM +0200, Daniel Vetter wrote:
One thing I've forgotten, since it's only hinted at here: If/when we switch tlb flushing from the current dumb&synchronous implementation we now have in i915 in upstream to one with batching using dma_fence, then I think that should be something which is done with a small helper library of shared code too. The batching is somewhat tricky, and you need to make sure you put the fence into the right dma_resv_usage slot, and the trick of replacing the vm fence with a tlb flush fence is also a good reason to share the code so we only have it once.
Christian's recent work also has some prep work for this already with the fence replacing trick.
Sure, but this optimization is not required for initial vm_bind support to land right? We can look at it soon after that. Is that ok? I have made a reference to this TLB flush batching work in the rst file.
Yeah for now we can just rely on the tlb flush we do on vma unbinding, which also means there's no need for any separate tlb flushing in vm_bind related code. This was just a thought I dropped on here to make sure we have a complete picture. -Daniel
Niranjana
-Daniel
On Thu, 31 Mar 2022 at 10:28, Daniel Vetter daniel@ffwll.ch wrote:
Adding a pile of people who've expressed interest in vm_bind for their drivers.
Also note to the intel folks: This is largely written with me having my subsystem co-maintainer hat on, i.e. what I think is the right thing to do here for the subsystem at large. There is substantial rework involved here, but it's not any different from i915 adopting ttm or i915 adopting drm/sched, and I do think this stuff needs to happen in one form or another.
On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
VM_BIND design document with description of intended use cases.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++ Documentation/gpu/rfc/index.rst | 4 + 2 files changed, 214 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..cdc6bb25b942 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,210 @@ +========================================== +I915 VM_BIND feature design and use cases +==========================================
+VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BO at specified GPU virtual addresses on +a specified address space (VM).
+These mappings (also referred to as persistent mappings) will be persistent +across multiple GPU submissions (execbuff) issued by the UMD, without user +having to provide a list of all required mappings during each submission +(as required by older execbuff mode).
+VM_BIND ioctl defers binding the mappings until next execbuff submission +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE +flag is set (useful if mapping is required for an active context).
So this is a screw-up I've done, and for upstream I think we need to fix it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and I was wrong suggesting we should do this a few years back when we kicked this off internally :-(
What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few things on top:
- in and out fences, like with execbuf, to allow userspace to sync with execbuf as needed
- for compute-mode context this means userspace memory fences
- for legacy context this means a timeline syncobj in drm_syncobj
No sync_file or anything else like this at all. This means a bunch of work, but also it'll have benefits because it means we should be able to use exactly the same code paths and logic for both compute and for legacy context, because drm_syncobj supports future fence semantics.
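To make the in/out fence idea concrete, here is a toy model of timeline semantics: a wait on a point is satisfied once the timeline has reached or passed that point, and a wait may be set up before anything has signaled it. This is a sketch under assumptions, not the drm_syncobj API; toy_timeline and its helpers are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy timeline: a single monotonically increasing signaled point. */
struct toy_timeline {
	uint64_t signaled_point;
};

/* In-fence check: satisfied once the timeline has reached the point. */
static bool toy_timeline_reached(const struct toy_timeline *tl, uint64_t point)
{
	return tl->signaled_point >= point;
}

/* Out-fence: completing a bind advances the timeline, never backwards. */
static void toy_timeline_signal(struct toy_timeline *tl, uint64_t point)
{
	if (point > tl->signaled_point)
		tl->signaled_point = point;
}
```

In this model a vm_bind would take (timeline, point) as its in-fence and signal another (timeline, point) as its out-fence, which execbuf could then wait on.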
Also on the implementation side we still need to install dma_fence to the various dma_resv, and for this we need the new dma_resv_usage series from Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING flag to make sure they never result in an oversync issue with execbuf. I don't think trying to land vm_bind without that prep work in dma_resv_usage makes sense.
Also as soon as dma_resv_usage has landed there's a few cleanups we should do in i915:
- ttm bo moving code should probably simplify a bit (and maybe more of the code should be pushed as helpers into ttm)
- clflush code should be moved over to using USAGE_KERNEL and the various hacks and special cases should be ditched. See df94fd05e69e ("drm/i915: expand on the kernel-doc for cache_dirty") for a bit more context
This is still not yet enough, since if a vm_bind races with an eviction we might stall on the new buffers being readied first before the context can continue. This needs some care to make sure that vma which aren't fully bound yet are on a separate list, and vma which are marked for unbinding are removed from the main working set list as soon as possible.
All of these things are relevant for the uapi semantics, which means
- they need to be documented in the uapi kerneldoc, ideally with example flows
- umd need to ack this
The other thing here is the async/nonblocking path. I think we still need that one, but again it should not sync with anything going on in execbuf, but simply execute the ioctl code in a kernel thread. The idea here is that this works like a special gpu engine, so that compute and vk can schedule bindings interleaved with rendering. This should be enough to get a performant vk sparse binding/textures implementation.
But I'm not entirely sure on this one, so this definitely needs acks from umds.
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension. +A VM in VM_BIND mode will not support older execbuff mode of binding.
+UMDs can still send BOs of these persistent mappings in execlist of execbuff +for specifying BO dependencies (implicit fencing) and to use BO as a batch, +but those BOs should be mapped ahead via vm_bind ioctl.
should or must?
Also I'm not really sure that's a great interface. The batchbuffer really only needs to be an address, so maybe all we need is an extension to supply an u64 batchbuffer address instead of trying to retrofit this into an unfitting current uapi.
And for implicit sync there's two things:
- for vk I think the right uapi is the dma-buf fence import/export ioctls from Jason Ekstrand. I think we should land that first instead of hacking funny concepts together
- for gl the dma-buf import/export might not be fast enough, since gl needs to do a _lot_ of implicit sync. There we might need to use the execbuffer buffer list, but then we should have extremely clear uapi rules which disallow _everything_ except setting the explicit sync uapi
Again all this stuff needs to be documented in detail in the kerneldoc uapi spec.
+VM_BIND features include, +- Multiple Virtual Address (VA) mappings can map to the same physical pages + of an object (aliasing). +- VA mapping can map to a partial section of the BO (partial binding). +- Support capture of persistent mappings in the dump upon GPU error. +- TLB is flushed upon unbind completion. Batching of TLB flushes in some + usecases will be helpful. +- Asynchronous vm_bind and vm_unbind support. +- VM_BIND uses user/memory fence mechanism for signaling bind completion + and for signaling batch completion in long running contexts (explained + below).
This should all be in the kerneldoc.
+VM_PRIVATE objects +------------------ +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
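The O(1) claim in the quoted section can be illustrated with a toy model in which every VM-private BO points at one reservation object owned by the VM, while each shared BO has its own. This is a sketch under assumptions: toy_resv, toy_bo, toy_vm and toy_submit are hypothetical stand-ins and do not reflect the real dma-resv API or its locking.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FENCES 8

/* Toy stand-in for a dma-resv object: just a small fence list. */
struct toy_resv {
	unsigned int fences[MAX_FENCES];
	size_t count;
};

struct toy_bo {
	struct toy_resv *resv;		/* own_resv, or the VM's shared resv */
	struct toy_resv own_resv;
};

struct toy_vm {
	struct toy_resv private_resv;	/* shared by all VM-private BOs */
};

static void toy_bo_init_shared(struct toy_bo *bo)
{
	bo->own_resv.count = 0;
	bo->resv = &bo->own_resv;
}

static void toy_bo_init_private(struct toy_bo *bo, struct toy_vm *vm)
{
	bo->own_resv.count = 0;
	bo->resv = &vm->private_resv;
}

static int toy_resv_add_fence(struct toy_resv *resv, unsigned int fence)
{
	if (resv->count == MAX_FENCES)
		return -1;
	resv->fences[resv->count++] = fence;
	return 0;
}

/*
 * Add the request fence for one submission. Private BOs cost a single
 * update no matter how many exist; shared BOs cost one update each.
 * Returns the number of fence-list updates, or -1 if a list is full.
 */
static int toy_submit(struct toy_vm *vm, struct toy_bo **shared,
		      size_t nshared, unsigned int fence)
{
	int updates = 0;

	if (toy_resv_add_fence(&vm->private_resv, fence))
		return -1;
	updates++;

	for (size_t i = 0; i < nshared; i++) {
		if (toy_resv_add_fence(shared[i]->resv, fence))
			return -1;
		updates++;
	}
	return updates;
}
```

With three private and two shared BOs, a submission performs three fence-list updates in this model: one for the shared private resv and one per shared BO.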
Two things:
I think the above is required for initial vm_bind for vk, it kinda doesn't make much sense without that, and will allow us to match amdgpu and radeonsi
Christian König just landed ttm bulk lru helpers, and I think we need to use those. This means vm_bind will only work with the ttm backend, but that's what we have for the big dgpu where vm_bind helps more in terms of performance, and the igfx conversion to ttm is already going on.
Furthermore the i915 shrinker lru has stopped being an lru, so I think that should also be moved over to the ttm lru in some fashion to make sure we once again have a reasonable and consistent memory aging and reclaim architecture. The current code is just too much of a complete mess.
And since this is all fairly integral to how the code arch works I don't think merging a different version which isn't based on ttm bulk lru helpers makes sense.
Also I do think the page table lru handling needs to be included here, because that's another complete hand-rolled separate world for not much good reasons. I guess that can happen in parallel with the initial vm_bind bring-up, but it needs to be completed by the time we add the features beyond the initial support needed for vk.
+VM_BIND locking hierarchy +------------------------- +VM_BIND locking order is as below.
+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/ + vm_unbind ioctl calls, in the execbuff path and while releasing the mapping. + In future, when GPU page faults are supported, we can potentially use a + rwsem instead, so that multiple pagefault handlers can take the read side + lock to lookup the mapping and hence can run in parallel.
+2) The BO's dma-resv lock will protect i915_vma state and needs to be held + while binding a vma and while updating dma-resv fence list of a BO. + The private BOs of a VM will all share a dma-resv object. + This lock is held in vm_bind call for immediate binding, during vm_unbind + call for unbinding and during execbuff path for binding the mapping and + updating the dma-resv fence list of the BO.
+3) Spinlock/s to protect some of the VM's lists.
+We will also need support for bulk LRU movement of persistent mapping to +avoid additional latencies in execbuff path.
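The three-level order in the quoted section (vm_bind mutex, then dma-resv lock, then VM list spinlocks) can be expressed as a tiny order checker, in the spirit of what lockdep enforces. This is only a sketch: lock_tracker and the level names are hypothetical and it models ordering only, not real locks.

```c
#include <assert.h>

/* Lock levels in the order they are allowed to be taken. */
enum lock_level {
	LVL_VM_BIND_MUTEX = 1,		/* protects the vm_bind lists */
	LVL_DMA_RESV = 2,		/* protects vma state and fence lists */
	LVL_VM_LIST_SPINLOCK = 3,	/* protects some of the VM's lists */
};

#define MAX_HELD 3

/* Per-task stack of currently held levels. */
struct lock_tracker {
	enum lock_level held[MAX_HELD];
	int depth;
};

/* Returns 0 on success, -1 if taking lvl now would violate the order. */
static int tracker_acquire(struct lock_tracker *t, enum lock_level lvl)
{
	if (t->depth == MAX_HELD)
		return -1;
	if (t->depth > 0 && t->held[t->depth - 1] >= lvl)
		return -1;
	t->held[t->depth++] = lvl;
	return 0;
}

/* Locks must be dropped in reverse order of acquisition. */
static int tracker_release(struct lock_tracker *t, enum lock_level lvl)
{
	if (t->depth == 0 || t->held[t->depth - 1] != lvl)
		return -1;
	t->depth--;
	return 0;
}
```

A path that already holds the dma-resv lock and then tries to take the vm_bind mutex is rejected by tracker_acquire(), which is the inversion the documented order is meant to rule out.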
This needs more detail and explanation of how each level is required. Also the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
Like "some of the VM's lists" explains pretty much nothing.
+GPU page faults +---------------- +Both older execbuff mode and the newer VM_BIND mode of binding will require +using dma-fence to ensure residency. +In future when GPU page faults are supported, no dma-fence usage is required +as residency is purely managed by installing and removing/invalidating ptes.
This is a bit confusing. I think one part of this should be moved into the section with future vm_bind use-cases (we're not going to support page faults with legacy softpin or even worse, relocations). The locking discussion should be part of the much longer list of use cases that motivate the locking design.
+User/Memory Fence +================== +The idea is to take a user specified virtual address and install an interrupt +handler to wake up the current task when the memory location passes the user +supplied filter.
+User/Memory fence is a <address, value> pair. To signal the user fence, +the specified value will be written at the specified virtual address and +the waiting process will be woken up. User can wait on a user fence with the +gem_wait_user_fence ioctl.
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify +interrupt within their batches after updating the value to have sub-batch +precision on the wakeup. Each batch can signal a user fence to indicate +the completion of next level batch. The completion of very first level batch +needs to be signaled by the command streamer. The user must provide the +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE +extension of execbuff ioctl, so that KMD can setup the command streamer to +signal it.
+User/Memory fence can also be supplied to the kernel driver to signal/wake up +the user process after completion of an asynchronous operation.
+When VM_BIND ioctl is provided with a user/memory fence via the +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion +of binding of that mapping. All async binds/unbinds are serialized, hence +signaling of user/memory fence also indicates the completion of all previous +binds/unbinds.
+This feature will be derived from the below original work: +https://patchwork.freedesktop.org/patch/349417/
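The <address, value> pair described in the quoted section can be modelled in userspace with C11 atomics. This is a sketch under assumptions: user_fence and its helpers are hypothetical names, equality is used as the simplest possible filter, and a bounded poll stands in for the interrupt-driven wakeup the patch describes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A user/memory fence: a memory location plus the value that signals it. */
struct user_fence {
	_Atomic uint64_t *addr;
	uint64_t value;
};

/* Signal side: publish the value with release ordering. */
static void user_fence_signal(const struct user_fence *f)
{
	atomic_store_explicit(f->addr, f->value, memory_order_release);
}

/* Filter: here simply "location equals the expected value". */
static bool user_fence_signaled(const struct user_fence *f)
{
	return atomic_load_explicit(f->addr, memory_order_acquire) == f->value;
}

/* Bounded poll; returns true if the fence signaled within max_polls reads. */
static bool user_fence_poll(const struct user_fence *f, unsigned int max_polls)
{
	while (max_polls--) {
		if (user_fence_signaled(f))
			return true;
	}
	return false;
}
```

A real implementation would sleep until woken rather than poll, and the filter could be something other than equality; both are left out to keep the model small.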
This is 1:1 tied to long running compute mode contexts (which in the uapi doc must reference the endless amounts of bikeshed summary we have in the docs about indefinite fences).
I'd put this into a new section about compute and userspace memory fences support, with this and the next chapter ...
+VM_BIND use cases +==================
... and then make this section here focus entirely on additional vm_bind use-cases that we'll be adding later on. Which doesn't need to go into any details, it's just justification for why we want to build the world on top of vm_bind.
+Long running Compute contexts +------------------------------ +Usage of dma-fence expects that they complete in reasonable amount of time. +Compute on the other hand can be long running. Hence it is appropriate for +compute to use user/memory fence and dma-fence usage will be limited to +in-kernel consumption only. This requires an execbuff uapi extension to pass +in user fence. Compute must opt-in for this mechanism with +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
+The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence +and implicit dependency setting is not allowed on long running contexts.
+Where GPU page faults are not available, kernel driver upon buffer invalidation +will initiate a suspend (preemption) of long running context with a dma-fence +attached to it. And upon completion of that suspend fence, finish the +invalidation, revalidate the BO and then resume the compute context. This is +done by having a per-context fence (called suspend fence) proxying as +i915_request fence. This suspend fence is enabled when there is a wait on it, +which triggers the context preemption.
+This is much easier to support with VM_BIND compared to the current heavier +execbuff path resource attachment.
There's a bunch of tricky code around compute mode context support, like the preempt ctx fence (or suspend fence or whatever you want to call it), and the resume work. And I think that code should be shared across drivers.
I think the right place to put this is into drm/sched, somewhere attached to the drm_sched_entity structure. I expect i915 folks to collaborate with amd and ideally also get amdkfd to adopt the same thing if possible. At least Christian has mentioned in the past that he's a bit unhappy about how this works.
Also drm/sched has dependency tracking, which will be needed to pipeline context resume operations. That needs to be used instead of i915-gem inventing yet another dependency tracking data structure (it already has 3 and that's roughly 3 too many).
This means compute mode support and userspace memory fences are blocked on the drm/sched conversion, but *eh* add it to the list of reasons for why drm/sched needs to happen.
Also since we only have support for compute mode ctx in our internal tree with the guc scheduler backend anyway, and the first conversion target is the guc backend, I don't think this actually holds up a lot of the code.
+Low Latency Submission +----------------------- +Allows compute UMD to directly submit GPU jobs instead of through execbuff +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
This is really just a special case of compute mode contexts, I think I'd include that in there, but explain better what it requires (i.e. vm_bind not being synchronized against execbuf).
+Debugger +--------- +With debug event interface user space process (debugger) is able to keep track +of and act upon resources created by another process (debuggee) and attached +to GPU via vm_bind interface.
+Mesa/Vulkan +------------ +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving +performance. For Vulkan it should be straightforward to use VM_BIND. +For Iris implicit buffer tracking must be implemented before we can harness +VM_BIND benefits. With increasing GPU hardware performance reducing CPU +overhead becomes more important.
Just to clarify, I don't think we can land vm_bind into upstream if it doesn't work 100% for vk. There's a bit much "can" instead of "will" in this section.
+Page level hints settings +-------------------------- +VM_BIND allows any hints setting per mapping instead of per BO. +Possible hints include read-only, placement and atomicity. +Sub-BO level placement hint will be even more relevant with +upcoming GPU on-demand page fault support.
+Page level Cache/CLOS settings +------------------------------- +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+Shared Virtual Memory (SVM) support +------------------------------------ +VM_BIND interface can be used to map system memory directly (without gem BO +abstraction) using the HMM interface.
Userptr is absent here (and it's not the same as svm, at least on discrete), and this is needed for the initial version since otherwise vk can't use it because we're not at feature parity.
IRC discussions by Maarten and Dave came up with the idea that maybe userptr for vm_bind should work _without_ any gem bo as backing storage, since that guarantees that people don't come up with funny ideas like trying to share such a bo across processes or mmap it and other nonsense which just doesn't work.
+Broader i915 cleanups +===================== +Supporting this whole new vm_bind mode of binding which comes with its own +usecases to support and the locking requirements requires proper integration +with the existing i915 driver. This calls for some broader i915 driver +cleanups/simplifications for maintainability of the driver going forward. +Here are a few things identified and are being looked into.
+- Make pagetable allocations evictable and manage them similar to VM_BIND + mapped objects. Page table pages are similar to persistent mappings of a + VM (the difference here is that the page table pages will not + have an i915_vma structure and after swapping pages back in, parent page + link needs to be updated).
See above, but I think this should be included as part of the initial vm_bind push.
+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature + does not use it and the complexity it brings in is probably more than the + performance advantage we get in legacy execbuff case.
+- Remove vma->open_count counting +- Remove i915_vma active reference tracking. Instead use underlying BO's + dma-resv fence list to determine if an i915_vma is active or not.
So this is a complete mess, and really should not exist. I think it needs to be removed before we try to make i915_vma even more complex by adding vm_bind.
The other thing I've been pondering here is that vm_bind is really completely different from legacy vm structures for a lot of reasons:
no relocation or softpin handling, which means vm_bind has no reason to ever look at the i915_vma structure in execbuf code. Unfortunately execbuf has been rewritten to be vma instead of obj centric, so it's a 100% mismatch
vm_bind never has to manage any vm lru. Legacy execbuf has to maintain that because the kernel manages the virtual address space fully. Again ideally that entire vma_move_to_active code and everything related to it would simply not exist.
similar on the eviction side, the rules are quite different: For vm_bind we never tear down the vma, instead it's just moved to the list of evicted vma. Legacy vm have no need for all these additional lists, so another huge confusion.
if the refcount is done correctly for vm_bind we wouldn't need the tricky code in the bo close paths. Unfortunately legacy vm with relocations and softpin require that vma are only a weak reference, so that cannot be removed.
there's also a ton of special cases for ggtt handling, like the different views (for display, partial views for mmap), but also the gen2/3 alignment and padding requirements which vm_bind never needs.
I think the right thing here is to massively split the implementation behind some solid vm/vma abstraction, with a base class for vm and vma which _only_ has the pieces which both vm_bind and the legacy vm stuff needs. But it's a bit tricky to get there. I think a workable path would be:
Add a new base class to both i915_address_space and i915_vma, which starts out empty.
As vm_bind code lands, move things that vm_bind code needs into these base classes
The goal should be that these base classes are a stand-alone library that other drivers could reuse. Like we've done with the buddy allocator, which first moved from i915-gem to i915-ttm, and which amd now moved to drm/ttm for reuse by amdgpu. Ideally other drivers interested in adding something like vm_bind should be involved from the start (or maybe the entire thing reused in amdgpu, they're looking at vk sparse binding support too or at least have perf issues I think).
Locking must be the same across all implementations, otherwise it's really not an abstraction. i915 screwed this up terribly by having different locking rules for ppgtt and ggtt, which is just nonsense.
The legacy specific code needs to be extracted as much as possible and shoved into separate files. In execbuf this means we need to get back to object centric flow, and the slowpaths need to become a lot simpler again (Maarten has cleaned up some of this, but there's still a silly amount of hacks in there with funny layering).
I think if stuff like the vma eviction details (list movement and locking and refcounting of the underlying object)
+These can be worked upon after initial vm_bind support is added.
I don't think that works, given how badly i915-gem team screwed up in other places. And those places had to be fixed by adopting shared code like ttm. Plus there's already a huge unfulfilled promise pending with the drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
Cheers, Daniel
+UAPI +===== +Uapi definition can be found here: +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst index 91e93a705230..7d10c36b268d 100644 --- a/Documentation/gpu/rfc/index.rst +++ b/Documentation/gpu/rfc/index.rst @@ -23,3 +23,7 @@ host such documentation: .. toctree::
i915_scheduler.rst
+.. toctree::
- i915_vm_bind.rst
-- 2.21.0.rc0.32.g243a4c7e27
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Thu, Mar 31, 2022 at 10:28:48AM +0200, Daniel Vetter wrote:
Adding a pile of people who've expressed interest in vm_bind for their drivers.
Also note to the intel folks: This is largely written with me having my subsystem co-maintainer hat on, i.e. what I think is the right thing to do here for the subsystem at large. There is substantial rework involved here, but it's not any different from i915 adopting ttm or i915 adopting drm/sched, and I do think this stuff needs to happen in one form or another.
On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
VM_BIND design document with description of intended use cases.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++ Documentation/gpu/rfc/index.rst | 4 + 2 files changed, 214 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..cdc6bb25b942 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,210 @@ +========================================== +I915 VM_BIND feature design and use cases +==========================================
+VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BO at specified GPU virtual addresses on +a specified address space (VM).
+These mappings (also referred to as persistent mappings) will be persistent +across multiple GPU submissions (execbuff) issued by the UMD, without user +having to provide a list of all required mappings during each submission +(as required by older execbuff mode).
+VM_BIND ioctl defers binding the mappings until next execbuff submission +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE +flag is set (useful if mapping is required for an active context).
So this is a screw-up I've done, and for upstream I think we need to fix it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and I was wrong suggesting we should do this a few years back when we kicked this off internally :-(
What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few things on top:
- in and out fences, like with execbuf, to allow userspace to sync with
execbuf as needed
- for compute-mode context this means userspace memory fences
- for legacy context this means a timeline syncobj in drm_syncobj
No sync_file or anything else like this at all. This means a bunch of work, but also it'll have benefits because it means we should be able to use exactly the same code paths and logic for both compute and for legacy context, because drm_syncobj supports future fence semantics.
Thanks Daniel, Ok, will update
Also on the implementation side we still need to install dma_fence to the various dma_resv, and for this we need the new dma_resv_usage series from Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING flag to make sure they never result in an oversync issue with execbuf. I don't think trying to land vm_bind without that prep work in dma_resv_usage makes sense.
Ok, but that is not a dependency for this VM_BIND design RFC patch right? I will add this to the documentation here.
Also as soon as dma_resv_usage has landed there's a few cleanups we should do in i915:
- ttm bo moving code should probably simplify a bit (and maybe more of the code should be pushed as helpers into ttm)
- clflush code should be moved over to using USAGE_KERNEL and the various hacks and special cases should be ditched. See df94fd05e69e ("drm/i915: expand on the kernel-doc for cache_dirty") for a bit more context
This is still not yet enough, since if a vm_bind races with an eviction we might stall on the new buffers being readied first before the context can continue. This needs some care to make sure that vma which aren't fully bound yet are on a separate list, and vma which are marked for unbinding are removed from the main working set list as soon as possible.
All of these things are relevant for the uapi semantics, which means
- they need to be documented in the uapi kerneldoc, ideally with example flows
- UMDs need to ack this
Ok
The other thing here is the async/nonblocking path. I think we still need that one, but again it should not sync with anything going on in execbuf, but simply execute the ioctl code in a kernel thread. The idea here is that this works like a special gpu engine, so that compute and vk can schedule bindings interleaved with rendering. This should be enough to get a performant vk sparse binding/textures implementation.
But I'm not entirely sure on this one, so this definitely needs acks from umds.
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension. +A VM in VM_BIND mode will not support older execbuff mode of binding.
+UMDs can still send BOs of these persistent mappings in execlist of execbuff +for specifying BO dependencies (implicit fencing) and to use BO as a batch, +but those BOs should be mapped ahead via vm_bind ioctl.
should or must?
Must, will fix.
Also I'm not really sure that's a great interface. The batchbuffer really only needs to be an address, so maybe all we need is an extension to supply an u64 batchbuffer address instead of trying to retrofit this into an unfitting current uapi.
Yes, this was considered, but we decided to do it as a later optimization. But if we were to remove execlist entries completely (i.e., no implicit sync either), then we need to do this from the beginning.
And for implicit sync there's two things:
- for vk I think the right uapi is the dma-buf fence import/export ioctls from Jason Ekstrand. I think we should land that first instead of hacking funny concepts together
I did not understand fully, can you point to it?
- for gl the dma-buf import/export might not be fast enough, since gl needs to do a _lot_ of implicit sync. There we might need to use the execbuffer buffer list, but then we should have extremely clear uapi rules which disallow _everything_ except setting the explicit sync uapi
Ok, so then, we still need to support implicit sync in vm_bind mode. Right?
Again all this stuff needs to be documented in detail in the kerneldoc uapi spec.
ok
+VM_BIND features include,
+- Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+- VA mapping can map to a partial section of the BO (partial binding).
+- Support capture of persistent mappings in the dump upon GPU error.
+- TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  usecases will be helpful.
+- Asynchronous vm_bind and vm_unbind support.
+- VM_BIND uses user/memory fence mechanism for signaling bind completion
+  and for signaling batch completion in long running contexts (explained
+  below).
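[Editorial illustration] The aliasing and partial-binding items in the list above can be pictured with a small userspace model. This is only a sketch: `struct vm_mapping` and `vm_translate()` are hypothetical names made up for the example and are not part of the proposed i915 uapi.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one persistent mapping; not the i915 uapi struct. */
struct vm_mapping {
    uint64_t va;        /* GPU virtual address where the mapping starts */
    uint32_t bo;        /* handle of the backing buffer object */
    uint64_t bo_offset; /* offset into the BO (non-zero for partial binding) */
    uint64_t length;    /* size of the mapping in bytes */
};

/*
 * Translate a GPU VA into (BO handle, offset within BO).
 * Returns true and fills the outputs when some mapping covers the address.
 */
static bool vm_translate(const struct vm_mapping *maps, size_t count,
                         uint64_t va, uint32_t *bo, uint64_t *offset)
{
    assert(maps != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        const struct vm_mapping *m = &maps[i];

        if (va >= m->va && va - m->va < m->length) {
            *bo = m->bo;
            *offset = m->bo_offset + (va - m->va);
            return true;
        }
    }
    return false;
}
```

With two mappings of the same BO, one covering the whole object and one covering only a section of it, two different VAs can resolve to the same BO offset (aliasing), while the second mapping exposes only part of the BO (partial binding).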
This should all be in the kerneldoc.
ok
+VM_PRIVATE objects +------------------ +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
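[Editorial illustration] The O(1) claim above can be sketched with a toy model in which every VM-private BO points at one reservation object owned by the VM. All type and function names here (`resv_model`, `bo_model`, `vm_model`, `vm_submit`) are hypothetical stand-ins, not i915 or dma-resv code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a dma-resv object: records the last fence added. */
struct resv_model {
    uint64_t last_fence;
    unsigned int updates; /* how many times a fence was added */
};

/* A VM-private BO points at the VM's resv, a shared BO at its own. */
struct bo_model {
    struct resv_model *resv;
};

struct vm_model {
    struct resv_model private_resv; /* shared by every VM-private BO */
    struct bo_model **shared_bos;   /* shared BOs currently mapped on the VM */
    size_t num_shared;
};

static void resv_add_fence(struct resv_model *resv, uint64_t fence)
{
    assert(resv != NULL);
    resv->last_fence = fence;
    resv->updates++;
}

/* Make a BO private to a VM: it reuses the VM's reservation object. */
static void bo_init_private(struct bo_model *bo, struct vm_model *vm)
{
    bo->resv = &vm->private_resv;
}

/*
 * Attach the request fence for one submission. Returns the number of
 * reservation objects touched: 1 for all private BOs plus one per shared BO.
 */
static size_t vm_submit(struct vm_model *vm, uint64_t fence)
{
    size_t touched = 1;

    resv_add_fence(&vm->private_resv, fence);
    for (size_t i = 0; i < vm->num_shared; i++) {
        resv_add_fence(vm->shared_bos[i]->resv, fence);
        touched++;
    }
    return touched;
}
```

In this model the cost of a submission grows with the number of shared BOs only; adding more private BOs does not add work, which is the property the RFC text describes.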
Two things:
- I think the above is required for initial vm_bind for vk, it kinda doesn't make much sense without that, and will allow us to match amdgpu and radeonsi
- Christian König just landed ttm bulk lru helpers, and I think we need to use those. This means vm_bind will only work with the ttm backend, but that's what we have for the big dgpu where vm_bind helps more in terms of performance, and the igfx conversion to ttm is already going on.
ok
Furthermore the i915 shrinker lru has stopped being an lru, so I think that should also be moved over to the ttm lru in some fashion to make sure we once again have a reasonable and consistent memory aging and reclaim architecture. The current code is just too much of a complete mess.
And since this is all fairly integral to how the code arch works I don't think merging a different version which isn't based on ttm bulk lru helpers makes sense.
Also I do think the page table lru handling needs to be included here, because that's another complete hand-rolled separate world for no good reason. I guess that can happen in parallel with the initial vm_bind bring-up, but it needs to be completed by the time we add the features beyond the initial support needed for vk.
Ok
+VM_BIND locking hierarchy
+-------------------------
+VM_BIND locking order is as below.
+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
+   vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
+   In future, when GPU page faults are supported, we can potentially use a
+   rwsem instead, so that multiple pagefault handlers can take the read side
+   lock to lookup the mapping and hence can run in parallel.
+2) The BO's dma-resv lock will protect i915_vma state and needs to be held
+   while binding a vma and while updating dma-resv fence list of a BO.
+   The private BOs of a VM will all share a dma-resv object.
+   This lock is held in vm_bind call for immediate binding, during vm_unbind
+   call for unbinding and during execbuff path for binding the mapping and
+   updating the dma-resv fence list of the BO.
+3) Spinlock/s to protect some of the VM's lists.
+We will also need support for bulk LRU movement of persistent mapping to
+avoid additional latencies in execbuff path.
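[Editorial illustration] The three-level order above can be expressed as a tiny lockdep-style checker. This is a simplified sketch under stated assumptions: the enum and helper names are invented for the example, and the model allows only one lock per level, whereas real dma-resv locking can hold several reservation objects at once.

```c
#include <assert.h>
#include <stdbool.h>

/* Lock levels in the order the RFC lists them; names are illustrative. */
enum vm_bind_lock {
    LOCK_VM_BIND_MUTEX = 0, /* 1) protects the vm_bind lists */
    LOCK_DMA_RESV      = 1, /* 2) protects i915_vma state and fence lists */
    LOCK_VM_LIST_SPIN  = 2, /* 3) protects some of the VM's lists */
    LOCK_COUNT
};

struct lock_tracker {
    bool held[LOCK_COUNT];
};

/*
 * Try to take a lock. Fails (returns false) if this lock, or any lock that
 * comes later in the hierarchy, is already held, since that would invert
 * the documented order.
 */
static bool tracker_acquire(struct lock_tracker *t, enum vm_bind_lock lock)
{
    int idx = (int)lock;

    assert(idx >= 0 && idx < LOCK_COUNT);
    for (int l = idx; l < LOCK_COUNT; l++) {
        if (t->held[l])
            return false;
    }
    t->held[idx] = true;
    return true;
}

static void tracker_release(struct lock_tracker *t, enum vm_bind_lock lock)
{
    int idx = (int)lock;

    assert(idx >= 0 && idx < LOCK_COUNT && t->held[idx]);
    t->held[idx] = false;
}
```

Taking the vm_bind mutex, then the dma-resv lock, then a list spinlock succeeds in this model; trying to take the vm_bind mutex while already holding the dma-resv lock is rejected as an inversion.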
This needs more detail and explanation of how each level is required. Also the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
Like "some of the VM's lists" explains pretty much nothing.
Ok, will explain.
+GPU page faults +---------------- +Both older execbuff mode and the newer VM_BIND mode of binding will require +using dma-fence to ensure residency. +In future when GPU page faults are supported, no dma-fence usage is required +as residency is purely managed by installing and removing/invalidating ptes.
This is a bit confusing. I think one part of this should be moved into the section with future vm_bind use-cases (we're not going to support page faults with legacy softpin or even worse, relocations). The locking discussion should be part of the much longer list of use cases that motivate the locking design.
Ok, will move.
+User/Memory Fence +================== +The idea is to take a user specified virtual address and install an interrupt +handler to wake up the current task when the memory location passes the user +supplied filter.
+User/Memory fence is a <address, value> pair. To signal the user fence, +specified value will be written at the specified virtual address and +wakeup the waiting process. User can wait on an user fence with the +gem_wait_user_fence ioctl.
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify +interrupt within their batches after updating the value to have sub-batch +precision on the wakeup. Each batch can signal an user fence to indicate +the completion of next level batch. The completion of very first level batch +needs to be signaled by the command streamer. The user must provide the +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE +extension of execbuff ioctl, so that KMD can setup the command streamer to +signal it.
+User/Memory fence can also be supplied to the kernel driver to signal/wake up +the user process after completion of an asynchronous operation.
+When VM_BIND ioctl was provided with a user/memory fence via the +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion +of binding of that mapping. All async binds/unbinds are serialized, hence +signaling of user/memory fence also indicate the completion of all previous +binds/unbinds.
+This feature will be derived from the below original work: +https://patchwork.freedesktop.org/patch/349417/
This is 1:1 tied to long running compute mode contexts (which in the uapi doc must reference the endless amounts of bikeshed summary we have in the docs about indefinite fences).
Ok, will check and add reference.
I'd put this into a new section about compute and userspace memory fences support, with this and the next chapter ...
ok
+VM_BIND use cases +==================
... and then make this section here focus entirely on additional vm_bind use-cases that we'll be adding later on. Which doesn't need to go into any details, it's just justification for why we want to build the world on top of vm_bind.
ok
+Long running Compute contexts +------------------------------ +Usage of dma-fence expects that they complete in reasonable amount of time. +Compute on the other hand can be long running. Hence it is appropriate for +compute to use user/memory fence and dma-fence usage will be limited to +in-kernel consumption only. This requires an execbuff uapi extension to pass +in user fence. Compute must opt-in for this mechanism with +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
+The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence +and implicit dependency setting is not allowed on long running contexts.
+Where GPU page faults are not available, kernel driver upon buffer invalidation +will initiate a suspend (preemption) of long running context with a dma-fence +attached to it. And upon completion of that suspend fence, finish the +invalidation, revalidate the BO and then resume the compute context. This is +done by having a per-context fence (called suspend fence) proxying as +i915_request fence. This suspend fence is enabled when there is a wait on it, +which triggers the context preemption.
+This is much easier to support with VM_BIND compared to the current heavier +execbuff path resource attachment.
There's a bunch of tricky code around compute mode context support, like the preempt ctx fence (or suspend fence or whatever you want to call it), and the resume work. And I think that code should be shared across drivers.
I think the right place to put this is into drm/sched, somewhere attached to the drm_sched_entity structure. I expect i915 folks to collaborate with amd and ideally also get amdkfd to adopt the same thing if possible. At least Christian has mentioned in the past that he's a bit unhappy about how this works.
Also drm/sched has dependency tracking, which will be needed to pipeline context resume operations. That needs to be used instead of i915-gem inventing yet another dependency tracking data structure (it already has 3 and that's roughly 3 too many).
This means compute mode support and userspace memory fences are blocked on the drm/sched conversion, but *eh* add it to the list of reasons for why drm/sched needs to happen.
Also since we only have support for compute mode ctx in our internal tree with the guc scheduler backend anyway, and the first conversion target is the guc backend, I don't think this actually holds up a lot of the code.
Hmm...ok. Currently, the context suspend and resume operations in our internal tree go through an orthogonal guc interface (not through the scheduler). So, I need to look more into this part.
+Low Latency Submission +----------------------- +Allows compute UMD to directly submit GPU jobs instead of through execbuff +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
This is really just a special case of compute mode contexts, I think I'd include that in there, but explain better what it requires (i.e. vm_bind not being synchronized against execbuf).
ok
+Debugger +--------- +With debug event interface user space process (debugger) is able to keep track +of and act upon resources created by another process (debuggee) and attached +to GPU via vm_bind interface.
+Mesa/Vulkan +------------ +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving +performance. For Vulkan it should be straightforward to use VM_BIND. +For Iris implicit buffer tracking must be implemented before we can harness +VM_BIND benefits. With increasing GPU hardware performance reducing CPU +overhead becomes more important.
Just to clarify, I don't think we can land vm_bind into upstream if it doesn't work 100% for vk. There's a bit much "can" instead of "will" in this section.
ok, will explain better.
+Page level hints settings +-------------------------- +VM_BIND allows any hints setting per mapping instead of per BO. +Possible hints include read-only, placement and atomicity. +Sub-BO level placement hint will be even more relevant with +upcoming GPU on-demand page fault support.
+Page level Cache/CLOS settings +------------------------------- +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+Shared Virtual Memory (SVM) support +------------------------------------ +VM_BIND interface can be used to map system memory directly (without gem BO +abstraction) using the HMM interface.
Userptr is absent here (and it's not the same as svm, at least on discrete), and this is needed for the initial version since otherwise vk can't use it because we're not at feature parity.
userptr gem objects are supported in the initial version (and yes, it is not the same as SVM). I did not add it here as there is no additional uapi change required to support that.
Irc discussions by Maarten and Dave came up with the idea that maybe userptr for vm_bind should work _without_ any gem bo as backing storage, since that guarantees that people don't come up with funny ideas like trying to share such bo across process or mmap it and other nonsense which just doesn't work.
Hmm...there is no plan to support userptr _without_ gem bo, at least not in the initial vm_bind support. Is it OK to put it in the 'futures' section?
+Broader i915 cleanups
+=====================
+Supporting this whole new vm_bind mode of binding which comes with its own
+usecases to support and the locking requirements requires proper integration
+with the existing i915 driver. This calls for some broader i915 driver
+cleanups/simplifications for maintainability of the driver going forward.
+Here are few things identified and are being looked into.
+- Make pagetable allocations evictable and manage them similar to VM_BIND
+  mapped objects. Page table pages are similar to persistent mappings of a
+  VM (difference here are that the page table pages will not
+  have an i915_vma structure and after swapping pages back in, parent page
+  link needs to be updated).
See above, but I think this should be included as part of the initial vm_bind push.
Ok, as you mentioned above, we can do it soon after initial vm_bind support lands, but before we add any new vm_bind features.
+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
+  does not use it and complexity it brings in is probably more than the
+  performance advantage we get in legacy execbuff case.
+- Remove vma->open_count counting
+- Remove i915_vma active reference tracking. Instead use underlying BO's
+  dma-resv fence list to determine if a i915_vma is active or not.
So this is a complete mess, and really should not exist. I think it needs to be removed before we try to make i915_vma even more complex by adding vm_bind.
Hmm...Need to look into this. I am not sure how much of an effort it is going to be to remove i915_vma active reference tracking and instead use dma_resv fences for activeness tracking.
The other thing I've been pondering here is that vm_bind is really completely different from legacy vm structures for a lot of reasons:
- no relocation or softpin handling, which means vm_bind has no reason to ever look at the i915_vma structure in execbuf code. Unfortunately execbuf has been rewritten to be vma instead of obj centric, so it's a 100% mismatch
- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain that because the kernel manages the virtual address space fully. Again ideally that entire vma_move_to_active code and everything related to it would simply not exist.
- similar on the eviction side, the rules are quite different: For vm_bind we never tear down the vma, instead it's just moved to the list of evicted vma. Legacy vm have no need for all these additional lists, so another huge confusion.
- if the refcount is done correctly for vm_bind we wouldn't need the tricky code in the bo close paths. Unfortunately legacy vm with relocations and softpin require that vma are only a weak reference, so that cannot be removed.
- there's also a ton of special cases for ggtt handling, like the different views (for display, partial views for mmap), but also the gen2/3 alignment and padding requirements which vm_bind never needs.
I think the right thing here is to massively split the implementation behind some solid vm/vma abstraction, with a base class for vm and vma which _only_ has the pieces which both vm_bind and the legacy vm stuff needs. But it's a bit tricky to get there. I think a workable path would be:
- Add a new base class to both i915_address_space and i915_vma, which starts out empty.
- As vm_bind code lands, move things that vm_bind code needs into these base classes
Ok
- The goal should be that these base classes are a stand-alone library that other drivers could reuse. Like we've done with the buddy allocator, which first moved from i915-gem to i915-ttm, and which amd now moved to drm/ttm for reuse by amdgpu. Ideally other drivers interested in adding something like vm_bind should be involved from the start (or maybe the entire thing reused in amdgpu, they're looking at vk sparse binding support too or at least have perf issues I think).
- Locking must be the same across all implementations, otherwise it's really not an abstraction. i915 screwed this up terribly by having different locking rules for ppgtt and ggtt, which is just nonsense.
- The legacy specific code needs to be extracted as much as possible and shoved into separate files. In execbuf this means we need to get back to object centric flow, and the slowpaths need to become a lot simpler again (Maarten has cleaned up some of this, but there's still a silly amount of hacks in there with funny layering).
This also, we can do soon after vm_bind code lands right?
- I think if stuff like the vma eviction details (list movement and
locking and refcounting of the underlying object)
+These can be worked upon after initial vm_bind support is added.
I don't think that works, given how badly i915-gem team screwed up in other places. And those places had to be fixed by adopting shared code like ttm. Plus there's already a huge unfulfilled promise pending with the drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
Hmmm ok. As I mentioned above, I need to look into how to remove i915_vma active reference tracking code from i915 driver. Wonder if there is any middle ground here like not using that in vm_bind mode?
Niranjana
Cheers, Daniel
+UAPI
+=====
+Uapi definition can be found here:
+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..7d10c36b268d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::
     i915_scheduler.rst
+.. toctree::
+    i915_vm_bind.rst
-- 2.21.0.rc0.32.g243a4c7e27
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Apr 20, 2022 at 03:45:25PM -0700, Niranjana Vishwanathapura wrote:
On Thu, Mar 31, 2022 at 10:28:48AM +0200, Daniel Vetter wrote:
Adding a pile of people who've expressed interest in vm_bind for their drivers.
Also note to the intel folks: This is largely written with me having my subsystem co-maintainer hat on, i.e. what I think is the right thing to do here for the subsystem at large. There is substantial rework involved here, but it's not any different from i915 adopting ttm or i915 adopting drm/sched, and I do think this stuff needs to happen in one form or another.
On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
VM_BIND design document with description of intended use cases.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++ Documentation/gpu/rfc/index.rst | 4 + 2 files changed, 214 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..cdc6bb25b942 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,210 @@ +========================================== +I915 VM_BIND feature design and use cases +==========================================
+VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BOs at specified GPU virtual addresses on +a specified address space (VM).
+These mappings (also referred to as persistent mappings) will be persistent +across multiple GPU submissions (execbuff) issued by the UMD, without user +having to provide a list of all required mappings during each submission +(as required by older execbuff mode).
+VM_BIND ioctl defers binding the mappings until next execbuff submission +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE +flag is set (useful if mapping is required for an active context).
So this is a screw-up I've done, and for upstream I think we need to fix it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and I was wrong suggesting we should do this a few years back when we kicked this off internally :-(
What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few things on top:
- in and out fences, like with execbuf, to allow userspace to sync with execbuf as needed
  - for compute-mode context this means userspace memory fences
  - for legacy context this means a timeline syncobj in drm_syncobj
No sync_file or anything else like this at all. This means a bunch of work, but also it'll have benefits because it means we should be able to use exactly the same code paths and logic for both compute and for legacy context, because drm_syncobj supports future fence semantics.
Thanks Daniel, Ok, will update
I had a long conversation with Daniel on some of the points discussed here. Thanks to Daniel for clarifying many points here.
Here is the summary of the discussion.
1) A prep patch is needed to update documentation of some existing uapi and this new VM_BIND uapi can update/refer to that. I will include this prep patch in the next revision of this RFC series. Will also include the uapi header file in the rst file so that it gets rendered.
2) Will update documentation here with proper use of dma_resv_usage while adding fences to vm_bind objects. It is going to be DMA_RESV_USAGE_BOOKKEEP by default, unless overridden via the execlist in the execbuff path.
3) Add extension to execbuff ioctl to specify batch buffer as GPU virtual address instead of having to pass it as a BO handle in execlist. This will also make the execlist usage solely for implicit sync setting which is further discussed below.
4) Need to look into when Jason's dma-buf fence import/export ioctl support will land and whether it will be used for both vk and gl. Need to sync with Jason on this. Probably the better option here would be to not support execlist in execbuff path in vm_bind mode for initial vm_bind support (hoping Jason's dma-buf fence import/export ioctl will be enough). We can add support for execlist in execbuff for vm_bind mode later if required (say for gl).
5) There are a lot of things in the execbuff path that don't apply in VM_BIND mode (like relocations, implicit sync etc). Separate them out by using function pointers wherever the functionality differs between current design and the newer VM_BIND design.
6) Separate out i915_vma active reference counting in execbuff path and do not use it in VM_BIND mode. Instead use dma-resv fence checking for VM_BIND mode. This should be easier to get it working with the current TTM backend (which initial VM_BIND support will use). And remove i915_vma active reference counting fully while supporting TTM backend for igfx.
7) As we support compute mode contexts only with GuC scheduler backend and compute mode requires support for suspend and resume of contexts, it will have a dependency on i915 drm scheduler conversion.
Will revise this series accordingly.
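[Editorial illustration] Point 2 of the summary relies on the ordering of dma_resv usage levels: a waiter that queries at a given level sees every fence at or below that level, so bookkeeping fences stay invisible to implicit-sync readers and writers. The sketch below models that rule only; the enum mirrors the ordering of the kernel's `enum dma_resv_usage`, but `usage_model`, `fence_model` and `fences_to_wait()` are hypothetical names and this is not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the ordering of the kernel's enum dma_resv_usage; a model only. */
enum usage_model {
    USAGE_KERNEL,
    USAGE_WRITE,
    USAGE_READ,
    USAGE_BOOKKEEP
};

struct fence_model {
    enum usage_model usage;
};

/*
 * Count the fences a waiter querying at level `query` has to wait for:
 * every fence whose usage is at or below the queried level.
 */
static size_t fences_to_wait(const struct fence_model *fences, size_t count,
                             enum usage_model query)
{
    size_t n = 0;

    assert(fences != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (fences[i].usage <= query)
            n++;
    }
    return n;
}
```

In this model an implicit-sync query at the read or write level skips fences added at the bookkeeping level, which is why using that level for vm_bind fences avoids oversync with execbuf.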
Thanks, Niranjana
Also on the implementation side we still need to install dma_fence to the various dma_resv, and for this we need the new dma_resv_usage series from Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING flag to make sure they never result in an oversync issue with execbuf. I don't think trying to land vm_bind without that prep work in dma_resv_usage makes sense.
Ok, but that is not a dependency for this VM_BIND design RFC patch right? I will add this to the documentation here.
Also as soon as dma_resv_usage has landed there's a few cleanups we should do in i915:
- ttm bo moving code should probably simplify a bit (and maybe more of the code should be pushed as helpers into ttm)
- clflush code should be moved over to using USAGE_KERNEL and the various hacks and special cases should be ditched. See df94fd05e69e ("drm/i915: expand on the kernel-doc for cache_dirty") for a bit more context
This is still not yet enough, since if a vm_bind races with an eviction we might stall on the new buffers being readied first before the context can continue. This needs some care to make sure that vma which aren't fully bound yet are on a separate list, and vma which are marked for unbinding are removed from the main working set list as soon as possible.
All of these things are relevant for the uapi semantics, which means
- they need to be documented in the uapi kerneldoc, ideally with example flows
- UMDs need to ack this
Ok
The other thing here is the async/nonblocking path. I think we still need that one, but again it should not sync with anything going on in execbuf, but simply execute the ioctl code in a kernel thread. The idea here is that this works like a special gpu engine, so that compute and vk can schedule bindings interleaved with rendering. This should be enough to get a performant vk sparse binding/textures implementation.
But I'm not entirely sure on this one, so this definitely needs acks from umds.
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension. +A VM in VM_BIND mode will not support older execbuff mode of binding.
+UMDs can still send BOs of these persistent mappings in execlist of execbuff +for specifying BO dependencies (implicit fencing) and to use BO as a batch, +but those BOs should be mapped ahead via vm_bind ioctl.
should or must?
Must, will fix.
Also I'm not really sure that's a great interface. The batchbuffer really only needs to be an address, so maybe all we need is an extension to supply an u64 batchbuffer address instead of trying to retrofit this into an unfitting current uapi.
Yes, this was considered, but we decided to do it as a later optimization. But if we were to remove execlist entries completely (i.e., no implicit sync either), then we need to do this from the beginning.
And for implicit sync there's two things:
- for vk I think the right uapi is the dma-buf fence import/export ioctls from Jason Ekstrand. I think we should land that first instead of hacking funny concepts together
I did not understand fully, can you point to it?
- for gl the dma-buf import/export might not be fast enough, since gl needs to do a _lot_ of implicit sync. There we might need to use the execbuffer buffer list, but then we should have extremely clear uapi rules which disallow _everything_ except setting the explicit sync uapi
Ok, so then, we still need to support implicit sync in vm_bind mode. Right?
Again all this stuff needs to be documented in detail in the kerneldoc uapi spec.
ok
+VM_BIND features include,
+- Multiple Virtual Address (VA) mappings can map to the same physical pages
+  of an object (aliasing).
+- VA mapping can map to a partial section of the BO (partial binding).
+- Support capture of persistent mappings in the dump upon GPU error.
+- TLB is flushed upon unbind completion. Batching of TLB flushes in some
+  usecases will be helpful.
+- Asynchronous vm_bind and vm_unbind support.
+- VM_BIND uses user/memory fence mechanism for signaling bind completion
+  and for signaling batch completion in long running contexts (explained
+  below).
This should all be in the kerneldoc.
ok
+VM_PRIVATE objects +------------------ +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
Two things:
- I think the above is required for initial vm_bind for vk, it kinda doesn't make much sense without that, and will allow us to match amdgpu and radeonsi
- Christian König just landed ttm bulk lru helpers, and I think we need to use those. This means vm_bind will only work with the ttm backend, but that's what we have for the big dgpu where vm_bind helps more in terms of performance, and the igfx conversion to ttm is already going on.
ok
Furthermore the i915 shrinker lru has stopped being an lru, so I think that should also be moved over to the ttm lru in some fashion to make sure we once again have a reasonable and consistent memory aging and reclaim architecture. The current code is just too much of a complete mess.
And since this is all fairly integral to how the code arch works I don't think merging a different version which isn't based on ttm bulk lru helpers makes sense.
Also I do think the page table lru handling needs to be included here, because that's another complete hand-rolled separate world for no good reason. I guess that can happen in parallel with the initial vm_bind bring-up, but it needs to be completed by the time we add the features beyond the initial support needed for vk.
Ok
+VM_BIND locking hirarchy +------------------------- +VM_BIND locking order is as below.
+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
- vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
- In future, when GPU page faults are supported, we can potentially use a
- rwsem instead, so that multiple pagefault handlers can take the read side
- lock to lookup the mapping and hence can run in parallel.
+2) The BO's dma-resv lock will protect i915_vma state and needs to be held
- while binding a vma and while updating dma-resv fence list of a BO.
- The private BOs of a VM will all share a dma-resv object.
- This lock is held in vm_bind call for immediate binding, during vm_unbind
- call for unbinding and during execbuff path for binding the mapping and
- updating the dma-resv fence list of the BO.
+3) Spinlock/s to protect some of the VM's lists.
+We will also need support for bulk LRU movement of persistent mappings to +avoid additional latencies in execbuff path.
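The ordering above (vm_bind mutex, then dma-resv lock, then the VM list spinlocks) can be illustrated with a small rank checker in plain C. This is only a sketch of the ordering rule, loosely in the spirit of lockdep; the `lock_rank` values and the `tracker_*` helpers are hypothetical and do not exist in i915.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ranks: a lock may only be taken while holding lower ranks. */
enum lock_rank {
	RANK_NONE = 0,
	RANK_VM_BIND_MUTEX = 1,		/* 1) protects the vm_bind lists */
	RANK_DMA_RESV = 2,		/* 2) protects vma state and fence lists */
	RANK_VM_LIST_SPINLOCK = 3,	/* 3) protects some of the VM's lists */
};

#define MAX_HELD 3

struct lock_tracker {
	enum lock_rank held[MAX_HELD];
	int depth;
};

/* Returns false when taking 'rank' now would violate the documented order. */
static bool tracker_acquire(struct lock_tracker *t, enum lock_rank rank)
{
	enum lock_rank top = t->depth ? t->held[t->depth - 1] : RANK_NONE;

	if (rank <= top || t->depth == MAX_HELD)
		return false;

	t->held[t->depth++] = rank;
	return true;
}

/* Locks are released in the reverse order of acquisition. */
static void tracker_release(struct lock_tracker *t, enum lock_rank rank)
{
	assert(t->depth > 0 && t->held[t->depth - 1] == rank);
	t->depth--;
}
```

Skipping a level is allowed in this model (for example taking only the dma-resv lock), but taking the vm_bind mutex while already holding a dma-resv lock is rejected.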
This needs more detail and explanation of how each level is required. Also the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
Like "some of the VM's lists" explains pretty much nothing.
Ok, will explain.
+GPU page faults +---------------- +Both older execbuff mode and the newer VM_BIND mode of binding will require +using dma-fence to ensure residency. +In future when GPU page faults are supported, no dma-fence usage is required +as residency is purely managed by installing and removing/invalidating ptes.
This is a bit confusing. I think one part of this should be moved into the section with future vm_bind use-cases (we're not going to support page faults with legacy softpin or even worse, relocations). The locking discussion should be part of the much longer list of use cases that motivate the locking design.
Ok, will move.
+User/Memory Fence +================== +The idea is to take a user specified virtual address and install an interrupt +handler to wake up the current task when the memory location passes the user +supplied filter.
+User/Memory fence is a <address, value> pair. To signal the user fence, the +specified value is written at the specified virtual address and the +waiting process is woken up. User can wait on a user fence with the +gem_wait_user_fence ioctl.
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify +interrupt within their batches after updating the value to have sub-batch +precision on the wakeup. Each batch can signal a user fence to indicate +the completion of next level batch. The completion of the very first level batch +needs to be signaled by the command streamer. The user must provide the +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE +extension of execbuff ioctl, so that KMD can set up the command streamer to +signal it.
+User/Memory fence can also be supplied to the kernel driver to signal/wake up +the user process after completion of an asynchronous operation.
+When VM_BIND ioctl is provided with a user/memory fence via the +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion +of binding of that mapping. All async binds/unbinds are serialized, hence +signaling of user/memory fence also indicates the completion of all previous +binds/unbinds.
+This feature will be derived from the below original work: +https://patchwork.freedesktop.org/patch/349417/
This is 1:1 tied to long running compute mode contexts (which in the uapi doc must reference the endless amounts of bikeshed summary we have in the docs about indefinite fences).
Ok, will check and add reference.
I'd put this into a new section about compute and userspace memory fences support, with this and the next chapter ...
ok
+VM_BIND use cases +==================
... and then make this section here focus entirely on additional vm_bind use-cases that we'll be adding later on. Which doesn't need to go into any details, it's just justification for why we want to build the world on top of vm_bind.
ok
+Long running Compute contexts +------------------------------ +Usage of dma-fence expects that they complete in reasonable amount of time. +Compute on the other hand can be long running. Hence it is appropriate for +compute to use user/memory fence and dma-fence usage will be limited to +in-kernel consumption only. This requires an execbuff uapi extension to pass +in user fence. Compute must opt-in for this mechanism with +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
+The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence +and implicit dependency setting are not allowed on long running contexts.
+Where GPU page faults are not available, kernel driver upon buffer invalidation +will initiate a suspend (preemption) of long running context with a dma-fence +attached to it. And upon completion of that suspend fence, finish the +invalidation, revalidate the BO and then resume the compute context. This is +done by having a per-context fence (called suspend fence) proxying as +i915_request fence. This suspend fence is enabled when there is a wait on it, +which triggers the context preemption.
+This is much easier to support with VM_BIND compared to the current heavier +execbuff path resource attachment.
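The suspend/resume flow described above can be written down as a small state machine. This is a sketch of the described sequence only; the `lr_ctx_model` struct and its helpers are hypothetical names, and real preemption, fencing and BO validation are reduced to flag updates.

```c
#include <assert.h>
#include <stdbool.h>

enum ctx_state { CTX_RUNNING, CTX_SUSPENDING, CTX_SUSPENDED };

/* Hypothetical model of a long running context and its suspend fence. */
struct lr_ctx_model {
	enum ctx_state state;
	bool fence_enabled;	/* someone is waiting on the suspend fence */
	bool fence_signaled;	/* preemption has completed */
	bool bo_valid;
};

static void lr_ctx_init(struct lr_ctx_model *ctx)
{
	ctx->state = CTX_RUNNING;
	ctx->fence_enabled = false;
	ctx->fence_signaled = false;
	ctx->bo_valid = true;
}

/* A waiter appears on the suspend fence: this triggers the preemption. */
static void suspend_fence_enable(struct lr_ctx_model *ctx)
{
	if (ctx->fence_enabled)
		return;
	ctx->fence_enabled = true;
	ctx->state = CTX_SUSPENDING;
}

/* The context is off the hardware: the suspend fence signals. */
static void preemption_done(struct lr_ctx_model *ctx)
{
	assert(ctx->state == CTX_SUSPENDING);
	ctx->state = CTX_SUSPENDED;
	ctx->fence_signaled = true;
}

/* Invalidation path: returns false while the suspend fence is still pending. */
static bool invalidate_revalidate_resume(struct lr_ctx_model *ctx)
{
	if (!ctx->fence_signaled)
		return false;

	ctx->bo_valid = false;		/* finish the invalidation */
	ctx->bo_valid = true;		/* revalidate the BO */
	ctx->fence_enabled = false;	/* arm a fresh suspend fence */
	ctx->fence_signaled = false;
	ctx->state = CTX_RUNNING;	/* resume the compute context */
	return true;
}
```

The point of the model is the ordering: the invalidation cannot complete until the suspend fence has signaled, and the fence only starts working towards that once something waits on it.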
There's a bunch of tricky code around compute mode context support, like the preempt ctx fence (or suspend fence or whatever you want to call it), and the resume work. And I think that code should be shared across drivers.
I think the right place to put this is into drm/sched, somewhere attached to the drm_sched_entity structure. I expect i915 folks to collaborate with amd and ideally also get amdkfd to adopt the same thing if possible. At least Christian has mentioned in the past that he's a bit unhappy about how this works.
Also drm/sched has dependency tracking, which will be needed to pipeline context resume operations. That needs to be used instead of i915-gem inventing yet another dependency tracking data structure (it already has 3 and that's roughly 3 too many).
This means compute mode support and userspace memory fences are blocked on the drm/sched conversion, but *eh* add it to the list of reasons for why drm/sched needs to happen.
Also since we only have support for compute mode ctx in our internal tree with the guc scheduler backend anyway, and the first conversion target is the guc backend, I don't think this actually holds up a lot of the code.
Hmm...ok. Currently, the context suspend and resume operations in our internal tree are through an orthogonal guc interface (not through scheduler). So, I need to look more into this part.
+Low Latency Submission +----------------------- +Allows compute UMD to directly submit GPU jobs instead of through execbuff +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
This is really just a special case of compute mode contexts, I think I'd include that in there, but explain better what it requires (i.e. vm_bind not being synchronized against execbuf).
ok
+Debugger +--------- +With debug event interface user space process (debugger) is able to keep track +of and act upon resources created by another process (debuggee) and attached +to GPU via vm_bind interface.
+Mesa/Vulkan +------------ +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving +performance. For Vulkan it should be straightforward to use VM_BIND. +For Iris implicit buffer tracking must be implemented before we can harness +VM_BIND benefits. With increasing GPU hardware performance reducing CPU +overhead becomes more important.
Just to clarify, I don't think we can land vm_bind into upstream if it doesn't work 100% for vk. There's a bit much "can" instead of "will" in this section.
ok, will explain better.
+Page level hints settings +-------------------------- +VM_BIND allows any hints setting per mapping instead of per BO. +Possible hints include read-only, placement and atomicity. +Sub-BO level placement hint will be even more relevant with +upcoming GPU on-demand page fault support.
+Page level Cache/CLOS settings +------------------------------- +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+Shared Virtual Memory (SVM) support +------------------------------------ +VM_BIND interface can be used to map system memory directly (without gem BO +abstraction) using the HMM interface.
Userptr is absent here (and it's not the same as svm, at least on discrete), and this is needed for the initial version since otherwise vk can't use it because we're not at feature parity.
userptr gem objects are supported in initial version (and yes it is not same as SVM). I did not add it here as there is no additional uapi change required to support that.
Irc discussions by Maarten and Dave came up with the idea that maybe userptr for vm_bind should work _without_ any gem bo as backing storage, since that guarantees that people don't come up with funny ideas like trying to share such bo across process or mmap it and other nonsense which just doesn't work.
Hmm...there is no plan to support userptr _without_ gem bo, at least not in the initial vm_bind support. Is it Ok to put it in the 'futures' section?
+Broader i915 cleanups +===================== +Supporting this whole new vm_bind mode of binding which comes with its own +usecases to support and the locking requirements requires proper integration +with the existing i915 driver. This calls for some broader i915 driver +cleanups/simplifications for maintainability of the driver going forward. +Here are a few things that have been identified and are being looked into.
+- Make pagetable allocations evictable and manage them similar to VM_BIND
- mapped objects. Page table pages are similar to persistent mappings of a
- VM (the difference here is that the page table pages will not
- have an i915_vma structure and after swapping pages back in, parent page
- link needs to be updated).
See above, but I think this should be included as part of the initial vm_bind push.
Ok, as you mentioned above, we can do it soon after initial vm_bind support lands, but before we add any new vm_bind features.
+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature
- does not use it, and the complexity it brings in is probably more than the
- performance advantage we get in legacy execbuff case.
+- Remove vma->open_count counting +- Remove i915_vma active reference tracking. Instead use underlying BO's
- dma-resv fence list to determine if a i915_vma is active or not.
So this is a complete mess, and really should not exist. I think it needs to be removed before we try to make i915_vma even more complex by adding vm_bind.
Hmm...Need to look into this. I am not sure how much of an effort it is going to be to remove i915_vma active reference tracking and instead use dma_resv fences for activeness tracking.
The other thing I've been pondering here is that vm_bind is really completely different from legacy vm structures for a lot of reasons:
- no relocation or softpin handling, which means vm_bind has no reason to
ever look at the i915_vma structure in execbuf code. Unfortunately execbuf has been rewritten to be vma instead of obj centric, so it's a 100% mismatch
- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
that because the kernel manages the virtual address space fully. Again ideally that entire vma_move_to_active code and everything related to it would simply not exist.
- similar on the eviction side, the rules are quite different: For vm_bind
we never tear down the vma, instead it's just moved to the list of evicted vma. Legacy vm have no need for all these additional lists, so another huge confusion.
- if the refcount is done correctly for vm_bind we wouldn't need the
tricky code in the bo close paths. Unfortunately legacy vm with relocations and softpin require that vma are only a weak reference, so that cannot be removed.
- there's also a ton of special cases for ggtt handling, like the
different views (for display, partial views for mmap), but also the gen2/3 alignment and padding requirements which vm_bind never needs.
I think the right thing here is to massively split the implementation behind some solid vm/vma abstraction, with a base class for vm and vma which _only_ has the pieces which both vm_bind and the legacy vm stuff needs. But it's a bit tricky to get there. I think a workable path would be:
- Add a new base class to both i915_address_space and i915_vma, which
starts out empty.
- As vm_bind code lands, move things that vm_bind code needs into these
base classes
Ok
- The goal should be that these base classes are a stand-alone library
that other drivers could reuse. Like we've done with the buddy allocator, which first moved from i915-gem to i915-ttm, and which amd now moved to drm/ttm for reuse by amdgpu. Ideally other drivers interested in adding something like vm_bind should be involved from the start (or maybe the entire thing reused in amdgpu, they're looking at vk sparse binding support too or at least have perf issues I think).
- Locking must be the same across all implementations, otherwise it's
really not an abstraction. i915 screwed this up terribly by having different locking rules for ppgtt and ggtt, which is just nonsense.
- The legacy specific code needs to be extracted as much as possible and
shoved into separate files. In execbuf this means we need to get back to object centric flow, and the slowpaths need to become a lot simpler again (Maarten has cleaned up some of this, but there's still a silly amount of hacks in there with funny layering).
This also, we can do soon after vm_bind code lands right?
- I think if stuff like the vma eviction details (list movement and
locking and refcounting of the underlying object)
+These can be worked upon after initial vm_bind support is added.
I don't think that works, given how badly i915-gem team screwed up in other places. And those places had to be fixed by adopting shared code like ttm. Plus there's already a huge unfulfilled promise pending with the drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
Hmmm ok. As I mentioned above, I need to look into how to remove i915_vma active reference tracking code from i915 driver. Wonder if there is any middle ground here like not using that in vm_bind mode?
Niranjana
Cheers, Daniel
+UAPI +===== +Uapi definition can be found here: +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst index 91e93a705230..7d10c36b268d 100644 --- a/Documentation/gpu/rfc/index.rst +++ b/Documentation/gpu/rfc/index.rst @@ -23,3 +23,7 @@ host such documentation: .. toctree::
i915_scheduler.rst
+.. toctree::
- i915_vm_bind.rst
-- 2.21.0.rc0.32.g243a4c7e27
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Apr 27, 2022 at 08:41:35AM -0700, Niranjana Vishwanathapura wrote:
On Wed, Apr 20, 2022 at 03:45:25PM -0700, Niranjana Vishwanathapura wrote:
On Thu, Mar 31, 2022 at 10:28:48AM +0200, Daniel Vetter wrote:
Adding a pile of people who've expressed interest in vm_bind for their drivers.
Also note to the intel folks: This is largely written with me having my subsystem co-maintainer hat on, i.e. what I think is the right thing to do here for the subsystem at large. There is substantial rework involved here, but it's not any different from i915 adopting ttm or i915 adopting drm/sched, and I do think this stuff needs to happen in one form or another.
On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
VM_BIND design document with description of intended use cases.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++ Documentation/gpu/rfc/index.rst | 4 + 2 files changed, 214 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..cdc6bb25b942 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,210 @@ +========================================== +I915 VM_BIND feature design and use cases +==========================================
+VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allow UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BO at specified GPU virtual addresses on +a specified address space (VM).
+These mappings (also referred to as persistent mappings) will be persistent +across multiple GPU submissions (execbuff) issued by the UMD, without user +having to provide a list of all required mappings during each submission +(as required by older execbuff mode).
+VM_BIND ioctl defers binding the mappings until next execbuff submission +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE +flag is set (useful if mapping is required for an active context).
So this is a screw-up I've done, and for upstream I think we need to fix it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and I was wrong suggesting we should do this a few years back when we kicked this off internally :-(
What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few things on top:
- in and out fences, like with execbuf, to allow userspace to sync with
execbuf as needed
- for compute-mode context this means userspace memory fences
- for legacy context this means a timeline syncobj in drm_syncobj
No sync_file or anything else like this at all. This means a bunch of work, but also it'll have benefits because it means we should be able to use exactly the same code paths and logic for both compute and for legacy context, because drm_syncobj support future fence semantics.
Thanks Daniel, Ok, will update
I had a long conversation with Daniel on some of the points discussed here. Thanks to Daniel for clarifying many points here.
Here is the summary of the discussion.
- A prep patch is needed to update documentation of some existing uapi and this
new VM_BIND uapi can update/refer to that. I will include this prep patch in the next revision of this RFC series. Will also include the uapi header file in the rst file so that it gets rendered.
- Will update documentation here with proper use of dma_resv_usage while adding
fences to vm_bind objects. It is going to be DMA_RESV_USAGE_BOOKKEEP by default, unless overridden by the execlist in the execbuff path.
- Add extension to execbuff ioctl to specify batch buffer as GPU virtual address
instead of having to pass it as a BO handle in execlist. This will also make the execlist usage solely for implicit sync setting which is further discussed below.
- Need to look into when Jason's dma-buf fence import/export ioctl support will
land and whether it will be used both for vk and gl. Need to sync with Jason on this. Probably the better option here would be to not support execlist in execbuff path in vm_bind mode for initial vm_bind support (hoping Jason's dma-buf fence import/export ioctl will be enough). We can add support for execlist in execbuff for vm_bind mode later if required (say for gl).
So I'm again less sure whether the import/export ioctl is the right thing for gl, but I still think we should try. The reason is that we really need to set the implicit sync set per execbuf, otherwise there's oversync issues. So one of the ideas we've discussed where the implicit sync set would be controlled through vm_bind doesn't work for gl. -Daniel
- There are a lot of things in the execbuff path that don't apply in VM_BIND mode (like
relocations, implicit sync etc). Separate them out by using function pointers wherever the functionality differs between current design and the newer VM_BIND design.
- Separate out i915_vma active reference counting in execbuff path and do not use it in
VM_BIND mode. Instead use dma-resv fence checking for VM_BIND mode. This should be easier to get it working with the current TTM backend (which initial VM_BIND support will use). And remove i915_vma active reference counting fully while supporting TTM backend for igfx.
- As we support compute mode contexts only with GuC scheduler backend and compute mode requires
support for suspend and resume of contexts, it will have a dependency on i915 drm scheduler conversion.
Will revise this series accordingly.
Thanks, Niranjana
Also on the implementation side we still need to install dma_fence to the various dma_resv, and for this we need the new dma_resv_usage series from Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING flag to make sure they never result in an oversync issue with execbuf. I don't think trying to land vm_bind without that prep work in dma_resv_usage makes sense.
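To illustrate why bookkeeping fences avoid oversync, here is a small model of the usage levels in plain C. It is a sketch, not the kernel API: the `usage_model` enum mirrors the ordering of dma_resv_usage as I understand it (kernel, write, read, bookkeep), and the assumption is that a query at a given level returns every fence whose usage is at or below that level. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Local model of the usage ordering; lower values are included in higher queries. */
enum usage_model {
	USAGE_KERNEL,
	USAGE_WRITE,
	USAGE_READ,
	USAGE_BOOKKEEP,
};

struct fence_slot {
	enum usage_model usage;
};

/* Count the fences a waiter would see when querying at level 'query'. */
static size_t count_fences_to_wait(const struct fence_slot *slots, size_t n,
				   enum usage_model query)
{
	size_t i, count = 0;

	assert(slots != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (slots[i].usage <= query)
			count++;

	return count;
}

/* Implicit sync: a writer waits for readers and writers, a reader only for writers. */
static enum usage_model implicit_sync_query(bool writing)
{
	return writing ? USAGE_READ : USAGE_WRITE;
}
```

In this model a fence added at the bookkeep level is never returned to an implicit sync query, so a vm_bind fence installed that way cannot stall an unrelated execbuf, while a query at the bookkeep level (for example before freeing memory) still sees it.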
Ok, but that is not a dependency for this VM_BIND design RFC patch right? I will add this to the documentation here.
Also as soon as dma_resv_usage has landed there's a few cleanups we should do in i915:
- ttm bo moving code should probably simplify a bit (and maybe more of the
code should be pushed as helpers into ttm)
- clflush code should be moved over to using USAGE_KERNEL and the various
hacks and special cases should be ditched. See df94fd05e69e ("drm/i915: expand on the kernel-doc for cache_dirty") for a bit more context
This is still not yet enough, since if a vm_bind races with an eviction we might stall on the new buffers being readied first before the context can continue. This needs some care to make sure that vma which aren't fully bound yet are on a separate list, and vma which are marked for unbinding are removed from the main working set list as soon as possible.
All of these things are relevant for the uapi semantics, which means
- they need to be documented in the uapi kerneldoc, ideally with example
flows
- umd need to ack this
Ok
The other thing here is the async/nonblocking path. I think we still need that one, but again it should not sync with anything going on in execbuf, but simply execute the ioctl code in a kernel thread. The idea here is that this works like a special gpu engine, so that compute and vk can schedule bindings interleaved with rendering. This should be enough to get a performant vk sparse binding/textures implementation.
But I'm not entirely sure on this one, so this definitely needs acks from umds.
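The "special gpu engine" idea above can be sketched as an in-order work queue. This is a toy model in plain C, not a proposal for the implementation: `bind_queue`, its helpers and the seqno scheme are hypothetical, and the kernel thread is replaced by an explicit `bind_queue_run_one()` call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BIND_QUEUE_DEPTH 8

enum bind_op { OP_BIND, OP_UNBIND };

struct bind_work {
	enum bind_op op;
	uint64_t va;
	uint64_t seqno;
};

/* Hypothetical per-VM queue; a zero-initialized queue is empty. */
struct bind_queue {
	struct bind_work ring[BIND_QUEUE_DEPTH];
	unsigned int head;		/* next entry to execute */
	unsigned int tail;		/* next free entry */
	uint64_t next_seqno;
	uint64_t completed_seqno;	/* what an out fence would report */
};

/* Queue a bind/unbind; returns its seqno, or 0 if the queue is full. */
static uint64_t bind_queue_submit(struct bind_queue *q, enum bind_op op, uint64_t va)
{
	struct bind_work *w;

	if (q->tail - q->head == BIND_QUEUE_DEPTH)
		return 0;

	w = &q->ring[q->tail++ % BIND_QUEUE_DEPTH];
	w->op = op;
	w->va = va;
	w->seqno = ++q->next_seqno;
	return w->seqno;
}

/* Worker step: execute the oldest queued operation, strictly in order. */
static bool bind_queue_run_one(struct bind_queue *q)
{
	struct bind_work *w;

	if (q->head == q->tail)
		return false;

	w = &q->ring[q->head++ % BIND_QUEUE_DEPTH];
	assert(w->seqno == q->completed_seqno + 1);
	/* The page table update for w->op at w->va would happen here. */
	q->completed_seqno = w->seqno;
	return true;
}
```

Because the queue is drained in submission order, a completion value of N means every bind or unbind up to N has finished, and nothing here waits on execbuf.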
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension. +A VM in VM_BIND mode will not support older execbuff mode of binding.
+UMDs can still send BOs of these persistent mappings in execlist of execbuff +for specifying BO dependencies (implicit fencing) and to use BO as a batch, +but those BOs should be mapped ahead via vm_bind ioctl.
should or must?
Must, will fix.
Also I'm not really sure that's a great interface. The batchbuffer really only needs to be an address, so maybe all we need is an extension to supply an u64 batchbuffer address instead of trying to retrofit this into an unfitting current uapi.
Yah, this was considered, but was decided to do it as later optimization. But if we were to remove execlist entries completely (ie., no implicit sync also), then we need to do this from the beginning.
And for implicit sync there's two things:
- for vk I think the right uapi is the dma-buf fence import/export ioctls
from Jason Ekstrand. I think we should land that first instead of hacking funny concepts together
I did not understand fully, can you point to it?
- for gl the dma-buf import/export might not be fast enough, since gl
needs to do a _lot_ of implicit sync. There we might need to use the execbuffer buffer list, but then we should have extremely clear uapi rules which disallow _everything_ except setting the explicit sync uapi
Ok, so then, we still need to support implicit sync in vm_bind mode. Right?
Again all this stuff needs to be documented in detail in the kerneldoc uapi spec.
ok
+VM_BIND features include, +- Multiple Virtual Address (VA) mappings can map to the same physical pages
- of an object (aliasing).
+- VA mapping can map to a partial section of the BO (partial binding). +- Support capture of persistent mappings in the dump upon GPU error. +- TLB is flushed upon unbind completion. Batching of TLB flushes in some
- usecases will be helpful.
+- Asynchronous vm_bind and vm_unbind support. +- VM_BIND uses user/memory fence mechanism for signaling bind completion
- and for signaling batch completion in long running contexts (explained
- below).
This should all be in the kerneldoc.
ok
+VM_PRIVATE objects +------------------ +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
Two things:
- I think the above is required to for initial vm_bind for vk, it kinda
doesn't make much sense without that, and will allow us to match amdgpu and radeonsi
- Christian König just landed ttm bulk lru helpers, and I think we need to
use those. This means vm_bind will only work with the ttm backend, but that's what we have for the big dgpu where vm_bind helps more in terms of performance, and the igfx conversion to ttm is already going on.
ok
Furthermore the i915 shrinker lru has stopped being an lru, so I think that should also be moved over to the ttm lru in some fashion to make sure we once again have a reasonable and consistent memory aging and reclaim architecture. The current code is just too much of a complete mess.
And since this is all fairly integral to how the code arch works I don't think merging a different version which isn't based on ttm bulk lru helpers makes sense.
Also I do think the page table lru handling needs to be included here, because that's another complete hand-rolled separate world for not much good reasons. I guess that can happen in parallel with the initial vm_bind bring-up, but it needs to be completed by the time we add the features beyond the initial support needed for vk.
Ok
+VM_BIND locking hirarchy +------------------------- +VM_BIND locking order is as below.
+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/
- vm_unbind ioctl calls, in the execbuff path and while releasing the mapping.
- In future, when GPU page faults are supported, we can potentially use a
- rwsem instead, so that multiple pagefault handlers can take the read side
- lock to lookup the mapping and hence can run in parallel.
+2) The BO's dma-resv lock will protect i915_vma state and needs to be held
- while binding a vma and while updating dma-resv fence list of a BO.
- The private BOs of a VM will all share a dma-resv object.
- This lock is held in vm_bind call for immediate binding, during vm_unbind
- call for unbinding and during execbuff path for binding the mapping and
- updating the dma-resv fence list of the BO.
+3) Spinlock/s to protect some of the VM's lists.
+We will also need support for bluk LRU movement of persistent mapping to +avoid additional latencies in execbuff path.
This needs more detail and explanation of how each level is required. Also the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
Like "some of the VM's lists" explains pretty much nothing.
Ok, will explain.
+GPU page faults +---------------- +Both older execbuff mode and the newer VM_BIND mode of binding will require +using dma-fence to ensure residency. +In future when GPU page faults are supported, no dma-fence usage is required +as residency is purely managed by installing and removing/invalidating ptes.
This is a bit confusing. I think one part of this should be moved into the section with future vm_bind use-cases (we're not going to support page faults with legacy softpin or even worse, relocations). The locking discussion should be part of the much longer list of uses cases that motivate the locking design.
Ok, will move.
+User/Memory Fence +================== +The idea is to take a user specified virtual address and install an interrupt +handler to wake up the current task when the memory location passes the user +supplied filter.
+User/Memory fence is a <address, value> pair. To signal the user fence, +specified value will be written at the specified virtual address and +wakeup the waiting process. User can wait on an user fence with the +gem_wait_user_fence ioctl.
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify +interrupt within their batches after updating the value to have sub-batch +precision on the wakeup. Each batch can signal an user fence to indicate +the completion of next level batch. The completion of very first level batch +needs to be signaled by the command streamer. The user must provide the +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE +extension of execbuff ioctl, so that KMD can setup the command streamer to +signal it.
+User/Memory fence can also be supplied to the kernel driver to signal/wake up +the user process after completion of an asynchronous operation.
+When VM_BIND ioctl was provided with a user/memory fence via the +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion +of binding of that mapping. All async binds/unbinds are serialized, hence +signaling of user/memory fence also indicate the completion of all previous +binds/unbinds.
+This feature will be derived from the below original work: +https://patchwork.freedesktop.org/patch/349417/
This is 1:1 tied to long running compute mode contexts (which in the uapi doc must reference the endless amounts of bikeshed summary we have in the docs about indefinite fences).
Ok, will check and add reference.
I'd put this into a new section about compute and userspace memory fences support, with this and the next chapter ...
ok
+VM_BIND use cases +==================
... and then make this section here focus entirely on additional vm_bind use-cases that we'll be adding later on. Which doesn't need to go into any details, it's just justification for why we want to build the world on top of vm_bind.
ok
+Long running Compute contexts +------------------------------ +Usage of dma-fence expects that they complete in reasonable amount of time. +Compute on the other hand can be long running. Hence it is appropriate for +compute to use user/memory fence and dma-fence usage will be limited to +in-kernel consumption only. This requires an execbuff uapi extension to pass +in user fence. Compute must opt-in for this mechanism with +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
+The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence +and implicit dependency setting is not allowed on long running contexts.
+Where GPU page faults are not available, kernel driver upon buffer invalidation +will initiate a suspend (preemption) of long running context with a dma-fence +attached to it. And upon completion of that suspend fence, finish the +invalidation, revalidate the BO and then resume the compute context. This is +done by having a per-context fence (called suspend fence) proxying as +i915_request fence. This suspend fence is enabled when there is a wait on it, +which triggers the context preemption.
+This is much easier to support with VM_BIND compared to the current heavier +execbuff path resource attachment.
There's a bunch of tricky code around compute mode context support, like the preempt ctx fence (or suspend fence or whatever you want to call it), and the resume work. And I think that code should be shared across drivers.
I think the right place to put this is into drm/sched, somewhere attached to the drm_sched_entity structure. I expect i915 folks to collaborate with amd and ideally also get amdkfd to adopt the same thing if possible. At least Christian has mentioned in the past that he's a bit unhappy about how this works.
Also drm/sched has dependency tracking, which will be needed to pipeline context resume operations. That needs to be used instead of i915-gem inventing yet another dependency tracking data structure (it already has 3 and that's roughly 3 too many).
This means compute mode support and userspace memory fences are blocked on the drm/sched conversion, but *eh* add it to the list of reasons for why drm/sched needs to happen.
Also since we only have support for compute mode ctx in our internal tree with the guc scheduler backend anyway, and the first conversion target is the guc backend, I don't think this actually holds up a lot of the code.
Hmm...ok. Currently, the context suspend and resume operations in our internal tree are through an orthogonal guc interface (not through scheduler). So, I need to look more into this part.
+Low Latency Submission +----------------------- +Allows compute UMD to directly submit GPU jobs instead of through execbuff +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
This is really just a special case of compute mode contexts, I think I'd include that in there, but explain better what it requires (i.e. vm_bind not being synchronized against execbuf).
ok
+Debugger +--------- +With debug event interface user space process (debugger) is able to keep track +of and act upon resources created by another process (debuggee) and attached +to GPU via vm_bind interface.
+Mesa/Vulkan +------------ +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving +performance. For Vulkan it should be straightforward to use VM_BIND. +For Iris implicit buffer tracking must be implemented before we can harness +VM_BIND benefits. With increasing GPU hardware performance reducing CPU +overhead becomes more important.
Just to clarify, I don't think we can land vm_bind into upstream if it doesn't work 100% for vk. There's a bit much "can" instead of "will" in this section.
ok, will explain better.
+Page level hints settings +-------------------------- +VM_BIND allows any hints setting per mapping instead of per BO. +Possible hints include read-only, placement and atomicity. +Sub-BO level placement hint will be even more relevant with +upcoming GPU on-demand page fault support.
+Page level Cache/CLOS settings +------------------------------- +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+Shared Virtual Memory (SVM) support +------------------------------------ +VM_BIND interface can be used to map system memory directly (without gem BO +abstraction) using the HMM interface.
Userptr is absent here (and it's not the same as svm, at least on discrete), and this is needed for the initial version since otherwise vk can't use it because we're not at feature parity.
userptr gem objects are supported in initial version (and yes it is not same as SVM). I did not add it here as there is no additional uapi change required to support that.
Irc discussions by Maarten and Dave came up with the idea that maybe userptr for vm_bind should work _without_ any gem bo as backing storage, since that guarantees that people don't come up with funny ideas like trying to share such bo across process or mmap it and other nonsense which just doesn't work.
Hmm...there is no plan to support userptr _without_ gem bo, at least not in the initial vm_bind support. Is it Ok to put it in the 'futures' section?
+Broader i915 cleanups +====================== +Supporting this whole new vm_bind mode of binding which comes with its own +usecases to support and the locking requirements requires proper integration +with the existing i915 driver. This calls for some broader i915 driver +cleanups/simplifications for maintainability of the driver going forward. +Here are a few things identified that are being looked into.
+- Make pagetable allocations evictable and manage them similar to VM_BIND mapped objects. Page table pages are similar to persistent mappings of a VM (the differences here are that the page table pages will not have an i915_vma structure and, after swapping pages back in, the parent page link needs to be updated).
See above, but I think this should be included as part of the initial vm_bind push.
Ok, as you mentioned above, we can do it soon after initial vm_bind support lands, but before we add any new vm_bind features.
+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature does not use it and the complexity it brings in is probably more than the performance advantage we get in legacy execbuff case.
+- Remove vma->open_count counting
+- Remove i915_vma active reference tracking. Instead use underlying BO's dma-resv fence list to determine if an i915_vma is active or not.
So this is a complete mess, and really should not exist. I think it needs to be removed before we try to make i915_vma even more complex by adding vm_bind.
Hmm...Need to look into this. I am not sure how much of an effort it is going to be to remove i915_vma active reference tracking and instead use dma_resv fences for activeness tracking.
The other thing I've been pondering here is that vm_bind is really completely different from legacy vm structures for a lot of reasons:
- no relocation or softpin handling, which means vm_bind has no reason to
ever look at the i915_vma structure in execbuf code. Unfortunately execbuf has been rewritten to be vma instead of obj centric, so it's a 100% mismatch
- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
that because the kernel manages the virtual address space fully. Again ideally that entire vma_move_to_active code and everything related to it would simply not exist.
- similar on the eviction side, the rules are quite different: For vm_bind
we never tear down the vma, instead it's just moved to the list of evicted vma. Legacy vm have no need for all these additional lists, so another huge confusion.
- if the refcount is done correctly for vm_bind we wouldn't need the
tricky code in the bo close paths. Unfortunately legacy vm with relocations and softpin require that vma are only a weak reference, so that cannot be removed.
- there's also a ton of special cases for ggtt handling, like the
different views (for display, partial views for mmap), but also the gen2/3 alignment and padding requirements which vm_bind never needs.
I think the right thing here is to massively split the implementation behind some solid vm/vma abstraction, with a base class for vm and vma which _only_ has the pieces which both vm_bind and the legacy vm stuff needs. But it's a bit tricky to get there. I think a workable path would be:
- Add a new base class to both i915_address_space and i915_vma, which
starts out empty.
- As vm_bind code lands, move things that vm_bind code needs into these
base classes
Ok
- The goal should be that these base classes are a stand-alone library
that other drivers could reuse. Like we've done with the buddy allocator, which first moved from i915-gem to i915-ttm, and which amd now moved to drm/ttm for reuse by amdgpu. Ideally other drivers interested in adding something like vm_bind should be involved from the start (or maybe the entire thing reused in amdgpu, they're looking at vk sparse binding support too or at least have perf issues I think).
- Locking must be the same across all implementations, otherwise it's really not an abstraction. i915 screwed this up terribly by having different locking rules for ppgtt and ggtt, which is just nonsense.
- The legacy specific code needs to be extracted as much as possible and
shoved into separate files. In execbuf this means we need to get back to object centric flow, and the slowpaths need to become a lot simpler again (Maarten has cleaned up some of this, but there's still a silly amount of hacks in there with funny layering).
This also, we can do soon after vm_bind code lands right?
- I think if stuff like the vma eviction details (list movement and
locking and refcounting of the underlying object)
+These can be worked upon after initial vm_bind support is added.
I don't think that works, given how badly i915-gem team screwed up in other places. And those places had to be fixed by adopting shared code like ttm. Plus there's already a huge unfulfilled promise pending with the drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
Hmmm ok. As I mentioned above, I need to look into how to remove i915_vma active reference tracking code from i915 driver. Wonder if there is any middle ground here like not using that in vm_bind mode?
Niranjana
Cheers, Daniel
+UAPI +===== +Uapi definition can be found here: +.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst index 91e93a705230..7d10c36b268d 100644 --- a/Documentation/gpu/rfc/index.rst +++ b/Documentation/gpu/rfc/index.rst @@ -23,3 +23,7 @@ host such documentation: .. toctree::
i915_scheduler.rst
+.. toctree::
+    i915_vm_bind.rst
-- 2.21.0.rc0.32.g243a4c7e27
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Apr 27, 2022 at 08:41:35AM -0700, Niranjana Vishwanathapura wrote:
On Wed, Apr 20, 2022 at 03:45:25PM -0700, Niranjana Vishwanathapura wrote:
On Thu, Mar 31, 2022 at 10:28:48AM +0200, Daniel Vetter wrote:
Adding a pile of people who've expressed interest in vm_bind for their drivers.
Also note to the intel folks: This is largely written with me having my subsystem co-maintainer hat on, i.e. what I think is the right thing to do here for the subsystem at large. There is substantial rework involved here, but it's not any different from i915 adopting ttm or i915 adopting drm/sched, and I do think this stuff needs to happen in one form or another.
On Mon, Mar 07, 2022 at 12:31:45PM -0800, Niranjana Vishwanathapura wrote:
VM_BIND design document with description of intended use cases.
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.rst | 210 +++++++++++++++++++++++++ Documentation/gpu/rfc/index.rst | 4 + 2 files changed, 214 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.rst
diff --git a/Documentation/gpu/rfc/i915_vm_bind.rst b/Documentation/gpu/rfc/i915_vm_bind.rst new file mode 100644 index 000000000000..cdc6bb25b942 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.rst @@ -0,0 +1,210 @@ +========================================== +I915 VM_BIND feature design and use cases +==========================================
+VM_BIND feature +================ +DRM_I915_GEM_VM_BIND/UNBIND ioctls allows UMD to bind/unbind GEM buffer +objects (BOs) or sections of a BOs at specified GPU virtual addresses on +a specified address space (VM).
+These mappings (also referred to as persistent mappings) will be persistent +across multiple GPU submissions (execbuff) issued by the UMD, without user +having to provide a list of all required mappings during each submission +(as required by older execbuff mode).
+VM_BIND ioctl defers binding the mappings until next execbuff submission +where it will be required, or immediately if I915_GEM_VM_BIND_IMMEDIATE +flag is set (useful if mapping is required for an active context).
So this is a screw-up I've done, and for upstream I think we need to fix it: Implicit sync is bad, and it's also still a bad idea for vm_bind, and I was wrong suggesting we should do this a few years back when we kicked this off internally :-(
What I think we need is just always VM_BIND_IMMEDIATE mode, and then a few things on top:
- in and out fences, like with execbuf, to allow userspace to sync with
execbuf as needed
- for compute-mode context this means userspace memory fences
- for legacy context this means a timeline syncobj in drm_syncobj
No sync_file or anything else like this at all. This means a bunch of work, but also it'll have benefits because it means we should be able to use exactly the same code paths and logic for both compute and for legacy context, because drm_syncobj support future fence semantics.
Thanks Daniel, Ok, will update
I had a long conversation with Daniel on some of the points discussed here. Thanks to Daniel for clarifying many points here.
Here is the summary of the discussion.
- A prep patch is needed to update documentation of some existing uapi and this
new VM_BIND uapi can update/refer to that. I will include this prep patch in the next revision of this RFC series. Will also include the uapi header file in the rst file so that it gets rendered.
- Will update documentation here with proper use of dma_resv_usage while adding fences to vm_bind objects. It is going to be DMA_RESV_USAGE_BOOKKEEP by default, unless overridden by the execlist in the execbuff path.
- Add extension to execbuff ioctl to specify batch buffer as GPU virtual address
instead of having to pass it as a BO handle in execlist. This will also make the execlist usage solely for implicit sync setting which is further discussed below.
- Need to look into when Jason's dma-buf fence import/export ioctl support will land and whether it will be used for both vk and gl. Need to sync with Jason on this. Probably the better option here would be to not support execlist in execbuff path in vm_bind mode for initial vm_bind support (hoping Jason's dma-buf fence import/export ioctl will be enough). We can add support for execlist in execbuff for vm_bind mode later if required (say for gl).
- There are a lot of things in the execbuff path that don't apply in VM_BIND mode (like relocations, implicit sync etc). Separate them out by using function pointers wherever the functionality differs between current design and the newer VM_BIND design.
- Separate out i915_vma active reference counting in execbuff path and do not use it in
VM_BIND mode. Instead use dma-resv fence checking for VM_BIND mode. This should be easier to get it working with the current TTM backend (which initial VM_BIND support will use). And remove i915_vma active reference counting fully while supporting TTM backend for igfx.
- As we support compute mode contexts only with GuC scheduler backend and compute mode requires
support for suspend and resume of contexts, it will have a dependency on i915 drm scheduler conversion.
Will revise this series accordingly.
I was prototyping some of these and they look good. Still need to address a few open items on dma-resv fence usage for VM_BIND, like how to effectively update the fence list during VM_BIND (for non VM private objects).
I will be addressing these review comments and hoping to post updated patch series by the end of this week or so.
Thanks, Niranjana
Thanks, Niranjana
Also on the implementation side we still need to install dma_fence to the various dma_resv, and for this we need the new dma_resv_usage series from Christian König first. vm_bind fences can then use the USAGE_BOOKKEEPING flag to make sure they never result in an oversync issue with execbuf. I don't think trying to land vm_bind without that prep work in dma_resv_usage makes sense.
Ok, but that is not a dependency for this VM_BIND design RFC patch right? I will add this to the documentation here.
Also as soon as dma_resv_usage has landed there's a few cleanups we should do in i915:
- ttm bo moving code should probably simplify a bit (and maybe more of the
code should be pushed as helpers into ttm)
- clflush code should be moved over to using USAGE_KERNEL and the various
hacks and special cases should be ditched. See df94fd05e69e ("drm/i915: expand on the kernel-doc for cache_dirty") for a bit more context
This is still not yet enough, since if a vm_bind races with an eviction we might stall on the new buffers being readied first before the context can continue. This needs some care to make sure that vma which aren't fully bound yet are on a separate list, and vma which are marked for unbinding are removed from the main working set list as soon as possible.
All of these things are relevant for the uapi semantics, which means
- they need to be documented in the uapi kerneldoc, ideally with example
flows
- umd need to ack this
Ok
The other thing here is the async/nonblocking path. I think we still need that one, but again it should not sync with anything going on in execbuf, but simply execute the ioctl code in a kernel thread. The idea here is that this works like a special gpu engine, so that compute and vk can schedule bindings interleaved with rendering. This should be enough to get a performant vk sparse binding/textures implementation.
But I'm not entirely sure on this one, so this definitely needs acks from umds.
+VM_BIND feature is advertised to user via I915_PARAM_HAS_VM_BIND. +User has to opt-in for VM_BIND mode of binding for an address space (VM) +during VM creation time via I915_VM_CREATE_FLAGS_USE_VM_BIND extension. +A VM in VM_BIND mode will not support older execbuff mode of binding.
+UMDs can still send BOs of these persistent mappings in execlist of execbuff +for specifying BO dependencies (implicit fencing) and to use BO as a batch, +but those BOs should be mapped ahead via vm_bind ioctl.
should or must?
Must, will fix.
Also I'm not really sure that's a great interface. The batchbuffer really only needs to be an address, so maybe all we need is an extension to supply an u64 batchbuffer address instead of trying to retrofit this into an unfitting current uapi.
Yah, this was considered, but was decided to do it as later optimization. But if we were to remove execlist entries completely (ie., no implicit sync also), then we need to do this from the beginning.
And for implicit sync there's two things:
- for vk I think the right uapi is the dma-buf fence import/export ioctls
from Jason Ekstrand. I think we should land that first instead of hacking funny concepts together
I did not understand fully, can you point to it?
- for gl the dma-buf import/export might not be fast enough, since gl
needs to do a _lot_ of implicit sync. There we might need to use the execbuffer buffer list, but then we should have extremely clear uapi rules which disallow _everything_ except setting the explicit sync uapi
Ok, so then, we still need to support implicit sync in vm_bind mode. Right?
Again all this stuff needs to be documented in detail in the kerneldoc uapi spec.
ok
+VM_BIND features include,
+- Multiple Virtual Address (VA) mappings can map to the same physical pages of an object (aliasing).
+- VA mapping can map to a partial section of the BO (partial binding).
+- Support capture of persistent mappings in the dump upon GPU error.
+- TLB is flushed upon unbind completion. Batching of TLB flushes in some usecases will be helpful.
+- Asynchronous vm_bind and vm_unbind support.
+- VM_BIND uses user/memory fence mechanism for signaling bind completion and for signaling batch completion in long running contexts (explained below).
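To make the aliasing and partial binding items concrete, here is a minimal sketch of how a persistent mapping could be modelled. The struct and helper below (`example_mapping`, `example_va_to_bo_offset`) are hypothetical illustrations, not the proposed i915 uapi; they only assume that a mapping records a BO, a GPU virtual address, an offset into the BO and a length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one persistent mapping; not the i915 uapi struct. */
struct example_mapping {
	uint32_t bo_handle;  /* which BO backs this VA range */
	uint64_t va_start;   /* GPU virtual address of the mapping */
	uint64_t bo_offset;  /* offset into the BO (non-zero for partial binding) */
	uint64_t length;     /* size of the mapped range in bytes */
};

/*
 * Translate a GPU VA to an offset inside the backing BO.
 * Returns false if the VA falls outside the mapping.
 */
static bool example_va_to_bo_offset(const struct example_mapping *m,
				    uint64_t va, uint64_t *bo_offset)
{
	assert(m && bo_offset);
	if (va < m->va_start || va - m->va_start >= m->length)
		return false;
	*bo_offset = m->bo_offset + (va - m->va_start);
	return true;
}
```

Two mappings with different `va_start` but the same `bo_handle` and `bo_offset` resolve to the same BO offset, which is the aliasing case; a mapping whose `bo_offset` and `length` cover only part of the BO is the partial binding case.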
This should all be in the kerneldoc.
ok
+VM_PRIVATE objects +------------------ +By default, BOs can be mapped on multiple VMs and can also be dma-buf +exported. Hence these BOs are referred to as Shared BOs. +During each execbuff submission, the request fence must be added to the +dma-resv fence list of all shared BOs mapped on the VM.
+VM_BIND feature introduces an optimization where user can create BO which +is private to a specified VM via I915_GEM_CREATE_EXT_VM_PRIVATE flag during +BO creation. Unlike Shared BOs, these VM private BOs can only be mapped on +the VM they are private to and can't be dma-buf exported. +All private BOs of a VM share the dma-resv object. Hence during each execbuff +submission, they need only one dma-resv fence list updated. Thus the fast +path (where required mappings are already bound) submission latency is O(1) +w.r.t the number of VM private BOs.
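The O(1) claim can be illustrated with a small model. This is only a sketch under the assumption that "updating a dma-resv fence list" can be counted as one operation per reservation object; `example_resv`, `example_bo` and both helpers are hypothetical names, not kernel structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dma-resv object: counts fences added to it. */
struct example_resv {
	unsigned int fence_count;
};

/* Hypothetical BO: points at the reservation object it uses. */
struct example_bo {
	struct example_resv *resv;
};

/*
 * Shared BOs: every BO has its own reservation object, so a submission
 * touches one fence list per BO. Returns the number of list updates.
 */
static size_t example_fence_shared(struct example_bo *bos, size_t n)
{
	size_t updates = 0;

	for (size_t i = 0; i < n; i++) {
		assert(bos[i].resv);
		bos[i].resv->fence_count++;
		updates++;
	}
	return updates;
}

/*
 * VM private BOs: all of them point at the VM's single reservation
 * object, so one update covers them no matter how many there are.
 */
static size_t example_fence_private(struct example_resv *vm_resv)
{
	assert(vm_resv);
	vm_resv->fence_count++;
	return 1;
}
```

With N shared BOs the first helper performs N updates, while the second always performs one, and every private BO observes the new fence through the shared reservation object.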
Two things:
- I think the above is required for initial vm_bind for vk, it kinda doesn't make much sense without that, and will allow us to match amdgpu and radeonsi
- Christian König just landed ttm bulk lru helpers, and I think we need to
use those. This means vm_bind will only work with the ttm backend, but that's what we have for the big dgpu where vm_bind helps more in terms of performance, and the igfx conversion to ttm is already going on.
ok
Furthermore the i915 shrinker lru has stopped being an lru, so I think that should also be moved over to the ttm lru in some fashion to make sure we once again have a reasonable and consistent memory aging and reclaim architecture. The current code is just too much of a complete mess.
And since this is all fairly integral to how the code arch works I don't think merging a different version which isn't based on ttm bulk lru helpers makes sense.
Also I do think the page table lru handling needs to be included here, because that's another complete hand-rolled separate world for not much good reasons. I guess that can happen in parallel with the initial vm_bind bring-up, but it needs to be completed by the time we add the features beyond the initial support needed for vk.
Ok
+VM_BIND locking hierarchy +-------------------------- +VM_BIND locking order is as below.
+1) A vm_bind mutex will protect vm_bind lists. This lock is taken in vm_bind/vm_unbind ioctl calls, in the execbuff path and while releasing the mapping. In future, when GPU page faults are supported, we can potentially use a rwsem instead, so that multiple pagefault handlers can take the read side lock to lookup the mapping and hence can run in parallel.
+2) The BO's dma-resv lock will protect i915_vma state and needs to be held while binding a vma and while updating dma-resv fence list of a BO. The private BOs of a VM will all share a dma-resv object. This lock is held in vm_bind call for immediate binding, during vm_unbind call for unbinding and during execbuff path for binding the mapping and updating the dma-resv fence list of the BO.
+3) Spinlock/s to protect some of the VM's lists.
+We will also need support for bulk LRU movement of persistent mapping to +avoid additional latencies in execbuff path.
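The three-level ordering above can be expressed as a small lock-order checker, loosely in the spirit of lockdep. This is a sketch only: the level names and the `example_lock_ctx` bookkeeping are hypothetical, no real locks are taken, and it assumes locks are released in reverse order of acquisition.

```c
#include <assert.h>
#include <stdbool.h>

/* Lock levels in acquisition order, outermost first (names are illustrative). */
enum example_lock_level {
	EXAMPLE_LOCK_NONE = 0,
	EXAMPLE_LOCK_VM_BIND_MUTEX = 1, /* 1) protects the vm_bind lists */
	EXAMPLE_LOCK_DMA_RESV = 2,      /* 2) protects vma state and fence lists */
	EXAMPLE_LOCK_VM_LIST_SPIN = 3,  /* 3) protects some of the VM's lists */
};

#define EXAMPLE_MAX_HELD 4

/* Tracks which levels one thread currently holds, as a small stack. */
struct example_lock_ctx {
	enum example_lock_level held[EXAMPLE_MAX_HELD];
	int depth;
};

/* Acquire succeeds only if the new level nests inside everything held. */
static bool example_lock(struct example_lock_ctx *ctx,
			 enum example_lock_level level)
{
	assert(ctx && level != EXAMPLE_LOCK_NONE);
	if (ctx->depth == EXAMPLE_MAX_HELD)
		return false;
	if (ctx->depth > 0 && ctx->held[ctx->depth - 1] >= level)
		return false; /* would invert the documented order */
	ctx->held[ctx->depth++] = level;
	return true;
}

/* Release must match the most recently acquired level. */
static bool example_unlock(struct example_lock_ctx *ctx,
			   enum example_lock_level level)
{
	assert(ctx);
	if (ctx->depth == 0 || ctx->held[ctx->depth - 1] != level)
		return false;
	ctx->depth--;
	return true;
}
```

Taking the mutex, then the dma-resv lock, then a list spinlock succeeds; attempting the vm_bind mutex while already holding a dma-resv lock is rejected, which is the inversion the documented order is meant to rule out.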
This needs more detail and explanation of how each level is required. Also the shared dma_resv for VM_PRIVATE objects is kinda important to explain.
Like "some of the VM's lists" explains pretty much nothing.
Ok, will explain.
+GPU page faults +---------------- +Both older execbuff mode and the newer VM_BIND mode of binding will require +using dma-fence to ensure residency. +In future when GPU page faults are supported, no dma-fence usage is required +as residency is purely managed by installing and removing/invalidating ptes.
This is a bit confusing. I think one part of this should be moved into the section with future vm_bind use-cases (we're not going to support page faults with legacy softpin or even worse, relocations). The locking discussion should be part of the much longer list of use cases that motivate the locking design.
Ok, will move.
+User/Memory Fence +================== +The idea is to take a user specified virtual address and install an interrupt +handler to wake up the current task when the memory location passes the user +supplied filter.
+User/Memory fence is a <address, value> pair. To signal the user fence, the +specified value will be written at the specified virtual address and the +waiting process will be woken up. User can wait on a user fence with the +gem_wait_user_fence ioctl.
+It also allows the user to emit their own MI_FLUSH/PIPE_CONTROL notify +interrupt within their batches after updating the value to have sub-batch +precision on the wakeup. Each batch can signal an user fence to indicate +the completion of next level batch. The completion of very first level batch +needs to be signaled by the command streamer. The user must provide the +user/memory fence for this via the DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE +extension of execbuff ioctl, so that KMD can setup the command streamer to +signal it.
+User/Memory fence can also be supplied to the kernel driver to signal/wake up +the user process after completion of an asynchronous operation.
+When VM_BIND ioctl is provided with a user/memory fence via the +I915_VM_BIND_EXT_USER_FENCE extension, it will be signaled upon the completion +of binding of that mapping. All async binds/unbinds are serialized, hence +signaling of user/memory fence also indicates the completion of all previous +binds/unbinds.
+This feature will be derived from the below original work: +https://patchwork.freedesktop.org/patch/349417/
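As a rough userspace-only illustration of the <address, value> semantics described in this section, the sketch below models a fence with C11 atomics. The names (`example_user_fence` and its helpers) are hypothetical, the "filter" is assumed to be a plain equality test, and the interrupt-driven wakeup is left out; only the memory write and the check are shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user/memory fence: an <address, value> pair. */
struct example_user_fence {
	_Atomic uint64_t *addr; /* memory location the waiter watches */
	uint64_t value;         /* value written to signal completion */
};

/* Signal side: publish the value; a real implementation would also wake waiters. */
static void example_user_fence_signal(const struct example_user_fence *f)
{
	assert(f && f->addr);
	atomic_store_explicit(f->addr, f->value, memory_order_release);
}

/* Wait side: a non-blocking check whether the fence has signaled. */
static bool example_user_fence_signaled(const struct example_user_fence *f)
{
	assert(f && f->addr);
	return atomic_load_explicit(f->addr, memory_order_acquire) == f->value;
}
```

The release/acquire pairing is a design choice for the sketch: it ensures that whatever the signaler wrote before signaling is visible to a waiter that observes the fence value.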
This is 1:1 tied to long running compute mode contexts (which in the uapi doc must reference the endless amounts of bikeshed summary we have in the docs about indefinite fences).
Ok, will check and add reference.
I'd put this into a new section about compute and userspace memory fences support, with this and the next chapter ...
ok
+VM_BIND use cases +==================
... and then make this section here focus entirely on additional vm_bind use-cases that we'll be adding later on. Which doesn't need to go into any details, it's just justification for why we want to build the world on top of vm_bind.
ok
+Long running Compute contexts +------------------------------ +Usage of dma-fence expects that they complete in reasonable amount of time. +Compute on the other hand can be long running. Hence it is appropriate for +compute to use user/memory fence and dma-fence usage will be limited to +in-kernel consumption only. This requires an execbuff uapi extension to pass +in user fence. Compute must opt-in for this mechanism with +I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING flag during context creation.
+The dma-fence based user interfaces like gem_wait ioctl, execbuff out fence +and implicit dependency setting is not allowed on long running contexts.
+Where GPU page faults are not available, kernel driver upon buffer invalidation +will initiate a suspend (preemption) of long running context with a dma-fence +attached to it. And upon completion of that suspend fence, finish the +invalidation, revalidate the BO and then resume the compute context. This is +done by having a per-context fence (called suspend fence) proxying as +i915_request fence. This suspend fence is enabled when there is a wait on it, +which triggers the context preemption.
+This is much easier to support with VM_BIND compared to the current heavier +execbuff path resource attachment.
There's a bunch of tricky code around compute mode context support, like the preempt ctx fence (or suspend fence or whatever you want to call it), and the resume work. And I think that code should be shared across drivers.
I think the right place to put this is into drm/sched, somewhere attached to the drm_sched_entity structure. I expect i915 folks to collaborate with amd and ideally also get amdkfd to adopt the same thing if possible. At least Christian has mentioned in the past that he's a bit unhappy about how this works.
Also drm/sched has dependency tracking, which will be needed to pipeline context resume operations. That needs to be used instead of i915-gem inventing yet another dependency tracking data structure (it already has 3 and that's roughly 3 too many).
This means compute mode support and userspace memory fences are blocked on the drm/sched conversion, but *eh* add it to the list of reasons for why drm/sched needs to happen.
Also since we only have support for compute mode ctx in our internal tree with the guc scheduler backend anyway, and the first conversion target is the guc backend, I don't think this actually holds up a lot of the code.
Hmm...ok. Currently, the context suspend and resume operations in our internal tree are through an orthogonal guc interface (not through scheduler). So, I need to look more into this part.
+Low Latency Submission +----------------------- +Allows compute UMD to directly submit GPU jobs instead of through execbuff +ioctl. VM_BIND allows map/unmap of BOs required for directly submitted jobs.
This is really just a special case of compute mode contexts, I think I'd include that in there, but explain better what it requires (i.e. vm_bind not being synchronized against execbuf).
ok
+Debugger +--------- +With debug event interface user space process (debugger) is able to keep track +of and act upon resources created by another process (debuggee) and attached +to GPU via vm_bind interface.
+Mesa/Vulkan +------------ +VM_BIND can potentially reduce the CPU-overhead in Mesa thus improving +performance. For Vulkan it should be straightforward to use VM_BIND. +For Iris implicit buffer tracking must be implemented before we can harness +VM_BIND benefits. With increasing GPU hardware performance reducing CPU +overhead becomes more important.
Just to clarify, I don't think we can land vm_bind into upstream if it doesn't work 100% for vk. There's a bit much "can" instead of "will" in this section.
ok, will explain better.
+Page level hints settings +-------------------------- +VM_BIND allows any hints setting per mapping instead of per BO. +Possible hints include read-only, placement and atomicity. +Sub-BO level placement hint will be even more relevant with +upcoming GPU on-demand page fault support.
+Page level Cache/CLOS settings +------------------------------- +VM_BIND allows cache/CLOS settings per mapping instead of per BO.
+Shared Virtual Memory (SVM) support +------------------------------------ +VM_BIND interface can be used to map system memory directly (without gem BO +abstraction) using the HMM interface.
Userptr is absent here (and it's not the same as svm, at least on discrete), and this is needed for the initial version since otherwise vk can't use it because we're not at feature parity.
userptr gem objects are supported in initial version (and yes it is not same as SVM). I did not add it here as there is no additional uapi change required to support that.
Irc discussions by Maarten and Dave came up with the idea that maybe userptr for vm_bind should work _without_ any gem bo as backing storage, since that guarantees that people don't come up with funny ideas like trying to share such bo across process or mmap it and other nonsense which just doesn't work.
Hmm...there is no plan to support userptr _without_ gem bo, at least not in the initial vm_bind support. Is it Ok to put it in the 'futures' section?
+Broader i915 cleanups +====================== +Supporting this whole new vm_bind mode of binding which comes with its own +usecases to support and the locking requirements requires proper integration +with the existing i915 driver. This calls for some broader i915 driver +cleanups/simplifications for maintainability of the driver going forward. +Here are a few things identified that are being looked into.
+- Make pagetable allocations evictable and manage them similar to VM_BIND mapped objects. Page table pages are similar to persistent mappings of a VM (the differences here are that the page table pages will not have an i915_vma structure and, after swapping pages back in, the parent page link needs to be updated).
See above, but I think this should be included as part of the initial vm_bind push.
Ok, as you mentioned above, we can do it soon after initial vm_bind support lands, but before we add any new vm_bind features.
+- Remove vma lookup cache (eb->gem_context->handles_vma). VM_BIND feature does not use it and the complexity it brings in is probably more than the performance advantage we get in legacy execbuff case.
+- Remove vma->open_count counting
+- Remove i915_vma active reference tracking. Instead use underlying BO's dma-resv fence list to determine if an i915_vma is active or not.
So this is a complete mess, and really should not exist. I think it needs to be removed before we try to make i915_vma even more complex by adding vm_bind.
Hmm...Need to look into this. I am not sure how much of an effort it is going to be to remove i915_vma active reference tracking and instead use dma_resv fences for activeness tracking.
The other thing I've been pondering here is that vm_bind is really completely different from legacy vm structures for a lot of reasons:
- no relocation or softpin handling, which means vm_bind has no reason to
ever look at the i915_vma structure in execbuf code. Unfortunately execbuf has been rewritten to be vma instead of obj centric, so it's a 100% mismatch
- vm_bind never has to manage any vm lru. Legacy execbuf has to maintain
that because the kernel manages the virtual address space fully. Again ideally that entire vma_move_to_active code and everything related to it would simply not exist.
- similar on the eviction side, the rules are quite different: For vm_bind
we never tear down the vma, instead it's just moved to the list of evicted vma. Legacy vm have no need for all these additional lists, so another huge confusion.
- if the refcount is done correctly for vm_bind we wouldn't need the
tricky code in the bo close paths. Unfortunately legacy vm with relocations and softpin require that vma are only a weak reference, so that cannot be removed.
- there's also a ton of special cases for ggtt handling, like the
different views (for display, partial views for mmap), but also the gen2/3 alignment and padding requirements which vm_bind never needs.
I think the right thing here is to massively split the implementation behind some solid vm/vma abstraction, with a base class for vm and vma which _only_ has the pieces which both vm_bind and the legacy vm stuff need. But it's a bit tricky to get there. I think a workable path would be:
- Add a new base class to both i915_address_space and i915_vma, which
starts out empty.
- As vm_bind code lands, move things that vm_bind code needs into these
base classes
Ok
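As an illustration of the "empty base class that grows as vm_bind lands" idea, here is a minimal sketch in plain C using struct embedding. All names (vm_base, legacy_vm, bind_vm, vm_container_of) are hypothetical and do not exist in i915; the fields are placeholders chosen only to show where mode-specific state would live.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; none of these exist in i915. */

/* Recover the derived struct from a pointer to its embedded base. */
#define vm_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Base class: only state that both binding modes would need. */
struct vm_base {
	uint64_t total;      /* size of the address space in bytes */
	unsigned int nbound; /* number of live mappings */
};

/* Legacy-only state stays out of the base. */
struct legacy_vm {
	struct vm_base base;
	unsigned int lru_len; /* kernel-managed VA space needs an LRU */
};

/* vm_bind-only state likewise. */
struct bind_vm {
	struct vm_base base;
	unsigned int evicted_len; /* evicted mappings are listed, not torn down */
};

/* Shared code only ever sees the base. */
static void vm_base_account_bind(struct vm_base *vm)
{
	assert(vm != NULL);
	vm->nbound++;
}

/* Mode-specific code downcasts explicitly when it needs its own state. */
static struct bind_vm *to_bind_vm(struct vm_base *vm)
{
	return vm_container_of(vm, struct bind_vm, base);
}
```

The point of the layout is that shared helpers take a struct vm_base pointer and cannot reach LRU or eviction-list state by accident, which is the separation the review asks for.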
- The goal should be that these base classes are a stand-alone library
that other drivers could reuse. Like we've done with the buddy allocator, which first moved from i915-gem to i915-ttm, and which amd now moved to drm/ttm for reuse by amdgpu. Ideally other drivers interested in adding something like vm_bind should be involved from the start (or maybe the entire thing reused in amdgpu, they're looking at vk sparse binding support too or at least have perf issues I think).
- Locking must be the same across all implementations, otherwise it's
really not an abstraction. i915 screwed this up terribly by having different locking rules for ppgtt and ggtt, which is just nonsense.
- The legacy specific code needs to be extracted as much as possible and
shoved into separate files. In execbuf this means we need to get back to object centric flow, and the slowpaths need to become a lot simpler again (Maarten has cleaned up some of this, but there's still a silly amount of hacks in there with funny layering).
This also, we can do soon after vm_bind code lands right?
- I think if stuff like the vma eviction details (list movement and
locking and refcounting of the underlying object)
+These can be worked upon after initial vm_bind support is added.
I don't think that works, given how badly i915-gem team screwed up in other places. And those places had to be fixed by adopting shared code like ttm. Plus there's already a huge unfulfilled promise pending with the drm/sched conversion, i915-gem team is clearly deeply in the red here :-/
Hmmm ok. As I mentioned above, I need to look into how to remove i915_vma active reference tracking code from i915 driver. Wonder if there is any middle ground here like not using that in vm_bind mode?
Niranjana
Cheers, Daniel
+UAPI
+=====
+Uapi definition can be found here:
+.. kernel-doc:: Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/index.rst b/Documentation/gpu/rfc/index.rst
index 91e93a705230..7d10c36b268d 100644
--- a/Documentation/gpu/rfc/index.rst
+++ b/Documentation/gpu/rfc/index.rst
@@ -23,3 +23,7 @@ host such documentation:
 .. toctree::

     i915_scheduler.rst
+
+.. toctree::
+
+    i915_vm_bind.rst
-- 2.21.0.rc0.32.g243a4c7e27
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
VM_BIND and related uapi definitions
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
---
 Documentation/gpu/rfc/i915_vm_bind.h | 176 +++++++++++++++++++++++++++
 1 file changed, 176 insertions(+)
 create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h

diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h
new file mode 100644
index 000000000000..80f00ee6c8a1
--- /dev/null
+++ b/Documentation/gpu/rfc/i915_vm_bind.h
@@ -0,0 +1,176 @@
+/* SPDX-License-Identifier: MIT */
+/*
+ * Copyright © 2022 Intel Corporation
+ */
+
+/* VM_BIND feature availability through drm_i915_getparam */
+#define I915_PARAM_HAS_VM_BIND 57
+
+/* VM_BIND related ioctls */
+#define DRM_I915_GEM_VM_BIND 0x3d
+#define DRM_I915_GEM_VM_UNBIND 0x3e
+#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f
+
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind)
+#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+
+/**
+ * struct drm_i915_gem_vm_bind - VA to object/buffer mapping to [un]bind.
+ */
+struct drm_i915_gem_vm_bind {
+	/** vm to [un]bind */
+	__u32 vm_id;
+
+	/**
+	 * BO handle or file descriptor.
+	 * 'fd' value of -1 is reserved for system pages (SVM)
+	 */
+	union {
+		__u32 handle; /* For unbind, it is reserved and must be 0 */
+		__s32 fd;
+	};
+
+	/** VA start to [un]bind */
+	__u64 start;
+
+	/** Offset in object to [un]bind */
+	__u64 offset;
+
+	/** VA length to [un]bind */
+	__u64 length;
+
+	/** Flags */
+	__u64 flags;
+	/** Bind the mapping immediately instead of during next submission */
+#define I915_GEM_VM_BIND_IMMEDIATE (1 << 0)
+	/** Read-only mapping */
+#define I915_GEM_VM_BIND_READONLY (1 << 1)
+	/** Capture this mapping in the dump upon GPU error */
+#define I915_GEM_VM_BIND_CAPTURE (1 << 2)
+
+	/** Zero-terminated chain of extensions */
+	__u64 extensions;
+};
+
+/**
+ * struct drm_i915_vm_bind_ext_user_fence - Bind completion signaling extension.
+ */
+struct drm_i915_vm_bind_ext_user_fence {
+#define I915_VM_BIND_EXT_USER_FENCE 0
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/** User/Memory fence qword aligned process virtual address */
+	__u64 addr;
+
+	/** User/Memory fence value to be written after bind completion */
+	__u64 val;
+
+	/** Reserved for future extensions */
+	__u64 rsvd;
+};
+
+/**
+ * struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
+ * signaling extension.
+ *
+ * This extension allows user to attach a user fence (<addr, value> pair) to an
+ * execbuf to be signaled by the command streamer after the completion of 1st
+ * level batch, by writing the <value> at specified <addr> and triggering an
+ * interrupt.
+ * User can either poll for this user fence to signal or can also wait on it
+ * with i915_gem_wait_user_fence ioctl.
+ * This is very much useful for long running contexts where waiting on dma-fence
+ * by user (like i915_gem_wait ioctl) is not supported.
+ */
+struct drm_i915_gem_execbuffer_ext_user_fence {
+#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 0
+	/** @base: Extension link. See struct i915_user_extension. */
+	struct i915_user_extension base;
+
+	/**
+	 * User/Memory fence qword aligned GPU virtual address.
+	 * Address has to be a valid GPU virtual address at the time of
+	 * 1st level batch completion.
+	 */
+	__u64 addr;
+
+	/**
+	 * User/Memory fence Value to be written to above address
+	 * after 1st level batch completes.
+	 */
+	__u64 value;
+
+	/** Reserved for future extensions */
+	__u64 rsvd;
+};
+
+struct drm_i915_gem_vm_control {
+/** Flag to opt-in for VM_BIND mode of binding during VM creation */
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0)
+};
+
+
+struct drm_i915_gem_create_ext {
+/** Extension to make the object private to a specified VM */
+#define I915_GEM_CREATE_EXT_VM_PRIVATE 2
+};
+
+
+struct prelim_drm_i915_gem_context_create_ext {
+/** Flag to declare context as long running */
+#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
+};
+
+/**
+ * struct drm_i915_gem_wait_user_fence
+ *
+ * Wait on user/memory fence. User/Memory fence can be woken up either by,
+ *    1. GPU context indicated by 'ctx_id', or,
+ *    2. Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
+ *       'ctx_id' is ignored when this flag is set.
+ *
+ * Wakeup when below condition is true.
+ * (*addr & MASK) OP (VALUE & MASK)
+ *
+ */
+struct drm_i915_gem_wait_user_fence {
+	/** @base: Extension link. See struct i915_user_extension. */
+	__u64 extensions;
+
+	/** User/Memory fence address */
+	__u64 addr;
+
+	/** Id of the Context which will signal the fence. */
+	__u32 ctx_id;
+
+	/** Wakeup condition operator */
+	__u16 op;
+#define I915_UFENCE_WAIT_EQ 0
+#define I915_UFENCE_WAIT_NEQ 1
+#define I915_UFENCE_WAIT_GT 2
+#define I915_UFENCE_WAIT_GTE 3
+#define I915_UFENCE_WAIT_LT 4
+#define I915_UFENCE_WAIT_LTE 5
+#define I915_UFENCE_WAIT_BEFORE 6
+#define I915_UFENCE_WAIT_AFTER 7
+
+	/** Flags */
+	__u16 flags;
+#define I915_UFENCE_WAIT_SOFT 0x1
+#define I915_UFENCE_WAIT_ABSTIME 0x2
+
+	/** Wakeup value */
+	__u64 value;
+
+	/** Wakeup mask */
+	__u64 mask;
+#define I915_UFENCE_WAIT_U8 0xffu
+#define I915_UFENCE_WAIT_U16 0xffffu
+#define I915_UFENCE_WAIT_U32 0xfffffffful
+#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull
+
+	/** Timeout */
+	__s64 timeout;
+};
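To make the partial-binding and aliasing use cases concrete, the following is a minimal userspace sketch of how the fields of the proposed struct drm_i915_gem_vm_bind might be filled. It uses a local mirror of the RFC struct (rfc_vm_bind) rather than a released uapi header, and make_partial_bind() is a hypothetical helper; no ioctl is issued, and alignment requirements are deliberately not modelled because the RFC does not state them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the RFC struct; not a released uapi header. */
struct rfc_vm_bind {
	uint32_t vm_id;
	union {
		uint32_t handle;
		int32_t fd;
	};
	uint64_t start;      /* GPU virtual address of the mapping */
	uint64_t offset;     /* offset into the buffer object */
	uint64_t length;     /* length of the mapping in bytes */
	uint64_t flags;
	uint64_t extensions; /* zero-terminated chain, 0 if none */
};

/* Flag values as proposed in the RFC. */
#define RFC_VM_BIND_IMMEDIATE (1ull << 0)
#define RFC_VM_BIND_READONLY  (1ull << 1)
#define RFC_VM_BIND_CAPTURE   (1ull << 2)

/*
 * Hypothetical helper: describe a mapping of 'length' bytes of the object
 * 'handle', starting at 'bo_offset', at GPU virtual address 'va'.
 */
static struct rfc_vm_bind make_partial_bind(uint32_t vm_id, uint32_t handle,
					    uint64_t va, uint64_t bo_offset,
					    uint64_t length, uint64_t flags)
{
	struct rfc_vm_bind bind;

	assert(length != 0);
	memset(&bind, 0, sizeof(bind)); /* unused fields must be zero */
	bind.vm_id = vm_id;
	bind.handle = handle;
	bind.start = va;
	bind.offset = bo_offset;
	bind.length = length;
	bind.flags = flags;
	return bind;
}
```

Calling the helper twice with the same handle and offset but different va values describes aliasing: two virtual ranges backed by the same pages of one object.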
On Mon, Mar 07, 2022 at 12:31:46PM -0800, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 176 +++++++++++++++++++++++++++
Maybe as the top level comment: The point of documenting uapi isn't to just spell out all the fields, but to define _how_ and _why_ things work. This part is completely missing from these docs here.
1 file changed, 176 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..80f00ee6c8a1 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h
You need to include this somewhere so it's rendered, see the previous examples.
@@ -0,0 +1,176 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/* VM_BIND feature availability through drm_i915_getparam */ +#define I915_PARAM_HAS_VM_BIND 57
Needs to be kernel-docified, which means we need a prep patch that fixes up the existing mess.
+/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+/**
- struct drm_i915_gem_vm_bind - VA to object/buffer mapping to [un]bind.
Both binding and unbinding need to specify in excruciating detail what happens if there's overlaps (existing mappings, or unmapping a range which has no mapping, or only partially full of maps or different objects) and fun stuff like that.
- */
+struct drm_i915_gem_vm_bind {
- /** vm to [un]bind */
- __u32 vm_id;
- /**
* BO handle or file descriptor.
* 'fd' value of -1 is reserved for system pages (SVM)
*/
- union {
__u32 handle; /* For unbind, it is reserved and must be 0 */
I think it'd be a lot cleaner if we do a bind and an unbind struct for these, instead of mixing it up.
Also I thought mesa requested to be able to unmap an object from a vm without a range. Has that been dropped, and confirmed to not be needed.
__s32 fd;
If we don't need it right away then don't add it yet. If it's planned to be used then it needs to be documented, but I kinda have no idea why you'd need an fd for svm?
- }
- /** VA start to [un]bind */
- __u64 start;
- /** Offset in object to [un]bind */
- __u64 offset;
- /** VA length to [un]bind */
- __u64 length;
- /** Flags */
- __u64 flags;
- /** Bind the mapping immediately instead of during next submission */
This aint kerneldoc.
Also this needs to specify in much more detail what exactly this means, and also how it interacts with execbuf.
So the patch here probably needs to include the missing pieces on the execbuf side of things. Like how does execbuf work when it's used with a vm_bind managed vm? That means: - document the pieces that are there - then add a patch to document how that all changes with vm_bind
And do that for everything execbuf can do.
+#define I915_GEM_VM_BIND_IMMEDIATE (1 << 0)
- /** Read-only mapping */
+#define I915_GEM_VM_BIND_READONLY (1 << 1)
- /** Capture this mapping in the dump upon GPU error */
+#define I915_GEM_VM_BIND_CAPTURE (1 << 2)
- /** Zero-terminated chain of extensions */
- __u64 extensions;
+};
+/**
- struct drm_i915_vm_bind_ext_user_fence - Bind completion signaling extension.
- */
+struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCE 0
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** User/Memory fence qword aligned process virtual address */
- __u64 addr;
- /** User/Memory fence value to be written after bind completion */
- __u64 val;
- /** Reserved for future extensions */
- __u64 rsvd;
+};
+/**
- struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
- signaling extension.
- This extension allows user to attach a user fence (<addr, value> pair) to an
- execbuf to be signaled by the command streamer after the completion of 1st
- level batch, by writing the <value> at specified <addr> and triggering an
- interrupt.
- User can either poll for this user fence to signal or can also wait on it
- with i915_gem_wait_user_fence ioctl.
- This is very much useful for long running contexts where waiting on dma-fence
- by user (like i915_gem_wait ioctl) is not supported.
- */
+struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 0
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
* User/Memory fence qword aligned GPU virtual address.
* Address has to be a valid GPU virtual address at the time of
* 1st level batch completion.
*/
- __u64 addr;
- /**
* User/Memory fence Value to be written to above address
* after 1st level batch completes.
*/
- __u64 value;
- /** Reserved for future extensions */
- __u64 rsvd;
+};
+struct drm_i915_gem_vm_control { +/** Flag to opt-in for VM_BIND mode of binding during VM creation */
This is very confusingly documented and I have no idea how you're going to use an empty extension. Also it's not kerneldoc.
Please check that the stuff you're creating renders properly in the html output.
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0) +};
+struct drm_i915_gem_create_ext { +/** Extension to make the object private to a specified VM */ +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2
Why 2?
Also this all needs to be documented what it precisely means.
+};
+struct prelim_drm_i915_gem_context_create_ext { +/** Flag to declare context as long running */ +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
The compute mode context, again including full impact on execbuf, is not documented here. This also means any gaps in the context uapi documentation need to be filled first in prep patches.
Also memory fences are extremely tricky, we need to specify in detail when they're allowed to be used and when not. This needs to reference the relevant sections from the dma-fence docs.
+};
+/**
- struct drm_i915_gem_wait_user_fence
- Wait on user/memory fence. User/Memory fence can be woken up either by,
- GPU context indicated by 'ctx_id', or,
- Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
'ctx_id' is ignored when this flag is set.
- Wakeup when below condition is true.
- (*addr & MASK) OP (VALUE & MASK)
- */
+~struct drm_i915_gem_wait_user_fence {
- /** @base: Extension link. See struct i915_user_extension. */
- __u64 extensions;
- /** User/Memory fence address */
- __u64 addr;
- /** Id of the Context which will signal the fence. */
- __u32 ctx_id;
- /** Wakeup condition operator */
- __u16 op;
+#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7
- /** Flags */
- __u16 flags;
+#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2
- /** Wakeup value */
- __u64 value;
- /** Wakeup mask */
- __u64 mask;
+#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull
Do we really need all these flags, and does the hw really support all the combinations? Anything the hw doesn't support in MI_SEMAPHORE is pretty much useless as a umf (userspace memory fence) mode.
- /** Timeout */
Needs to specify the clock source. -Daniel
- __s64 timeout;
+};
2.21.0.rc0.32.g243a4c7e27
On Wed, Mar 30, 2022 at 02:51:41PM +0200, Daniel Vetter wrote:
On Mon, Mar 07, 2022 at 12:31:46PM -0800, Niranjana Vishwanathapura wrote:
VM_BIND and related uapi definitions
Signed-off-by: Niranjana Vishwanathapura niranjana.vishwanathapura@intel.com
Documentation/gpu/rfc/i915_vm_bind.h | 176 +++++++++++++++++++++++++++
Maybe as the top level comment: The point of documenting uapi isn't to just spell out all the fields, but to define _how_ and _why_ things work. This part is completely missing from these docs here.
Thanks Daniel,
Some of the documentation is in the rst file. Ok, will add documentation here on _how_ and _why_.
1 file changed, 176 insertions(+) create mode 100644 Documentation/gpu/rfc/i915_vm_bind.h
diff --git a/Documentation/gpu/rfc/i915_vm_bind.h b/Documentation/gpu/rfc/i915_vm_bind.h new file mode 100644 index 000000000000..80f00ee6c8a1 --- /dev/null +++ b/Documentation/gpu/rfc/i915_vm_bind.h
You need to include this somewhere so it's rendered, see the previous examples.
Looking at previous examples, my understanding is that this is just a documentation file at this point, which goes into the Documentation/gpu/rfc folder and will have to be removed later once the actual uapi changes land in include/uapi/drm/i915_drm.h. Let me know if that is incorrect and needs to change.
@@ -0,0 +1,176 @@ +/* SPDX-License-Identifier: MIT */ +/*
- Copyright © 2022 Intel Corporation
- */
+/* VM_BIND feature availability through drm_i915_getparam */ +#define I915_PARAM_HAS_VM_BIND 57
Needs to be kernel-docified, which means we need a prep patch that fixes up the existing mess.
Ok on kernel-doc, but as mentioned above, I am not sure we need a prep patch that fixes up other existing fields at this point.
+/* VM_BIND related ioctls */ +#define DRM_I915_GEM_VM_BIND 0x3d +#define DRM_I915_GEM_VM_UNBIND 0x3e +#define DRM_I915_GEM_WAIT_USER_FENCE 0x3f
+#define DRM_IOCTL_I915_GEM_VM_BIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_BIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_VM_UNBIND DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_VM_UNBIND, struct drm_i915_gem_vm_bind) +#define DRM_IOCTL_I915_GEM_WAIT_USER_FENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_I915_GEM_WAIT_USER_FENCE, struct drm_i915_gem_wait_user_fence)
+/**
- struct drm_i915_gem_vm_bind - VA to object/buffer mapping to [un]bind.
Both binding and unbinding need to specify in excruciating detail what happens if there's overlaps (existing mappings, or unmapping a range which has no mapping, or only partially full of maps or different objects) and fun stuff like that.
Ok, will add those details.
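The overlap cases listed in the review (existing mappings, unmapping an empty range, partially mapped ranges) can at least be enumerated mechanically. The sketch below classifies a requested range against one existing mapping; it only detects the case and says nothing about what the ioctl should do in each, since that policy is exactly what the thread says is still unspecified. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of a request against one existing mapping. */
enum range_relation {
	RANGE_DISJOINT, /* no byte in common */
	RANGE_EXACT,    /* same start and length */
	RANGE_INSIDE,   /* request lies strictly within the mapping's bounds */
	RANGE_COVERS,   /* request contains the whole mapping and more */
	RANGE_PARTIAL,  /* request straddles one edge of the mapping */
};

struct va_range {
	uint64_t start;
	uint64_t length; /* in bytes, must be non-zero */
};

/* Last byte covered by the range; avoids overflow at the top of the VA space. */
static uint64_t va_last(struct va_range r)
{
	assert(r.length != 0);
	assert(r.start <= UINT64_MAX - (r.length - 1));
	return r.start + (r.length - 1);
}

static enum range_relation classify(struct va_range req, struct va_range cur)
{
	uint64_t req_last = va_last(req);
	uint64_t cur_last = va_last(cur);

	if (req_last < cur.start || cur_last < req.start)
		return RANGE_DISJOINT;
	if (req.start == cur.start && req_last == cur_last)
		return RANGE_EXACT;
	if (req.start >= cur.start && req_last <= cur_last)
		return RANGE_INSIDE;
	if (req.start <= cur.start && req_last >= cur_last)
		return RANGE_COVERS;
	return RANGE_PARTIAL;
}
```

Documentation for bind and unbind would then need one defined outcome per enumerator, for both the same-object and different-object variants.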
- */
+struct drm_i915_gem_vm_bind {
- /** vm to [un]bind */
- __u32 vm_id;
- /**
* BO handle or file descriptor.
* 'fd' value of -1 is reserved for system pages (SVM)
*/
- union {
__u32 handle; /* For unbind, it is reserved and must be 0 */
I think it'd be a lot cleaner if we do a bind and an unbind struct for these, instead of mixing it up.
Ok
Also I thought mesa requested to be able to unmap an object from a vm without a range. Has that been dropped, and confirmed to not be needed.
Hmm... I think it was the other way around, i.e., to unmap with a range in the vm but without an object. We already support that.
__s32 fd;
If we don't need it right away then don't add it yet. If it's planned to be used then it needs to be documented, but I kinda have no idea why you'd need an fd for svm?
It is not required for SVM; it was intended for future expansions, and '-1' was reserved for SVM. Ok, will remove it for now.
- }
- /** VA start to [un]bind */
- __u64 start;
- /** Offset in object to [un]bind */
- __u64 offset;
- /** VA length to [un]bind */
- __u64 length;
- /** Flags */
- __u64 flags;
- /** Bind the mapping immediately instead of during next submission */
This aint kerneldoc.
Also this needs to specify in much more detail what exactly this means, and also how it interacts with execbuf.
Ok
So the patch here probably needs to include the missing pieces on the execbuf side of things. Like how does execbuf work when it's used with a vm_bind managed vm? That means:
- document the pieces that are there
- then add a patch to document how that all changes with vm_bind
Hmm, I am a bit confused. The current execbuff handling documentation is in i915_gem_execbuffer.c. I am not sure how to update it in this design RFC patch. With VM_BIND support, we only support vm_bind vmas in the execbuff, and based on comments on the other patch in this series, we probably should not allow any execlist entries in vm_bind mode (no implicit syncing, and use an extension for the batch address). Maybe I can update the rst file in this series with this information for now. Thoughts?
And do that for everything execbuf can do.
+#define I915_GEM_VM_BIND_IMMEDIATE (1 << 0)
- /** Read-only mapping */
+#define I915_GEM_VM_BIND_READONLY (1 << 1)
- /** Capture this mapping in the dump upon GPU error */
+#define I915_GEM_VM_BIND_CAPTURE (1 << 2)
- /** Zero-terminated chain of extensions */
- __u64 extensions;
+};
+/**
- struct drm_i915_vm_bind_ext_user_fence - Bind completion signaling extension.
- */
+struct drm_i915_vm_bind_ext_user_fence { +#define I915_VM_BIND_EXT_USER_FENCE 0
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /** User/Memory fence qword aligned process virtual address */
- __u64 addr;
- /** User/Memory fence value to be written after bind completion */
- __u64 val;
- /** Reserved for future extensions */
- __u64 rsvd;
+};
+/**
- struct drm_i915_gem_execbuffer_ext_user_fence - First level batch completion
- signaling extension.
- This extension allows user to attach a user fence (<addr, value> pair) to an
- execbuf to be signaled by the command streamer after the completion of 1st
- level batch, by writing the <value> at specified <addr> and triggering an
- interrupt.
- User can either poll for this user fence to signal or can also wait on it
- with i915_gem_wait_user_fence ioctl.
- This is very much useful for long running contexts where waiting on dma-fence
- by user (like i915_gem_wait ioctl) is not supported.
- */
+struct drm_i915_gem_execbuffer_ext_user_fence { +#define DRM_I915_GEM_EXECBUFFER_EXT_USER_FENCE 0
- /** @base: Extension link. See struct i915_user_extension. */
- struct i915_user_extension base;
- /**
* User/Memory fence qword aligned GPU virtual address.
* Address has to be a valid GPU virtual address at the time of
* 1st level batch completion.
*/
- __u64 addr;
- /**
* User/Memory fence Value to be written to above address
* after 1st level batch completes.
*/
- __u64 value;
- /** Reserved for future extensions */
- __u64 rsvd;
+};
+struct drm_i915_gem_vm_control { +/** Flag to opt-in for VM_BIND mode of binding during VM creation */
This is very confusingly documented and I have no idea how you're going to use an empty extension. Also it's not kerneldoc.
Yah, I was also wondering how to define new flag bits for the flags in structures already defined in i915_drm.h. Ok, will just define the flag bit definition here and mention the structure field in the documentation part.
Please check that the stuff you're creating renders properly in the html output.
+#define I915_VM_CREATE_FLAGS_USE_VM_BIND (1 << 0) +};
+struct drm_i915_gem_create_ext { +/** Extension to make the object private to a specified VM */ +#define I915_GEM_CREATE_EXT_VM_PRIVATE 2
Why 2?
Also this all needs to be documented what it precisely means.
Because 0 and 1 are already taken (I915_GEM_CREATE_EXT_* in i915_drm.h). Ok, will add required documentation.
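Since the extension name is what distinguishes entries in a zero-terminated chain, a small sketch of walking such a chain may help show why each name must be unique per ioctl. The struct below is a local mirror of struct i915_user_extension from i915_drm.h; find_ext() and the iteration budget are hypothetical, and EXT_VM_PRIVATE simply reuses the value 2 proposed in this RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct i915_user_extension (see i915_drm.h). */
struct user_ext {
	uint64_t next_extension; /* user pointer to the next entry, 0 ends the chain */
	uint32_t name;           /* identifies the extension within one ioctl */
	uint32_t flags;          /* undefined bits must be zero */
	uint32_t rsvd[4];        /* reserved, must be zero */
};

#define EXT_VM_PRIVATE 2u /* value proposed in the RFC, not a released define */

/*
 * Hypothetical helper: return the first extension called 'name', or NULL.
 * The budget of 64 entries is an arbitrary guard against cyclic chains.
 */
static const struct user_ext *find_ext(uint64_t chain, uint32_t name)
{
	for (unsigned int budget = 64; chain != 0 && budget > 0; budget--) {
		const struct user_ext *ext =
			(const struct user_ext *)(uintptr_t)chain;

		if (ext->name == name)
			return ext;
		chain = ext->next_extension;
	}
	return NULL;
}
```

In a real extension struct, struct i915_user_extension is the first member, so the pointer returned by such a walk can be cast to the containing extension type once the name has been matched.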
+};
+struct prelim_drm_i915_gem_context_create_ext { +/** Flag to declare context as long running */ +#define I915_CONTEXT_CREATE_FLAGS_LONG_RUNNING (1u << 2)
The compute mode context, again including full impact on execbuf, is not documented here. This also means any gaps in the context uapi documentation need to be filled first in prep patches.
Ok, will add documentation here. As mentioned above, I guess the prep patch will come later, once this RFC patch gets accepted?
Also memory fences are extremely tricky, we need to specify in detail when they're allowed to be used and when not. This needs to reference the relevant sections from the dma-fence docs.
Ok
+};
+/**
- struct drm_i915_gem_wait_user_fence
- Wait on user/memory fence. User/Memory fence can be woken up either by,
- GPU context indicated by 'ctx_id', or,
- Kernel driver async worker upon I915_UFENCE_WAIT_SOFT.
'ctx_id' is ignored when this flag is set.
- Wakeup when below condition is true.
- (*addr & MASK) OP (VALUE & MASK)
- */
+~struct drm_i915_gem_wait_user_fence {
- /** @base: Extension link. See struct i915_user_extension. */
- __u64 extensions;
- /** User/Memory fence address */
- __u64 addr;
- /** Id of the Context which will signal the fence. */
- __u32 ctx_id;
- /** Wakeup condition operator */
- __u16 op;
+#define I915_UFENCE_WAIT_EQ 0 +#define I915_UFENCE_WAIT_NEQ 1 +#define I915_UFENCE_WAIT_GT 2 +#define I915_UFENCE_WAIT_GTE 3 +#define I915_UFENCE_WAIT_LT 4 +#define I915_UFENCE_WAIT_LTE 5 +#define I915_UFENCE_WAIT_BEFORE 6 +#define I915_UFENCE_WAIT_AFTER 7
- /** Flags */
- __u16 flags;
+#define I915_UFENCE_WAIT_SOFT 0x1 +#define I915_UFENCE_WAIT_ABSTIME 0x2
- /** Wakeup value */
- __u64 value;
- /** Wakeup mask */
- __u64 mask;
+#define I915_UFENCE_WAIT_U8 0xffu +#define I915_UFENCE_WAIT_U16 0xffffu +#define I915_UFENCE_WAIT_U32 0xfffffffful +#define I915_UFENCE_WAIT_U64 0xffffffffffffffffull
Do we really need all these flags, and does the hw really support all the combinations? Anything the hw doesn't support in MI_SEMAPHORE is pretty much useless as a umf (userspace memory fence) mode.
Hmm... The PIPE_CONTROL/MI_FLUSH instructions (used for wakeup) support 64-bit writes. The gem_wait_user_fence ioctl wakeup condition is
(*addr & MASK) OP (VALUE & MASK)
so these values give the user options to configure the wakeup.
The MI_SEMAPHORE seems to only support 32-bit value check for wakeup. But that is different from the above gem_wait_user_fence ioctl wakeup.
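For reference, the wakeup condition (*addr & MASK) OP (VALUE & MASK) can be written out as a small C function. This is a sketch under two assumptions that the RFC does not state: comparisons are unsigned, and only the six relational operators are modelled. The BEFORE and AFTER operators are reported as unsupported because their semantics are not defined in the text. The function and enum names are hypothetical; the numeric operator values are the ones proposed in the RFC.

```c
#include <assert.h>
#include <stdint.h>

/* Operator values as proposed in the RFC (I915_UFENCE_WAIT_*). */
enum ufence_op {
	UFENCE_EQ = 0,
	UFENCE_NEQ = 1,
	UFENCE_GT = 2,
	UFENCE_GTE = 3,
	UFENCE_LT = 4,
	UFENCE_LTE = 5,
};

/*
 * Evaluate (current & mask) OP (value & mask), where 'current' is the
 * qword read from the fence address.
 * Returns 1 if the waiter should wake, 0 if not, -1 for operators this
 * sketch does not model. Unsigned comparison is an assumption.
 */
static int ufence_cond(uint64_t current, uint16_t op, uint64_t value,
		       uint64_t mask)
{
	uint64_t lhs = current & mask;
	uint64_t rhs = value & mask;

	switch (op) {
	case UFENCE_EQ:
		return lhs == rhs;
	case UFENCE_NEQ:
		return lhs != rhs;
	case UFENCE_GT:
		return lhs > rhs;
	case UFENCE_GTE:
		return lhs >= rhs;
	case UFENCE_LT:
		return lhs < rhs;
	case UFENCE_LTE:
		return lhs <= rhs;
	default:
		return -1;
	}
}
```

With an 8-bit mask (0xff), a fence value of 0x1234 satisfies EQ against 0xff34, because only the low byte takes part in the comparison.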
- /** Timeout */
Needs to specify the clock source.
Ok,
Niranjana
-Daniel
- __s64 timeout;
+};
2.21.0.rc0.32.g243a4c7e27
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
dri-devel@lists.freedesktop.org