# Host1x/TegraDRM UAPI proposal
This is a proposal for a stable UAPI for Host1x and TegraDRM, to replace the current TegraDRM UAPI that is behind `STAGING` and quite obsolete in many ways.
I haven't written any implementation yet -- I'll do that once there is some agreement on the high-level design.
Current open items:
* The syncpoint UAPI allows userspace to create sync_file FDs with arbitrary syncpoint fences. dma_fence code currently seems to assume all fences will be signaled, which would not necessarily be the case with this interface.
* Previously present GEM IOCTLs (GEM_CREATE, GEM_MMAP) are not present. Not sure if they are still needed.
## Introduction to the hardware
Tegra Host1x is a hardware block providing the following capabilities:
* Syncpoints, a unified whole-system synchronization primitive, allowing synchronization of work between graphics, compute and multimedia engines, CPUs (including cross-VM synchronization), and devices on the PCIe bus, without incurring CPU overhead.
* Channels, a command DMA mechanism allowing asynchronous programming of various engines, integrating with syncpoints.
* Hardware virtualization support for syncpoints and channels (on Tegra186 and newer).
This proposal defines APIs for userspace access to syncpoints and channels. Kernel drivers can additionally use syncpoints and channels internally, providing other userspace interfaces (e.g. V4L2).
Syncpoint and channel interfaces are split into separate parts, as syncpoints are useful as a system synchronization primitive even without using the engine drivers provided through TegraDRM. For example, a computer vision pipeline consisting of video capture, CPU processing and GPU processing would not necessarily use the engines provided by TegraDRM. See the Example workflows section for more details.
## Syncpoint interface
Syncpoints are a set of 32-bit values providing the following operations:
* Atomically increment the value by one
* Read the current value
* Wait until the value reaches a specified threshold. For waiting, the 32-bit value space is treated modulo 2^32; e.g. if the current value is 0xffffffff, then value 0x0 is considered to be one increment in the future.
Each syncpoint is identified by a system-global ID in the range [0, number of syncpoints supported by the hardware). The entire system has read-only access to all syncpoints based on their ID.
Syncpoints are managed through the device node /dev/host1x provided by the gpu/host1x driver.
### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
Allocates a free syncpoint, returning a file descriptor representing it. Only the owner of the file descriptor is allowed to mutate the value of the syncpoint.
```
struct host1x_ctrl_allocate_syncpoint {
    /**
     * @fd:
     *
     * [out] New file descriptor representing the allocated syncpoint.
     */
    __s32 fd;

    __u32 reserved[3];
};
```
### IOCTL HOST1X_SYNCPOINT_INFO (on syncpoint file descriptor)
Allows retrieval of system-global syncpoint ID corresponding to syncpoint.
Use cases:
* Passing the ID to other system components that identify syncpoints by ID
* Debugging and testing
```
struct host1x_syncpoint_info {
    /**
     * @id:
     *
     * [out] System-global ID of the syncpoint.
     */
    __u32 id;

    __u32 reserved[3];
};
```
### IOCTL HOST1X_SYNCPOINT_INCREMENT (on syncpoint file descriptor)
Allows incrementing of the syncpoint value.
Use cases:
* Signalling work completion when executing a pipeline step on the CPU
* Debugging and testing
```
struct host1x_syncpoint_increment {
    /**
     * @count:
     *
     * [in] Number of times to increment the syncpoint. The value can
     * be observed at in-between values, but increments are atomic.
     */
    __u32 count;
};
```
### IOCTL HOST1X_READ_SYNCPOINT (on /dev/host1x)
Read the value of a syncpoint based on its ID.
Use cases:
* Allows more fine-grained tracking of task progression for debugging purposes
```
struct host1x_ctrl_read_syncpoint {
    /**
     * @id:
     *
     * [in] ID of syncpoint to read.
     */
    __u32 id;

    /**
     * @value:
     *
     * [out] Value of the syncpoint.
     */
    __u32 value;
};
```
### IOCTL HOST1X_CREATE_FENCE (on /dev/host1x)
Creates a new SYNC_FILE fence file descriptor for the specified syncpoint ID and threshold.
Use cases:
* Creating a fence when receiving an ID/threshold pair from another system component
* Creating a postfence when executing a pipeline step on the CPU
* Creating a postfence when executing a pipeline step controlled by userspace (e.g. GPU userspace submission)
```
struct host1x_ctrl_create_fence {
    /**
     * @id:
     *
     * [in] ID of the syncpoint for which to create a fence.
     */
    __u32 id;

    /**
     * @threshold:
     *
     * [in] Threshold value for the fence.
     */
    __u32 threshold;

    /**
     * @fence_fd:
     *
     * [out] New sync_file FD corresponding to the ID and threshold.
     */
    __s32 fence_fd;

    __u32 reserved[1];
};
```
### IOCTL HOST1X_GET_FENCE_INFO (on /dev/host1x)
Allows retrieval of the ID/threshold pairs corresponding to a SYNC_FILE fence or fence array.
Use cases:
* Debugging and testing
* Transmitting a fence to another system component requiring ID/threshold
* Getting the ID/threshold for a prefence when programming a pipeline step controlled by userspace (e.g. GPU userspace submission)
```
/* If set, corresponding fence is backed by Host1x syncpoints. */
#define HOST1X_CTRL_FENCE_INFO_SYNCPOINT_FENCE (1 << 0)

struct host1x_ctrl_fence_info {
    /**
     * @flags:
     *
     * [out] HOST1X_CTRL_FENCE_INFO flags.
     */
    __u32 flags;

    /**
     * @id:
     *
     * [out] ID of the syncpoint corresponding to this fence.
     * Only set if HOST1X_CTRL_FENCE_INFO_SYNCPOINT_FENCE is set.
     */
    __u32 id;

    /**
     * @threshold:
     *
     * [out] Signalling threshold of the fence.
     * Only set if HOST1X_CTRL_FENCE_INFO_SYNCPOINT_FENCE is set.
     */
    __u32 threshold;

    __u32 reserved[1];
};

struct host1x_ctrl_get_fence_info {
    /**
     * @fence_fd:
     *
     * [in] Syncpoint-backed sync_file FD for which to retrieve info.
     */
    __s32 fence_fd;

    /**
     * @num_fences:
     *
     * [in] Size of `fence_info` array in elements.
     * [out] Number of fences held by the FD.
     */
    __u32 num_fences;

    /**
     * @fence_info:
     *
     * [in] Pointer to array of 'struct host1x_ctrl_fence_info'
     * where info will be stored.
     */
    __u64 fence_info;

    __u32 reserved[1];
};
```
## Channel interface
### DRM_TEGRA_OPEN_CHANNEL
```
struct drm_tegra_open_channel {
    /**
     * @host1x_class:
     *
     * [in] Host1x class (engine) the channel will target.
     */
    __u32 host1x_class;

    /**
     * @flags:
     *
     * [in] Flags. Currently none are specified.
     */
    __u32 flags;

    /**
     * @channel_id:
     *
     * [out] Process-specific identifier corresponding to the opened
     * channel. Not the hardware channel ID.
     */
    __u32 channel_id;

    /**
     * @hardware_version:
     *
     * [out] Hardware version of the engine targeted by the channel.
     * Userspace can use this to select appropriate programming
     * sequences.
     */
    __u32 hardware_version;

    /**
     * @mode:
     *
     * [out] Mode the hardware is executing in. Some engines can be
     * configured with different firmware supporting different
     * functionality depending on the system configuration. This
     * value allows userspace to detect if the engine is configured
     * for the intended use case.
     */
    __u32 mode;

    __u32 reserved[3];
};
```
### DRM_TEGRA_CLOSE_CHANNEL
```
struct drm_tegra_close_channel {
    /**
     * @channel_id:
     *
     * [in] ID of channel to close.
     */
    __u32 channel_id;

    __u32 reserved[3];
};
```
### DRM_TEGRA_CHANNEL_MAP
Make memory accessible by the engine while executing work on the channel.
```
#define DRM_TEGRA_CHANNEL_MAP_READWRITE (1<<0)

struct drm_tegra_channel_map {
    /*
     * [in] ID of the channel to map memory for.
     */
    __u32 channel_id;

    /*
     * [in] GEM handle of the memory to map.
     */
    __u32 handle;

    /*
     * [in] Offset in the GEM handle of the memory area to map.
     *
     * Must be aligned by 4K.
     */
    __u64 offset;

    /*
     * [in] Length of the memory area to map in bytes.
     *
     * Must be aligned by 4K.
     */
    __u64 length;

    /*
     * [out] IOVA of the mapped memory. Userspace can use this IOVA
     * directly to refer to the memory, skipping relocations.
     * Only available if hardware memory isolation is enabled.
     *
     * Will be set to 0xffff_ffff_ffff_ffff if unavailable.
     */
    __u64 iova;

    /*
     * [out] ID corresponding to the mapped memory, to be used for
     * relocations or unmapping.
     */
    __u32 mapping_id;

    /*
     * [in] Flags.
     */
    __u32 flags;

    __u32 reserved[6];
};
```
### DRM_TEGRA_CHANNEL_UNMAP
Unmap previously mapped memory. Userspace shall do this only after it has determined the channel will no longer access the memory.
```
struct drm_tegra_channel_unmap {
    /*
     * [in] ID of the mapping to remove.
     */
    __u32 mapping_id;

    __u32 reserved[3];
};
```
### DRM_TEGRA_CHANNEL_SUBMIT
Submit a job to the engine/class targeted by the channel.
```
struct drm_tegra_submit_syncpt_incr {
    /*
     * [in] Syncpoint FD of the syncpoint that the job will
     * increment.
     */
    __s32 syncpt_fd;

    /*
     * [in] Number of increments that the job will do.
     */
    __u32 num_incrs;

    /*
     * [out] Value the syncpoint will have once all increments have
     * executed.
     */
    __u32 fence_value;

    __u32 reserved[1];
};

/* Sets paddr/IOVA bit 39 on T194 to enable MC swizzling */
#define DRM_TEGRA_SUBMIT_RELOCATION_BLOCKLINEAR (1<<0)

struct drm_tegra_submit_relocation {
    /* [in] Index of GATHER or GATHER_UPTR command in commands. */
    __u32 gather_command_index;

    /*
     * [in] Mapping handle (obtained through CHANNEL_MAP) of the memory
     * the address of which will be patched in.
     */
    __u32 mapping_id;

    /*
     * [in] Offset in the gather that will be patched.
     */
    __u64 gather_offset;

    /*
     * [in] Offset in the target buffer whose paddr/IOVA will be written
     * to the gather.
     */
    __u64 target_offset;

    /*
     * [in] Number of bits the resulting address will be logically
     * shifted right before writing to the gather.
     */
    __u32 shift;

    __u32 reserved[1];
};

/* Command is an opcode gather from a GEM handle */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER 0
/* Command is an opcode gather from a user pointer */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR 1
/* Command is a wait for syncpt fence completion */
#define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCPT 2
/* Command is a wait for SYNC_FILE FD completion */
#define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNC_FILE 3
/* Command is a wait for DRM syncobj completion */
#define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCOBJ 4

/*
 * Allow the driver to skip command execution if the engine
 * was not accessed by another channel between submissions.
 */
#define DRM_TEGRA_SUBMIT_CONTEXT_SETUP (1<<0)

struct drm_tegra_submit_command {
    __u16 type;
    __u16 flags;

    union {
        struct {
            /* GEM handle */
            __u32 handle;

            /*
             * Offset into the GEM object in bytes.
             * Must be aligned by 4.
             */
            __u64 offset;

            /*
             * Length of the gather in bytes.
             * Must be aligned by 4.
             */
            __u64 length;
        } gather;

        struct {
            __u32 reserved[1];

            /*
             * Pointer to gather data.
             * Must be aligned by 4 bytes.
             */
            __u64 base;

            /*
             * Length of the gather in bytes.
             * Must be aligned by 4.
             */
            __u64 length;
        } gather_uptr;

        struct {
            __u32 syncpt_id;
            __u32 threshold;

            __u32 reserved[1];
        } wait_syncpt;

        struct {
            __s32 fd;
        } wait_sync_file;

        struct {
            __u32 handle;
        } wait_syncobj;
    };
};

#define DRM_TEGRA_SUBMIT_CREATE_POST_SYNC_FILE (1<<0)
#define DRM_TEGRA_SUBMIT_CREATE_POST_SYNCOBJ (1<<1)

struct drm_tegra_channel_submit {
    __u32 channel_id;
    __u32 flags;

    /**
     * [in] Timeout in microseconds after which the kernel may
     * consider the job to have hung and may reap it and
     * fast-forward its syncpoint increments.
     *
     * The value may be capped by the kernel.
     */
    __u32 timeout;

    __u32 num_syncpt_incrs;
    __u32 num_relocations;
    __u32 num_commands;

    __u64 syncpt_incrs;
    __u64 relocations;
    __u64 commands;

    /**
     * [out] Invalid, a SYNC_FILE FD or a syncobj handle, depending on
     * whether DRM_TEGRA_SUBMIT_CREATE_POST_SYNC_FILE,
     * DRM_TEGRA_SUBMIT_CREATE_POST_SYNCOBJ, or neither is passed.
     * Passing both is an error.
     *
     * The created fence object is signaled when all syncpoint
     * increments specified in `syncpt_incrs' have executed.
     */
    __u32 post_fence;

    __u32 reserved[3];
};
```
## Example workflows
### Image processing with TegraDRM/VIC
This example is a simple single-step operation using VIC through TegraDRM. For example, assume we have a dma-buf fd with an image we want to convert from YUV to RGB. This can occur for example as part of video decoding.
Syncpoint initialization
1. Allocate a syncpoint (HOST1X_ALLOCATE_SYNCPOINT)
   1. This is used to track VIC submission completion.
2. Retrieve the syncpoint ID (HOST1X_SYNCPOINT_INFO)
   1. The ID is required to program the increment as part of the submission.
Buffer allocation
3. Allocate memory for configuration buffers (DMA Heaps)
4. Import the configuration buffer dma-buf as a GEM object
5. Import the input image dma-buf as a GEM object
Channel initialization
6. Open a VIC channel (DRM_TEGRA_OPEN_CHANNEL)
7. Map buffers for access by VIC (DRM_TEGRA_CHANNEL_MAP)
8. Create a Host1x opcode buffer in userspace memory
   1. If buffer mapping returned an IOVA, that IOVA can be placed directly into the buffer. Otherwise, a relocation has to be passed as part of the submission.
   2. The buffer should contain a syncpoint increment for the syncpoint allocated earlier.
9. Submit the work, passing in the syncpoint file descriptor allocated in the beginning. The submit optionally returns a syncfd/syncobj that can be used to wait for submission completion.
   1. If more fine-grained syncpoint waiting is required, the 'fence_value' out-parameter of 'drm_tegra_submit_syncpt_incr' can be used in conjunction with HOST1X_CREATE_FENCE to create specific fences.
### Camera-GPU-CPU pipeline without TegraDRM
This example shows a pipeline with image input from a camera being processed using the GPU programmed from userspace, and then finally analyzed by the CPU. This kind of pipeline can occur e.g. as part of a computer vision use case.
Syncpoint initialization
1. The camera V4L2 driver allocates a syncpoint internally within the kernel.
2. For CPU job tracking, allocate a syncpoint as in "Image processing with TegraDRM/VIC".
3. For GPU job tracking, the GPU kernel driver would allocate a syncpoint and assign it such that the GPU channel can access it.
Camera pipeline step
4. Allocate a dma-buf to store the captured image.
5. Trigger camera capture and store the resulting sync_file FD.
GPU pipeline step
6. Use HOST1X_GET_FENCE_INFO to extract syncpoint ID/threshold pair(s) from the camera step's post-fence sync_file FD. If the sync_file FD is not backed by syncpoints, wait for the sync_file FD to signal otherwise (e.g. by polling it).
7. Use HOST1X_CREATE_FENCE to create a postfence that is signaled when the GPU step is complete.
8. Program the GPU to
   1. Wait for the syncpoint thresholds extracted from the camera postfence, if we were able to do so.
   2. Execute image processing on the GPU.
   3. Increment the GPU's job tracking syncpoint, causing the GPU post-fence FD to get signaled.
CPU pipeline step
9. Wait for the GPU's post-fence sync_file FD.
10. Map the dma-buf containing the image and retrieve the results.
In place of the GPU pipeline step, a similar workflow would apply for a step executed on the CPU.
--
thanks, Mikko
23.06.2020 15:09, Mikko Perttunen wrote: ...
- Previously present GEM IOCTLs (GEM_CREATE, GEM_MMAP) are not present.
Not sure if they are still needed.
Hello Mikko!
A quick comment for starters. Switching away from the Tegra-specific GEM IOCTLs to the dma-buf heaps is such a radical change! But I like it!
Previously I was curious about whether we could have multiple CMA regions (one shared/reusable and another private, for example) for the Tegra DRM driver, and at a quick glance the dma-buf heaps could be a nice solution that avoids re-inventing driver-specific things for that.
I'm instantly foreseeing these types of heaps:
1. Large reusable CMA heap.
2. Smaller non-reusable common CMA heap that could be used when allocation from the reusable CMA fails. Or vice versa, when we want to reduce the chance of blocking for a long time on allocation, for example.
3. Sparse (system) memory heap.
It's the first time I'm looking at the dma-buf heaps and it sounds like a very nice idea to utilize them!
The https://lkml.org/lkml/2019/11/18/787 says that the system-contiguous and carveout heaps were not added because they were not needed, but they will be needed for the Tegra20 drivers and for the case where the IOMMU is disabled. It shouldn't be difficult to add these types of heaps.
I'll continue to examine the dma-buf heaps in more detail.
On 6/24/20 11:55 PM, Dmitry Osipenko wrote:
23.06.2020 15:09, Mikko Perttunen wrote: ...
- Previously present GEM IOCTLs (GEM_CREATE, GEM_MMAP) are not present.
Not sure if they are still needed.
Hello Mikko!
A quick comment for starters. Switching away from the Tegra-specific GEM IOCTLs to the dma-buf heaps is such a radical change! But I like it!
Previously I was curious about whether we could have multiple CMA regions (one shared/reusable and another private, for example) for the Tegra DRM driver, and at a quick glance the dma-buf heaps could be a nice solution that avoids re-inventing driver-specific things for that.
I'm instantly foreseeing these types of heaps:
1. Large reusable CMA heap.
2. Smaller non-reusable common CMA heap that could be used when allocation from the reusable CMA fails. Or vice versa, when we want to reduce the chance of blocking for a long time on allocation, for example.
3. Sparse (system) memory heap.
It's the first time I'm looking at the dma-buf heaps and it sounds like a very nice idea to utilize them!
The https://lkml.org/lkml/2019/11/18/787 says that the system-contiguous and carveout heaps were not added because they were not needed, but they will be needed for the Tegra20 drivers and for the case where the IOMMU is disabled. It shouldn't be difficult to add these types of heaps.
I'll continue to examine the dma-buf heaps in more detail.
Sure, let's go with this for now. We can anyway add GEM IOCTLs later if they are needed, without any compatibility issues.
Mikko
23.06.2020 15:09, Mikko Perttunen wrote:
struct drm_tegra_submit_relocation {
    /* [in] Index of GATHER or GATHER_UPTR command in commands. */
    __u32 gather_command_index;

    /* [in] Mapping handle (obtained through CHANNEL_MAP) of the memory
     * the address of which will be patched in. */
    __u32 mapping_id;

    /* [in] Offset in the gather that will be patched. */
    __u64 gather_offset;

    /* [in] Offset in target buffer whose paddr/IOVA will be written
     * to the gather. */
    __u64 target_offset;

    /* [in] Number of bits the resulting address will be logically
     * shifted right before writing to gather. */
    __u32 shift;

    __u32 reserved[1];
};
We will also need read/write direction flag here for the DMA-reservations set up, it will be used for the implicit BO fencing by the job's scheduler.
On 6/25/20 1:33 AM, Dmitry Osipenko wrote:
23.06.2020 15:09, Mikko Perttunen wrote:
struct drm_tegra_submit_relocation {
    /* [in] Index of GATHER or GATHER_UPTR command in commands. */
    __u32 gather_command_index;

    /* [in] Mapping handle (obtained through CHANNEL_MAP) of the memory
     * the address of which will be patched in. */
    __u32 mapping_id;

    /* [in] Offset in the gather that will be patched. */
    __u64 gather_offset;

    /* [in] Offset in target buffer whose paddr/IOVA will be written
     * to the gather. */
    __u64 target_offset;

    /* [in] Number of bits the resulting address will be logically
     * shifted right before writing to gather. */
    __u32 shift;

    __u32 reserved[1];
};
We will also need read/write direction flag here for the DMA-reservations set up, it will be used for the implicit BO fencing by the job's scheduler.
Ideally on newer chips with context isolation, we no longer know what DMA-BUFs are being used by the job at submit time - they would just be pointers after being MAPped.
Do you know how other GPUs deal with DMA reservations - I expect separate MAP and SUBMIT steps would be pretty common? Do you have to pass the DMA-BUF to both steps (i.e. do IOMMU mapping during MAP, and manage reservations at SUBMIT)?
Mikko
25.06.2020 12:27, Mikko Perttunen wrote:
On 6/25/20 1:33 AM, Dmitry Osipenko wrote:
23.06.2020 15:09, Mikko Perttunen wrote:
struct drm_tegra_submit_relocation {
    /* [in] Index of GATHER or GATHER_UPTR command in commands. */
    __u32 gather_command_index;

    /* [in] Mapping handle (obtained through CHANNEL_MAP) of the memory
     * the address of which will be patched in. */
    __u32 mapping_id;

    /* [in] Offset in the gather that will be patched. */
    __u64 gather_offset;

    /* [in] Offset in target buffer whose paddr/IOVA will be written
     * to the gather. */
    __u64 target_offset;

    /* [in] Number of bits the resulting address will be logically
     * shifted right before writing to gather. */
    __u32 shift;

    __u32 reserved[1];
};
We will also need read/write direction flag here for the DMA-reservations set up, it will be used for the implicit BO fencing by the job's scheduler.
Ideally on newer chips with context isolation, we no longer know what DMA-BUFs are being used by the job at submit time - they would just be pointers after being MAPped.
The DMA-BUFs themselves shouldn't be needed, but GEMs should.
Do you know how other GPUs deal with DMA reservations - I expect separate MAP and SUBMIT steps would be pretty common?
I can't instantly recall what other drivers do, could be worthwhile to take a closer look.
Do you have to pass the DMA-BUF to both steps (i.e. do IOMMU mapping during MAP, and manage reservations at SUBMIT)?
Yes, this sounds good to me if DMA-BUF part is replaced with a GEM.
Please see my other reply regarding the MAP IOCTL where I'm suggesting to back mapping IDs with a GEM.
On 6/26/20 1:50 AM, Dmitry Osipenko wrote:
25.06.2020 12:27, Mikko Perttunen wrote:
On 6/25/20 1:33 AM, Dmitry Osipenko wrote:
23.06.2020 15:09, Mikko Perttunen wrote:
struct drm_tegra_submit_relocation {
    /* [in] Index of GATHER or GATHER_UPTR command in commands. */
    __u32 gather_command_index;

    /* [in] Mapping handle (obtained through CHANNEL_MAP) of the memory
     * the address of which will be patched in. */
    __u32 mapping_id;

    /* [in] Offset in the gather that will be patched. */
    __u64 gather_offset;

    /* [in] Offset in target buffer whose paddr/IOVA will be written
     * to the gather. */
    __u64 target_offset;

    /* [in] Number of bits the resulting address will be logically
     * shifted right before writing to gather. */
    __u32 shift;

    __u32 reserved[1];
};
We will also need read/write direction flag here for the DMA-reservations set up, it will be used for the implicit BO fencing by the job's scheduler.
Ideally on newer chips with context isolation, we no longer know what DMA-BUFs are being used by the job at submit time - they would just be pointers after being MAPped.
The DMA-BUFs themselves shouldn't be needed, but GEMs should.
Yes, I meant to say GEM instead of DMA-BUF.
Do you know how other GPUs deal with DMA reservations - I expect separate MAP and SUBMIT steps would be pretty common?
I can't instantly recall what other drivers do, could be worthwhile to take a closer look.
I'll see if I can find some similar situations in other drivers.
Mikko
Do you have to pass the DMA-BUF to both steps (i.e. do IOMMU mapping during MAP, and manage reservations at SUBMIT)?
Yes, this sounds good to me if DMA-BUF part is replaced with a GEM.
Please see my other reply regarding the MAP IOCTL where I'm suggesting to back mapping IDs with a GEM.
23.06.2020 15:09, Mikko Perttunen wrote:
/* Command is an opcode gather from a GEM handle */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER 0
/* Command is an opcode gather from a user pointer */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR 1
I'm a bit dubious about whether we really need to retain the non-UPTR variant. The memory-copying overhead is negligible because the cmdstream's data is usually hot in the CPU's cache.
IIRC, most (if not all) modern DRM drivers use the usrptr only for the cmdstream.
At least there is no real-world userspace example today that could benefit from a non-UPTR variant.
I'm suggesting to leave out the non-UPTR gather variant for now, keeping it in mind as a potential future extension of the submission UAPI. Any objections?
On 6/25/20 2:11 AM, Dmitry Osipenko wrote:
23.06.2020 15:09, Mikko Perttunen wrote:
/* Command is an opcode gather from a GEM handle */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER 0
/* Command is an opcode gather from a user pointer */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR 1
I'm a bit dubious about whether we really need to retain the non-UPTR variant. The memory-copying overhead is negligible because the cmdstream's data is usually hot in the CPU's cache.
IIRC, most (if not all) modern DRM drivers use the usrptr only for the cmdstream.
At least there is no real-world userspace example today that could benefit from a non-UPTR variant.
I'm suggesting to leave out the non-UPTR gather variant for now, keeping it in mind as a potential future extension of the submission UAPI. Any objections?
Sure, we should be able to drop it. Downstream userspace is using it, but we should be able to fix that. I was thinking that we could directly map the user pages and point the gather to them without copying - that way we wouldn't need to make DMA allocations inside the driver for every submit. (On early Tegras we could just copy into the pushbuffer, but that won't work for newer ones.)
Mikko
25.06.2020 12:16, Mikko Perttunen wrote:
On 6/25/20 2:11 AM, Dmitry Osipenko wrote:
23.06.2020 15:09, Mikko Perttunen wrote:
/* Command is an opcode gather from a GEM handle */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER 0
/* Command is an opcode gather from a user pointer */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR 1
I'm a bit dubious about whether we really need to retain the non-UPTR variant. The memory-copying overhead is negligible because the cmdstream's data is usually hot in the CPU's cache.
IIRC, most (if not all) modern DRM drivers use the usrptr only for the cmdstream.
At least there is no real-world userspace example today that could benefit from a non-UPTR variant.
I'm suggesting to leave out the non-UPTR gather variant for now, keeping it in mind as a potential future extension of the submission UAPI. Any objections?
Sure, we should be able to drop it. Downstream userspace is using it, but we should be able to fix that. I was thinking that we can directly map the user pages and point the gather to them without copying - that way we wouldn't need to make DMA allocations inside the driver for every submit.
We will need to create a Host1x DMA pool and then the dynamic allocations will be cheap. This is an implementation detail that we can discuss separately.
We will need the UPTR anyway for the older Tegras because we need to validate the cmdstreams, and it's much more efficient to copy from UPTR than from uncacheable memory.
The non-UPTR variant will be fine to add if you have a real-world example that demonstrates a noticeable performance difference.
Previously, I thought that there would be some perf difference if GR3D shaders were moved into the "insert-opcode" gather, but it was negligible once I implemented it, and it should be even more negligible on modern hardware.
(On early Tegras we could just copy into the pushbuffer but that won't work for newer ones).
Yes, we should copy the data into a gather and then push it into the channel's pushbuffer, just like it works with the current upstream driver.
Once all of the UAPI is settled, we'll also need to discuss the pushbuffer's implementation, because the current driver has some problems with it.
On 6/26/20 2:24 AM, Dmitry Osipenko wrote:
25.06.2020 12:16, Mikko Perttunen wrote:
On 6/25/20 2:11 AM, Dmitry Osipenko wrote:
23.06.2020 15:09, Mikko Perttunen wrote:
/* Command is an opcode gather from a GEM handle */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER 0
/* Command is an opcode gather from a user pointer */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR 1
I'm a bit dubious about whether we really need to retain the non-UPTR variant. The memory-copying overhead is negligible because the cmdstream's data is usually hot in the CPU's cache.
IIRC, most (if not all) modern DRM drivers use the usrptr only for the cmdstream.
At least there is no real-world userspace example today that could benefit from a non-UPTR variant.
I'm suggesting to leave out the non-UPTR gather variant for now, keeping it in mind as a potential future extension of the submission UAPI. Any objections?
Sure, we should be able to drop it. Downstream userspace is using it, but we should be able to fix that. I was thinking that we can directly map the user pages and point the gather to them without copying - that way we wouldn't need to make DMA allocations inside the driver for every submit.
We will need to create a Host1x DMA pool and then the dynamic allocations will be cheap. This is an implementation detail that we can discuss separately.
We will need the UPTR anyway for the older Tegras because we need to validate the cmdstreams, and it's much more efficient to copy from UPTR than from uncacheable memory.
The non-UPTR variant will be fine to add if you have a real-world example that demonstrates a noticeable performance difference.
Previously, I thought that there would be some perf difference if GR3D shaders were moved into the "insert-opcode" gather, but it was negligible once I implemented it, and it should be even more negligible on modern hardware.
(On early Tegras we could just copy into the pushbuffer but that won't work for newer ones).
Yes, we should copy the data into a gather and then push it into the channel's pushbuffer, just like it works with the current upstream driver.
Once all of the UAPI is settled, we'll also need to discuss the pushbuffer's implementation, because the current driver has some problems with it.
True, for earlier Tegras we'll need to copy anyway. So let's just implement copying for now, while making sure that extending to directly mapping pages will be possible later (don't know why it wouldn't be), and implement direct mapping or GEM gathers later if needed.
Mikko
23.06.2020 15:09, Mikko Perttunen wrote:
struct drm_tegra_channel_submit {
    __u32 channel_id;
    __u32 flags;

    /* [in] Timeout in microseconds after which the kernel may
     * consider the job to have hung and may reap it and
     * fast-forward its syncpoint increments.
     *
     * The value may be capped by the kernel. */
    __u32 timeout;

    __u32 num_syncpt_incrs;
    __u32 num_relocations;
    __u32 num_commands;

    __u64 syncpt_incrs;
    __u64 relocations;
    __u64 commands;
Do we really need to retain the multi-gather support? The code bloat (in both the userspace and kernel drivers) is very significant just for preparing and patching multi-buffer cmdstreams.
If userspace runs out of free space within the pushbuffer, then it should simply reallocate a larger pushbuffer.
I'm suggesting that we should have a single gather per-job, any objections?
25.06.2020 02:18, Dmitry Osipenko wrote:
23.06.2020 15:09, Mikko Perttunen wrote:
struct drm_tegra_channel_submit {
    __u32 channel_id;
    __u32 flags;

    /* [in] Timeout in microseconds after which the kernel may
     * consider the job to have hung and may reap it and
     * fast-forward its syncpoint increments.
     *
     * The value may be capped by the kernel. */
    __u32 timeout;

    __u32 num_syncpt_incrs;
    __u32 num_relocations;
    __u32 num_commands;

    __u64 syncpt_incrs;
    __u64 relocations;
    __u64 commands;
Do we really need to retain the multi-gather support? The code bloat (in both the userspace and kernel drivers) is very significant just for preparing and patching multi-buffer cmdstreams.
If userspace runs out of free space within the pushbuffer, then it should simply reallocate a larger pushbuffer.
I'm suggesting that we should have a single gather per-job, any objections?
Oh, I just recalled that the later Host1x versions do not allow switching class from a gather! Let me think a bit more about it..
23.06.2020 15:09, Mikko Perttunen wrote:
struct drm_tegra_channel_submit {
    __u32 channel_id;
    __u32 flags;

    /* [in] Timeout in microseconds after which the kernel may
     * consider the job to have hung and may reap it and
     * fast-forward its syncpoint increments.
     *
     * The value may be capped by the kernel. */
    __u32 timeout;
What about renaming this to timeout_us? For clarity.
    __u32 num_syncpt_incrs;
    __u32 num_relocations;
    __u32 num_commands;

    __u64 syncpt_incrs;
    __u64 relocations;
    __u64 commands;
Let's also add a "_ptr" postfix to all usrptr names, again for clarity.
It's always nice to have meaningful names :)
On 6/25/20 2:23 AM, Dmitry Osipenko wrote:
Yep, good point. I'll fix this for the next revision :)
Mikko
On 23.06.2020 15:09, Mikko Perttunen wrote:
```c
#define DRM_TEGRA_SUBMIT_CREATE_POST_SYNC_FILE (1<<0)
#define DRM_TEGRA_SUBMIT_CREATE_POST_SYNCOBJ   (1<<1)
```
The sync object shouldn't be created by the kernel driver, and we shouldn't need the sync-file FD.
Please consider this example of how a syncobj may be used:
1. The syncobj is created by userspace.
2. The syncobj's handle (post_fence) is passed with the job to the kernel driver.
3. Userspace waits on the syncobj for the job's submission.
4. The kernel driver attaches job-completion dma-fence(s) to the syncobj after starting to execute the job.
5. Userspace waits on the syncobj for the job's completion.
Syncobj is a superset of the sync-file fence:
a) A syncobj doesn't have a backing file descriptor when it's created. Why would you need an FD if you're not going to share the fence with other processes? It's possible to get an FD out of a syncobj later on; please see [1][2].
b) Syncobj is designed to be used with GPU jobs. For example, userspace passes a job to the GPU driver's scheduler and then waits until the job has started executing on hardware; this is already supported by syncobj.
c) It is possible to attach a sync-file's fence to a syncobj [1][2][3], and then:
- Userspace may attach a sync-file's fence to a syncobj.
- Userspace may use this syncobj as the job's pre-fence.
- The kernel driver will take the pre-fence out of the syncobj during the job's submission and reset the syncobj's fence to NULL.
- The job's scheduler will wait on this pre-fence before starting to execute the job.
- Once the job has started executing, the job's scheduler will attach the job's completion fence to the syncobj. Now the syncobj has a post-fence!
d) It is possible to get a sync-file FD out of a syncobj [1][2][4].
[1] https://elixir.bootlin.com/linux/v5.7.2/source/include/uapi/drm/drm.h#L730
[2] https://elixir.bootlin.com/linux/v5.7.2/source/include/uapi/drm/drm.h#L933
[3] https://elixir.bootlin.com/linux/v5.7.2/source/drivers/gpu/drm/drm_syncobj.c...
[4] https://elixir.bootlin.com/linux/v5.7.2/source/drivers/gpu/drm/drm_syncobj.c...
===
So, a sync object can carry both a pre-fence and a post-fence, and they can be sync-file FDs!
The corollary is that we can get away with using a single syncobj handle for the job's submission IOCTL.
On 6/25/20 3:47 AM, Dmitry Osipenko wrote:
Ah, clearly I hadn't understood syncobjs properly :) Last time I spent significant time on this they didn't exist yet. I'll check this out.
Mikko
On 23.06.2020 15:09, Mikko Perttunen wrote:
### DRM_TEGRA_CHANNEL_MAP
Make memory accessible by the engine while executing work on the channel.
```c
#define DRM_TEGRA_CHANNEL_MAP_READWRITE (1<<0)

struct drm_tegra_channel_map {
	/*
	 * [in] ID of the channel for which to map memory to.
	 */
	__u32 channel_id;

	/*
	 * [in] GEM handle of the memory to map.
	 */
	__u32 handle;

	/*
	 * [in] Offset in GEM handle of the memory area to map.
	 *
	 * Must be aligned by 4K.
	 */
	__u64 offset;
```
Could you please give a use-case example for this partial mapping?
I vaguely recall that it may have been me who suggested this in the past..
I can see that this could be useful for a case where userspace allocates a large chunk of memory and then performs sub-allocations in the userspace driver. But do we have a real-world example for this right now?
Please see more below.
```c
	/*
	 * [in] Length of memory area to map in bytes.
	 *
	 * Must be aligned by 4K.
	 */
	__u64 length;

	/*
	 * [out] IOVA of mapped memory. Userspace can use this IOVA
	 * directly to refer to the memory to skip using relocations.
	 * Only available if hardware memory isolation is enabled.
	 *
	 * Will be set to 0xffff_ffff_ffff_ffff if unavailable.
	 */
	__u64 iova;

	/*
	 * [out] ID corresponding to the mapped memory to be used for
	 * relocations or unmapping.
	 */
	__u32 mapping_id;

	/*
	 * [in] Flags.
	 */
	__u32 flags;

	__u32 reserved[6];
};
```
It looks to me that we actually need a somewhat different thing here.
This MAP IOCTL maps a portion of a GEM and then returns the mapping_id. I think we need something more flexible that allows us to use GEM handles as the relocation IDs, which should fit nicely with DMA reservations.
What about an IOCTL that wraps one GEM into another GEM? We could wrap a portion of GEM_A into GEM_B, and then map GEM_B using the MAP IOCTL.
It could be something like this:
### DRM_TEGRA_BO_WRAP

```c
struct drm_tegra_wrap_bo {
	__u32 bo_handle_wrapped; // out
	__u32 bo_handle;         // in
	__u64 offset;
	__u64 length;
};
```

### DRM_TEGRA_CHANNEL_MAP

```c
struct drm_tegra_channel_map {
	__u32 channels_mask;
	__u32 mapping_id;
	__u32 bo_handle;
	__u32 flags;
	__u64 iova;
};
```
===
This allows multiple mapping_ids to have the same backing GEM, so a mapping_id could be resolved into a BO during the job's submission for DMA-reservation handling.
Also:
1. Maybe the WRAP IOCTL could be a generic GEM IOCTL?
2. I guess we could start easy without the WRAP IOCTL and add it later on once there is a real-world need.
On Fri, Jun 26, 2020 at 01:47:46AM +0300, Dmitry Osipenko wrote:
I think the main point about this IOCTL was to make mapping/unmapping more efficient and avoid relocations for situations where we know it is safe to do so.
The fact that this can be used to create partial mappings is mostly just an added bonus, in my opinion. Doing this doesn't create much complexity but in turn gives us a lot more flexibility.
A couple of places where I think this could be useful are OpenGL and Vulkan, both of which support buffer suballocation. This has a couple of advantages on modern GPUs where sometimes you want to use very large allocation granularity, etc.
Now, I don't think that we'll see much of that in Tegra DRM directly, although grate could certainly make use of this, I suspect. However, I think for interoperation of dGPU and Tegra DRM (with VIC for post-processing, or hopefully some of the other hardware acceleration engines at some point), this might come in handy.
There are other possible use-cases within just Tegra DRM as well. We may want to only partially map planar buffers for video post-processing, for example. Or map each plane separately.
That's pretty much what we have already above, isn't it? Just because we call the field "mapping_id" doesn't mean that in the background we can't create a GEM object and return its handle as "mapping_id".
One advantage of Mikko's proposal is that we have a single IOCTL rather than two to create the mapping, making it a bit more lightweight.
Thierry
On 26.06.2020 10:34, Thierry Reding wrote:
Thinking a bit more about it, I now changed my mind.
There is no need to perform implicit fencing on each suballocation; instead, explicit fencing should be used for the suballocations.
So, we will need to add relocation flags for the direction and for explicit (or implicit) fencing per relocation. The direction will tell how the fence should be attached to the BO's DMA reservation, while the fence flag will tell whether the job's scheduler should wait on the BO's reservation before executing the job on hardware. All of this will be needed for a proper DRI implementation on older Tegras.
Actually, during my experiments with the UAPI I added both of these flags for the relocations [1], but so far I have really only used the direction flag, relying on implicit fencing.
[1] https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm...
So, let's keep the current variant of this MAP IOCTL as-is.
On 6/26/20 7:35 PM, Dmitry Osipenko wrote:
Let me rephrase to make sure I understand:
For relocations, we should add flags for direction and fencing. This way at submit time we can do the proper fencing operations on the relocated BO's DMA reservation.
This sounds good to me, and I think it makes the "relocation" concept a bit more general than it is currently. I think we could rename it to something like "buffer_usage" (open to bikeshedding), and it can have fence-related flags and relocation-related flags. For newer Tegra chips we don't necessarily need to relocate, but we still may need to handle DMA reservations, so in these cases only the fencing flags would be set.
Mikko
On 28.06.2020 14:16, Mikko Perttunen wrote:
Let me rephrase to make sure I understand:
For relocations, we should add flags for direction and fencing. This way at submit time we can do the proper fencing operations on the relocated BO's DMA reservation.
Correct
This sounds good to me, and I think it makes the "relocation" concept a bit more general than it is currently. I think we could rename it to something like "buffer_usage" (open to bikeshedding), and it can have fence-related flags and relocation-related flags. For newer Tegra chips we don't necessarily need to relocate, but we still may need to handle DMA reservations, so in these cases only the fencing flags would be set.
The kernel driver already knows whether a relocation needs to be patched, since MAP returns 0xffffffff when patching is needed; thus, patching can be auto-skipped by the driver.
I don't think a special flag is required to tell whether relocation needs to be done. The direction and fence flags should be enough for now.
===
I'm curious what you think about another variant:
In the grate-kernel I'm using a BO table which contains unique BO entries for each job's relocations [1]; IDs into this table and target offsets are then embedded into the cmdstream in place of the memory addresses [2]. This way, when the cmdstream is fully firewalled, we're reading the stream's data anyway, and thus it's quite natural to embed the BO table IDs and offsets into the cmdstream itself [3].
As a result:
- Each job's BO is resolved only once during submission.
- The BO table is kept small if the cmdstream contains duplicated relocations.
[1] https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm...
[2] https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm...
[3] https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi...
Of course, this limits the number of BOs per job. In the case of grate-kernel it's at most 64 BOs per job + at most 64MB of target offset. I guess the BO count could be lowered to 32 per job, which should be a bit more realistic, and then the maximum target offset would be 128MB.
So, we could replace the BO table with a mapping table and have a MAPPING_TABLE! :) And it doesn't matter whether the cmdstream is patched or not; either way the MAPPING_TABLE will contain mapping ID + flags.
There are 3 possible variants of how a job could be processed, depending on hardware generation and driver security configuration:
1. Job is fully parsed and patched.
2. Job is patched only (with relocations).
3. Job is neither parsed nor patched.
My variant covers 1 and 3. I guess we could just ignore case 2 for now and fall back to 1, for simplicity. It shouldn't be a problem to add support for a RELOCS_TABLE in addition to the MAPPING_TABLE later on, if it's really needed.
On 6/29/20 1:59 AM, Dmitry Osipenko wrote:
[1] https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm...
So, let's keep the current variant of this MAP IOCTL as-is.
Let me rephrase to make sure I understand:
For relocations, we should add flags for direction and fencing. This way at submit time we can do the proper fencing operations on the relocated BO's DMA reservation.
Correct
This sounds good to me, and I think it makes the "relocation" concept a bit more general than it is currently. I think we could rename it to something like "buffer_usage" (open to bikeshedding), and it can have fence-related flags and relocation-related flags. For newer Tegra chips we don't necessarily need to relocate, but we still may need to handle DMA reservations, so in these cases only the fencing flags would be set.
Kernel driver already knows whether relocation needs to be patched since it returns 0xffffffff by the MAP if patching is needed, and thus, patching could be auto-skipped by the driver.
I don't think that a special flag is required for telling about whether relocation needs to be done. The direction and fence flags should be enough for now.
Yeah, I guess it's not really required.
===
I'm curios what do you think about another variant:
In the grate-kernel I'm using a BO table which contains unique BO entries for each job's relocation [1] and then IDs of this table and target offsets are embedded into the cmdstream in-place of the memory addresses [2]. This way, when cmdstream is fully firewalled, we're reading the stream's data anyways, and thus, it's quite nice to embed the BO table IDs and offsets into cmdstream itself [3].
In a result:
Each job's BO is resolved only once during submission.
BO table is kept small if cmdstream contains duplicated relocations.
[1] https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm...
[2] https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm...
[3] https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi...
Of course this limits the number of BOs per-job. In a case of grate-kernel it's max 64 BOs per-job + max 64MB for target offset. I guess the BOs number could be lowered to 32 per-job, which should be a bit more realistic, and then max target offset will be 128MB.
So, we could replace the BO table with a mapping table and have the MAPPING_TABLE! :) And it doesn't matter whether cmdstream patched or not, either way the MAPPING_TABLE will contain mapping ID + flags.
There are 3 possible variants of how job could be processed, depending on hardware generation and driver security configuration:
- Job is fully-parsed and patched.
- Job is patched-only (with relocations).
- Job isn't parsed, nor patched.
My variant covers 1 and 3. I guess we could just ignore the case 2 for now and fall back to 1, for simplicity. It shouldn't be a problem to add support for the RELOCS_TABLE in addition to MAPPING_TABLE later on, if will be really needed.
I think the bitfield will get a bit tight once we add the 'shift' field and 'blocklinear' flag to it :) (which made me notice that apparently my original proposal has the define for DRM_TEGRA_SUBMIT_RELOCATION_BLOCKLINEAR, but doesn't have a flags field in the relocation structure, oops!) Both of those affect the relocation value.
FWIW, we should design option 1 to be functional for new chips as well, as we need to use it e.g. if IOMMU is disabled.
Mikko
30.06.2020 13:55, Mikko Perttunen wrote:
On 6/29/20 1:59 AM, Dmitry Osipenko wrote:
28.06.2020 14:16, Mikko Perttunen wrote:
On 6/26/20 7:35 PM, Dmitry Osipenko wrote:
26.06.2020 10:34, Thierry Reding wrote:
On Fri, Jun 26, 2020 at 01:47:46AM +0300, Dmitry Osipenko wrote:
23.06.2020 15:09, Mikko Perttunen wrote:
> ### DRM_TEGRA_CHANNEL_MAP
>
> Make memory accessible by the engine while executing work on the
> channel.
>
> ```
> #define DRM_TEGRA_CHANNEL_MAP_READWRITE (1<<0)
>
> struct drm_tegra_channel_map {
>         /*
>          * [in] ID of the channel for which to map memory to.
>          */
>         __u32 channel_id;
>
>         /*
>          * [in] GEM handle of the memory to map.
>          */
>         __u32 handle;
>
>         /*
>          * [in] Offset in GEM handle of the memory area to map.
>          *
>          * Must be aligned by 4K.
>          */
>         __u64 offset;
Could you please give a use-case example for this partial mapping?
I vaguely recalling that maybe it was me who suggested this in the past..
I kinda see that this could be useful for a case where userspace allocates a large chunk of memory and then performs sub-allocations in the userspace driver. But do we have a real-world example for this right now?
I think the main point about this IOCTL was to make mapping/unmapping more efficient and avoid relocations for situations where we know it is safe to do so.
The fact that this can be used to create partial mappings is mostly just an added bonus, in my opinion. Doing this doesn't create much complexity but in turn gives us a lot more flexibility.
A couple of places where I think this could be useful are OpenGL and Vulkan, both of which support buffer suballocation. This has a couple of advantages on modern GPUs where sometimes you want to use very large allocation granularity, etc.
Now, I don't think that we'll see much of that in Tegra DRM directly, although grate could certainly make use of this, I suspect. However, I think for interoperation of dGPU and Tegra DRM (with VIC for post- processing, or hopefully some of the other hardware acceleration engines at some point), this might come in handy.
There are other possible use-cases within just Tegra DRM as well. We may want to only partially map planar buffers for video post-processing, for example. Or map each plane separately.
Please see more below.
>         /*
>          * [in] Length of memory area to map in bytes.
>          *
>          * Must be aligned by 4K.
>          */
>         __u64 length;
>
>         /*
>          * [out] IOVA of mapped memory. Userspace can use this IOVA
>          * directly to refer to the memory to skip using relocations.
>          * Only available if hardware memory isolation is enabled.
>          *
>          * Will be set to 0xffff_ffff_ffff_ffff if unavailable.
>          */
>         __u64 iova;
>
>         /*
>          * [out] ID corresponding to the mapped memory to be used for
>          * relocations or unmapping.
>          */
>         __u32 mapping_id;
>
>         /*
>          * [in] Flags.
>          */
>         __u32 flags;
>
>         __u32 reserved[6];
> };
> ```
It looks to me that we actually need a bit different thing here.
This MAP IOCTL maps a portion of a GEM and then returns the mapping_id. And I think we need something more flexible that will allow us to use GEM handles for the relocation IDs, which should fit nicely with the DMA-reservations.
What about an IOCTL that wraps GEM into another GEM? We could wrap a portion of GEM_A into a GEM_B, and then map the GEM_B using the MAP IOCTL.
It could be something like this:
### DRM_TEGRA_BO_WRAP
```
struct drm_tegra_wrap_bo {
	__u32 bo_handle_wrapped; // out
	__u32 bo_handle; // in
	__u64 offset;
	__u64 length;
};
```
### DRM_TEGRA_CHANNEL_MAP
```
struct drm_tegra_channel_map {
	__u32 channels_mask;
	__u32 mapping_id;
	__u32 bo_handle;
	__u32 flags;
	__u64 iova;
};
```
===
This allows multiple mapping_ids to have the same backing GEM, so the mapping_id could be resolved into a BO during job submission for DMA-reservation handling.
That's pretty much what we have already above, isn't it? Just because we call the field "mapping_id" doesn't mean that in the background we can't create a GEM object and return its handle as "mapping_id".
One advantage of Mikko's proposal is that we have a single IOCTL rather than two to create the mapping, making it a bit more lightweight.
Thinking a bit more about it, I now changed my mind.
There is no need to perform implicit fencing on each suballocation, instead explicit fencing should be used for the suballocations.
So, we will need to add relocation flags for the direction and for explicit (or implicit) fencing per relocation. The direction will tell how the fence should be attached to the BO's DMA reservation, while the fence flag will tell whether the job's scheduler should wait for the BO's reservation before executing the job on hardware. All of this will be needed for a proper DRI implementation on older Tegras.

Actually, during my experiments with the UAPI I added both of these flags for the relocations [1], but so far have only really used the direction flag, relying on implicit fencing.
[1] https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm...
So, let's keep the current variant of this MAP IOCTL as-is.
Let me rephrase to make sure I understand:
For relocations, we should add flags for direction and fencing. This way at submit time we can do the proper fencing operations on the relocated BO's DMA reservation.
Correct
This sounds good to me, and I think it makes the "relocation" concept a bit more general than it is currently. I think we could rename it to something like "buffer_usage" (open to bikeshedding), and it can have fence-related flags and relocation-related flags. For newer Tegra chips we don't necessarily need to relocate, but we still may need to handle DMA reservations, so in these cases only the fencing flags would be set.
The kernel driver already knows whether a relocation needs to be patched, since the MAP IOCTL returns 0xffffffff if patching is needed, and thus patching could be auto-skipped by the driver.
I don't think that a special flag is required for telling about whether relocation needs to be done. The direction and fence flags should be enough for now.
Yeah, I guess it's not really required.
===
I'm curious what you think about another variant:
In the grate-kernel I'm using a BO table which contains unique BO entries for each job's relocation [1], and then IDs into this table plus target offsets are embedded into the cmdstream in place of the memory addresses [2]. This way, when the cmdstream is fully firewalled, we're reading the stream's data anyway, and thus it's quite natural to embed the BO table IDs and offsets into the cmdstream itself [3].

As a result:
- Each job's BO is resolved only once during submission.
- BO table is kept small if cmdstream contains duplicated relocations.
[1] https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm...
[2] https://github.com/grate-driver/linux/blob/master/include/uapi/drm/tegra_drm...
[3] https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi...
Of course this limits the number of BOs per job. In the case of grate-kernel it's max 64 BOs per job + max 64MB for the target offset. I guess the number of BOs could be lowered to 32 per job, which should be a bit more realistic, and then the max target offset will be 128MB.
So, we could replace the BO table with a mapping table and have the MAPPING_TABLE! :) And it doesn't matter whether the cmdstream is patched or not; either way the MAPPING_TABLE will contain mapping ID + flags.
There are 3 possible variants of how a job could be processed, depending on hardware generation and driver security configuration:

1. Job is fully parsed and patched.
2. Job is patched-only (with relocations).
3. Job isn't parsed, nor patched.
My variant covers 1 and 3. I guess we could just ignore case 2 for now and fall back to 1, for simplicity. It shouldn't be a problem to add support for a RELOCS_TABLE in addition to the MAPPING_TABLE later on, if it's really needed.
I think the bitfield will get a bit tight once we add the 'shift' field and 'blocklinear' flag to it :) (which made me notice that apparently my original proposal has the define for DRM_TEGRA_SUBMIT_RELOCATION_BLOCKLINEAR, but doesn't have a flags field in the relocation structure, oops!) Both of those affect the relocation value.
The flags will be in the MAPPING_TABLE. If you have two mappings that use different modifier flags, then they should be two different entries in the table.
The MAPPING_TABLE is similar to a RELOCATION_TABLE, but it's a bit more flexible since it allows avoiding duplicated entries. For example, if we have a job that copies one area of a mapping to another area of the same mapping, then it will be a single entry in the table.
IIUC, the shift is always a constant value for each address register, correct? Hence, the shift value could be looked up from the kernel driver's address register table for each patched relocation during the patching process.
FWIW, we should design option 1 to be functional for new chips as well, as we need to use it e.g. if IOMMU is disabled.
The firewall could also be useful for debugging purposes. It should work universally across all hardware versions.
On Tue, Jun 23, 2020 at 3:03 PM Mikko Perttunen cyndis@kapsi.fi wrote:
# Host1x/TegraDRM UAPI proposal
This is a proposal for a stable UAPI for Host1x and TegraDRM, to replace the current TegraDRM UAPI that is behind `STAGING` and quite obsolete in many ways.
I haven't written any implementation yet -- I'll do that once there is some agreement on the high-level design.
Current open items:
- The syncpoint UAPI allows userspace to create sync_file FDs with
arbitrary syncpoint fences. dma_fence code currently seems to assume all fences will be signaled, which would not necessarily be the case with this interface.
- Previously present GEM IOCTLs (GEM_CREATE, GEM_MMAP) are not present.
Not sure if they are still needed.
Hi, as this wasn't addressed here (and sorry if I missed it): is there an open source userspace making use of this UAPI? Because this is something which needs to be seen before it can be included at all.
## Introduction to the hardware
Tegra Host1x is a hardware block providing the following capabilities:
- Syncpoints, a unified whole-system synchronization primitive, allowing
synchronization of work between graphics, compute and multimedia engines, CPUs including cross-VM synchronization, and devices on the PCIe bus, without incurring CPU overhead.
- Channels, a command DMA mechanism allowing asynchronous programming of
various engines, integrating with syncpoints.
- Hardware virtualization support for syncpoints and channels. (On
Tegra186 and newer)
This proposal defines APIs for userspace access to syncpoints and channels. Kernel drivers can additionally use syncpoints and channels internally, providing other userspace interfaces (e.g. V4L2).
Syncpoint and channel interfaces are split into separate parts, as syncpoints are useful as a system synchronization primitive even without using the engine drivers provided through TegraDRM. For example, a computer vision pipeline consisting of video capture, CPU processing and GPU processing would not necessarily use the engines provided by TegraDRM. See the Example workflows section for more details.
## Syncpoint interface
Syncpoints are a set of 32-bit values providing the following operations:
- Atomically increment value by one
- Read current value
- Wait until value reaches specified threshold. For waiting, the 32-bit
value space is treated modulo 2^32; e.g. if the current value is 0xffffffff, then value 0x0 is considered to be one increment in the future.
Each syncpoint is identified by a system-global ID ranging between [0, number of syncpoints supported by hardware). The entire system has read-only access to all syncpoints based on their ID.
Syncpoints are managed through the device node /dev/host1x provided by the gpu/host1x driver.
### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
Allocates a free syncpoint, returning a file descriptor representing it. Only the owner of the file descriptor is allowed to mutate the value of the syncpoint.
```
struct host1x_ctrl_allocate_syncpoint {
	/**
	 * @fd:
	 *
	 * [out] New file descriptor representing the allocated syncpoint.
	 */
	__s32 fd;

	__u32 reserved[3];
};
```
### IOCTL HOST1X_SYNCPOINT_INFO (on syncpoint file descriptor)
Allows retrieval of the system-global syncpoint ID corresponding to the syncpoint.
Use cases:
- Passing ID to other system components that identify syncpoints by ID
- Debugging and testing
```
struct host1x_syncpoint_info {
	/**
	 * @id:
	 *
	 * [out] System-global ID of the syncpoint.
	 */
	__u32 id;

	__u32 reserved[3];
};
```
### IOCTL HOST1X_SYNCPOINT_INCREMENT (on syncpoint file descriptor)
Allows incrementing of the syncpoint value.
Use cases:
- Signalling work completion when executing a pipeline step on the CPU
- Debugging and testing
```
struct host1x_syncpoint_increment {
	/**
	 * @count:
	 *
	 * [in] Number of times to increment syncpoint. Value can be
	 * observed in in-between values, but increments are atomic.
	 */
	__u32 count;
};
```
### IOCTL HOST1X_READ_SYNCPOINT (on /dev/host1x)
Read the value of a syncpoint based on its ID.
Use cases:
- Allows more fine-grained tracking of task progression for debugging
purposes
```
struct host1x_ctrl_read_syncpoint {
	/**
	 * @id:
	 *
	 * [in] ID of syncpoint to read.
	 */
	__u32 id;

	/**
	 * @value:
	 *
	 * [out] Value of the syncpoint.
	 */
	__u32 value;
};
```
### IOCTL HOST1X_CREATE_FENCE (on /dev/host1x)
Creates a new SYNC_FILE fence file descriptor for the specified syncpoint ID and threshold.
Use cases:
- Creating a fence when receiving ID/threshold pair from another system
component
- Creating a postfence when executing a pipeline step on the CPU
- Creating a postfence when executing a pipeline step controlled by
userspace (e.g. GPU userspace submission)
```
struct host1x_ctrl_create_fence {
	/**
	 * @id:
	 *
	 * [in] ID of the syncpoint for which to create a fence.
	 */
	__u32 id;

	/**
	 * @threshold:
	 *
	 * [in] Threshold value for fence.
	 */
	__u32 threshold;

	/**
	 * @fence_fd:
	 *
	 * [out] New sync_file FD corresponding to the ID and threshold.
	 */
	__s32 fence_fd;

	__u32 reserved[1];
};
```
### IOCTL HOST1X_GET_FENCE_INFO (on /dev/host1x)
Allows retrieval of the ID/threshold pairs corresponding to a SYNC_FILE fence or fence array.
Use cases:
- Debugging and testing
- Transmitting a fence to another system component requiring ID/threshold
- Getting ID/threshold for prefence when programming a pipeline step
controlled by userspace (e.g. GPU userspace submission)
```
/* If set, corresponding fence is backed by Host1x syncpoints. */
#define HOST1X_CTRL_FENCE_INFO_SYNCPOINT_FENCE (1 << 0)

struct host1x_ctrl_fence_info {
	/**
	 * @flags:
	 *
	 * [out] HOST1X_CTRL_FENCE_INFO flags.
	 */
	__u32 flags;

	/**
	 * @id:
	 *
	 * [out] ID of the syncpoint corresponding to this fence.
	 * Only set if HOST1X_CTRL_FENCE_INFO_SYNCPOINT_FENCE is set.
	 */
	__u32 id;

	/**
	 * @threshold:
	 *
	 * [out] Signalling threshold of the fence.
	 * Only set if HOST1X_CTRL_FENCE_INFO_SYNCPOINT_FENCE is set.
	 */
	__u32 threshold;

	__u32 reserved[1];
};

struct host1x_ctrl_get_fence_info {
	/**
	 * @fence_fd:
	 *
	 * [in] Syncpoint-backed sync_file FD for which to retrieve info.
	 */
	__s32 fence_fd;

	/**
	 * @num_fences:
	 *
	 * [in] Size of `fence_info` array in elements.
	 * [out] Number of fences held by the FD.
	 */
	__u32 num_fences;

	/**
	 * @fence_info:
	 *
	 * [in] Pointer to array of 'struct host1x_ctrl_fence_info'
	 * where info will be stored.
	 */
	__u64 fence_info;

	__u32 reserved[1];
};
```
## Channel interface
### DRM_TEGRA_OPEN_CHANNEL
```
struct drm_tegra_open_channel {
	/**
	 * @host1x_class:
	 *
	 * [in] Host1x class (engine) the channel will target.
	 */
	__u32 host1x_class;

	/**
	 * @flags:
	 *
	 * [in] Flags. Currently none are specified.
	 */
	__u32 flags;

	/**
	 * @channel_id:
	 *
	 * [out] Process-specific identifier corresponding to opened
	 * channel. Not the hardware channel ID.
	 */
	__u32 channel_id;

	/**
	 * @hardware_version:
	 *
	 * [out] Hardware version of the engine targeted by the channel.
	 * Userspace can use this to select appropriate programming
	 * sequences.
	 */
	__u32 hardware_version;

	/**
	 * @mode:
	 *
	 * [out] Mode the hardware is executing in. Some engines can be
	 * configured with different firmware supporting different
	 * functionality depending on the system configuration. This
	 * value allows userspace to detect if the engine is configured
	 * for the intended use case.
	 */
	__u32 mode;

	__u32 reserved[3];
};
```
### DRM_TEGRA_CLOSE_CHANNEL
```
struct drm_tegra_close_channel {
	/**
	 * @channel_id:
	 *
	 * [in] ID of channel to close.
	 */
	__u32 channel_id;

	__u32 reserved[3];
};
```
### DRM_TEGRA_CHANNEL_MAP
Make memory accessible by the engine while executing work on the channel.
```
#define DRM_TEGRA_CHANNEL_MAP_READWRITE (1<<0)

struct drm_tegra_channel_map {
	/*
	 * [in] ID of the channel for which to map memory to.
	 */
	__u32 channel_id;

	/*
	 * [in] GEM handle of the memory to map.
	 */
	__u32 handle;

	/*
	 * [in] Offset in GEM handle of the memory area to map.
	 *
	 * Must be aligned by 4K.
	 */
	__u64 offset;

	/*
	 * [in] Length of memory area to map in bytes.
	 *
	 * Must be aligned by 4K.
	 */
	__u64 length;

	/*
	 * [out] IOVA of mapped memory. Userspace can use this IOVA
	 * directly to refer to the memory to skip using relocations.
	 * Only available if hardware memory isolation is enabled.
	 *
	 * Will be set to 0xffff_ffff_ffff_ffff if unavailable.
	 */
	__u64 iova;

	/*
	 * [out] ID corresponding to the mapped memory to be used for
	 * relocations or unmapping.
	 */
	__u32 mapping_id;

	/*
	 * [in] Flags.
	 */
	__u32 flags;

	__u32 reserved[6];
};
```
### DRM_TEGRA_CHANNEL_UNMAP
Unmap previously mapped memory. Userspace shall do this only after it has determined the channel will no longer access the memory.
```
struct drm_tegra_channel_unmap {
	/*
	 * [in] ID of the mapping to remove.
	 */
	__u32 mapping_id;

	__u32 reserved[3];
};
```
### DRM_TEGRA_CHANNEL_SUBMIT
Submit a job to the engine/class targeted by the channel.
```
struct drm_tegra_submit_syncpt_incr {
	/*
	 * [in] Syncpoint FD of the syncpoint that the job will
	 * increment.
	 */
	__s32 syncpt_fd;

	/*
	 * [in] Number of increments that the job will do.
	 */
	__u32 num_incrs;

	/*
	 * [out] Value the syncpoint will have once all increments have
	 * executed.
	 */
	__u32 fence_value;

	__u32 reserved[1];
};

/* Sets paddr/IOVA bit 39 on T194 to enable MC swizzling */
#define DRM_TEGRA_SUBMIT_RELOCATION_BLOCKLINEAR (1<<0)

struct drm_tegra_submit_relocation {
	/* [in] Index of GATHER or GATHER_UPTR command in commands. */
	__u32 gather_command_index;

	/*
	 * [in] Mapping handle (obtained through CHANNEL_MAP) of the memory
	 * the address of which will be patched in.
	 */
	__u32 mapping_id;

	/*
	 * [in] Offset in the gather that will be patched.
	 */
	__u64 gather_offset;

	/*
	 * [in] Offset in target buffer whose paddr/IOVA will be written
	 * to the gather.
	 */
	__u64 target_offset;

	/*
	 * [in] Number of bits the resulting address will be logically
	 * shifted right before writing to gather.
	 */
	__u32 shift;

	__u32 reserved[1];
};

/* Command is an opcode gather from a GEM handle */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER 0
/* Command is an opcode gather from a user pointer */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR 1
/* Command is a wait for syncpt fence completion */
#define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCPT 2
/* Command is a wait for SYNC_FILE FD completion */
#define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNC_FILE 3
/* Command is a wait for DRM syncobj completion */
#define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCOBJ 4

/*
 * Allow driver to skip command execution if engine
 * was not accessed by another channel between
 * submissions.
 */
#define DRM_TEGRA_SUBMIT_CONTEXT_SETUP (1<<0)

struct drm_tegra_submit_command {
	__u16 type;
	__u16 flags;

	union {
		struct {
			/* GEM handle */
			__u32 handle;

			/*
			 * Offset into GEM object in bytes.
			 * Must be aligned by 4.
			 */
			__u64 offset;

			/*
			 * Length of gather in bytes.
			 * Must be aligned by 4.
			 */
			__u64 length;
		} gather;

		struct {
			__u32 reserved[1];

			/*
			 * Pointer to gather data.
			 * Must be aligned by 4 bytes.
			 */
			__u64 base;

			/*
			 * Length of gather in bytes.
			 * Must be aligned by 4.
			 */
			__u64 length;
		} gather_uptr;

		struct {
			__u32 syncpt_id;
			__u32 threshold;
			__u32 reserved[1];
		} wait_syncpt;

		struct {
			__s32 fd;
		} wait_sync_file;

		struct {
			__u32 handle;
		} wait_syncobj;
	};
};

#define DRM_TEGRA_SUBMIT_CREATE_POST_SYNC_FILE (1<<0)
#define DRM_TEGRA_SUBMIT_CREATE_POST_SYNCOBJ (1<<1)

struct drm_tegra_channel_submit {
	__u32 channel_id;
	__u32 flags;

	/**
	 * [in] Timeout in microseconds after which the kernel may
	 * consider the job to have hung and may reap it and
	 * fast-forward its syncpoint increments.
	 *
	 * The value may be capped by the kernel.
	 */
	__u32 timeout;

	__u32 num_syncpt_incrs;
	__u32 num_relocations;
	__u32 num_commands;

	__u64 syncpt_incrs;
	__u64 relocations;
	__u64 commands;

	/**
	 * [out] Invalid, SYNC_FILE FD or syncobj handle, depending on
	 * if DRM_TEGRA_SUBMIT_CREATE_POST_SYNC_FILE,
	 * DRM_TEGRA_SUBMIT_CREATE_POST_SYNCOBJ, or neither are passed.
	 * Passing both is an error.
	 *
	 * The created fence object is signaled when all syncpoint
	 * increments specified in `syncpt_incrs' have executed.
	 */
	__u32 post_fence;

	__u32 reserved[3];
};
```
## Example workflows
### Image processing with TegraDRM/VIC
This example is a simple single-step operation using VIC through TegraDRM. For example, assume we have a dma-buf fd with an image we want to convert from YUV to RGB. This can occur for example as part of video decoding.
Syncpoint initialization
- Allocate syncpoint (HOST1X_ALLOCATE_SYNCPOINT)
  - This is used to track VIC submission completion.
- Retrieve syncpoint ID (HOST1X_SYNCPOINT_INFO)
  - The ID is required to program the increment as part of the submission.
Buffer allocation
- Allocate memory for configuration buffers (DMA Heaps)
- Import configuration buffer dma-buf as GEM object
- Import input image dma-buf as GEM object
Channel initialization
- Open VIC channel (DRM_TEGRA_OPEN_CHANNEL)
- Map buffers for access by VIC (DRM_TEGRA_CHANNEL_MAP)
- Create Host1x opcode buffer as userspace memory
  - If buffer mapping returned an IOVA, that IOVA can be placed directly into the buffer. Otherwise, a relocation has to be passed as part of the submission.
  - The buffer should contain a syncpoint increment for the syncpoint allocated earlier.
- Submit work, passing in the syncpoint file descriptor allocated in the beginning. The submit optionally returns a syncfd/syncobj that can be used to wait for submission completion.
  - If more fine-grained syncpoint waiting is required, the 'fence_value' out-parameter of 'drm_tegra_submit_syncpt_incr' can be used in conjunction with HOST1X_CREATE_FENCE to create specific fences.
### Camera-GPU-CPU pipeline without TegraDRM
This example shows a pipeline with image input from a camera being processed using the GPU programmed from userspace, and then finally analyzed by CPU. This kind of usecase can occur e.g. as part of a computer vision usecase.
Syncpoint initialization
- Camera V4L2 driver allocates a syncpoint internally within the kernel.
- For CPU job tracking, allocate a syncpoint as in "Image processing with TegraDRM/VIC".
- For GPU job tracking, the GPU kernel driver would allocate a syncpoint and assign it such that the GPU channel can access it.
Camera pipeline step
- Allocate a dma-buf to store the captured image.
- Trigger camera capture and store the resulting sync_file fd.
GPU pipeline step
- Use HOST1X_GET_FENCE_INFO to extract syncpoint ID/threshold pair(s) from camera step's post-fence sync_file FD. If the sync_file FD is not backed by syncpoints, wait for the sync_file FD to signal otherwise (e.g. through polling it).
- Use HOST1X_CREATE_FENCE to create a postfence that is signaled when the GPU step is complete.
- Program the GPU to
  - Wait for the syncpoint thresholds extracted from the camera postfence, if we were able to do so.
  - Execute image processing on GPU.
  - Increment GPU's job tracking syncpoint, causing the GPU post-fence FD to get signaled.
CPU pipeline step
- Wait for GPU's post-fence sync_file FD
- Map the dma-buf containing the image and retrieve results.
In place of the GPU pipeline step, a similar workflow would apply for a step executed on the CPU.
--
thanks, Mikko
dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On 6/26/20 2:06 PM, Karol Herbst wrote:
Hi, as this wasn't addressed here (and sorry if I missed it): is there an open source userspace making use of this UAPI? Because this is something which needs to be seen before it can be included at all.
Hi Karol,
not currently, but once we have hashed out the design a little bit (and I'm back from vacation), I'll work on open source userspace and converting existing code using the staging UAPI to this one.
Mikko
On Fri, Jun 26, 2020 at 1:13 PM Mikko Perttunen cyndis@kapsi.fi wrote:
okay, cool, sounds good!
On Fri, Jun 26, 2020 at 01:06:58PM +0200, Karol Herbst wrote:
Hi, as this wasn't addressed here (and sorry if I missed it): is there an open source userspace making use of this UAPI? Because this is something which needs to be seen before it can be included at all.
There's a set of commits that implement an earlier draft here:
https://github.com/thierryreding/linux/tree/for-5.6/drm-tegra-destage-abi
and corresponding changes to libdrm to make use of that and implement tests using the various generations of VIC here:
https://cgit.freedesktop.org/~tagr/drm/log/
Before actually jumping to an implementation we wanted to go over the design a bit more to avoid wasting a lot of work, like we've done in the past. Turns out it isn't easy to get everyone to agree on a good ABI. Who would've thought? =)
I expect that once the discussion around the ABI settles Mikko will start on implementing this ABI in the kernel and we'll update the libdrm patches to make use of the new ABI as well.
Thierry
On Fri, Jun 26, 2020 at 01:40:40PM +0200, Thierry Reding wrote:
Do we have more than libdrm and tests for this, like something somewhat functional? Since tegradrm landed years ago we've refined the criteria a bit in this regard; the latest version is explicit that it needs to be something that's functional (for end-users in some form), not just infrastructure and tests. -Daniel
26.06.2020 16:38, Daniel Vetter wrote:
We have Opentegra [1] and VDPAU [2] userspace drivers for older Tegra SoCs that have been using the staging UAPI for years now.
[1] https://github.com/grate-driver/xf86-video-opentegra [2] https://github.com/grate-driver/libvdpau-tegra
These drivers badly need the UAPI and kernel driver updates because of the missing UAPI synchronization features and the performance problems of the current kernel driver.
So we already have some real-world userspace for testing!
This is not the first or even the second time we're discussing the Tegra DRM UAPI changes, but this time it feels like there is a good chance that it will finally materialize, and I'm very happy about it :)
On Fri, Jun 26, 2020 at 04:59:45PM +0300, Dmitry Osipenko wrote:
Awesome! Maybe for future rounds include the links for the vdpau driver. I didn't know about that one at all. -opentegra is probably not so relevant anymore (and I flat out forgot it exists) ... Including the userspace side links is good so that forgetful people like me don't nag :-) -Daniel
23.06.2020 15:09, Mikko Perttunen wrote:
```c
struct drm_tegra_submit_syncpt_incr {
	/*
	 * [in] Syncpoint FD of the syncpoint that the job will
	 * increment.
	 */
	__s32 syncpt_fd;

	/*
	 * [in] Number of increments that the job will do.
	 */
	__u32 num_incrs;

	/*
	 * [out] Value the syncpoint will have once all increments have
	 * executed.
	 */
	__u32 fence_value;

	__u32 reserved[1];
};
```
The job should be considered executed once the final sync point is incremented.
Hence there should be only one sync point to increment per job; why would you ever need more than one?
Could you please explain what this submit_syncpt_incr is about?
On 6/28/20 12:47 AM, Dmitry Osipenko wrote:
This tells the kernel which syncpoint will be incremented, and how many times, for job tracking and for verifying that the user has access to that syncpoint.
A second syncpoint is used for NVENC in slice encoding mode, where the engine can be programmed to count encoded slices by incrementing a syncpoint. I'll ask around to see if I can find some more details on this.
Since the use case is somewhat niche, we could see if we can have a design with only one syncpoint, but extensible later if needed.
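To make the tracking concrete, here is a minimal sketch (plain C, hypothetical values; in the real UAPI `syncpt_fd` would come from HOST1X_ALLOCATE_SYNCPOINT and the kernel would fill in `fence_value`) of how the reported fence value relates to the increment count, including the modulo-2^32 wraparound the proposal describes for syncpoint values:

```c
#include <assert.h>

/* Mirrors the proposed UAPI struct, spelled with plain C types so the
 * snippet is self-contained. */
struct drm_tegra_submit_syncpt_incr {
	int syncpt_fd;             /* [in] syncpoint FD (hypothetical value here) */
	unsigned int num_incrs;    /* [in] increments the job will do */
	unsigned int fence_value;  /* [out] value after all increments */
	unsigned int reserved[1];
};

/* The fence value the kernel would report back: the syncpoint's current
 * value plus the job's increment count, wrapping modulo 2^32 (assuming
 * a 32-bit unsigned int, as on the relevant platforms). */
static unsigned int job_fence_value(unsigned int current_value,
				    const struct drm_tegra_submit_syncpt_incr *incr)
{
	return current_value + incr->num_incrs;
}
```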
Mikko
23.06.2020 15:09, Mikko Perttunen wrote:
```c
/* Command is an opcode gather from a GEM handle */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER         0
/* Command is an opcode gather from a user pointer */
#define DRM_TEGRA_SUBMIT_COMMAND_GATHER_UPTR    1
/* Command is a wait for syncpt fence completion */
#define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCPT    2
/* Command is a wait for SYNC_FILE FD completion */
#define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNC_FILE 3
/* Command is a wait for DRM syncobj completion */
#define DRM_TEGRA_SUBMIT_COMMAND_WAIT_SYNCOBJ   4

/*
 * Allow driver to skip command execution if engine
 * was not accessed by another channel between
 * submissions.
 */
#define DRM_TEGRA_SUBMIT_CONTEXT_SETUP (1 << 0)

struct drm_tegra_submit_command {
	__u16 type;
	__u16 flags;
```
Shouldn't the "packed" attribute be needed if non-32-bit-aligned fields are used?
```c
	union {
		struct {
			/* GEM handle */
			__u32 handle;

			/*
			 * Offset into GEM object in bytes.
			 * Must be aligned by 4.
			 */
			__u64 offset;
```
64 bits for a gather offset is a bit too much; in most cases gathers are under 4K.
u32 should be more than enough (maybe even u16 if the offset is given at dword granularity).
```c
			/*
			 * Length of gather in bytes.
			 * Must be aligned by 4.
			 */
			__u64 length;
```
u32/16
```c
		} gather;

		struct {
			__u32 reserved[1];

			/*
			 * Pointer to gather data.
			 * Must be aligned by 4 bytes.
			 */
			__u64 base;

			/*
			 * Length of gather in bytes.
			 * Must be aligned by 4.
			 */
			__u64 length;
		} gather_uptr;
```
What about inlining the UPTR gather data and relocs into the drm_tegra_submit_command[] buffer:
```c
struct drm_tegra_submit_command {
	struct {
		u16 num_words;
		u16 num_relocs;

		gather_data[];
		drm_tegra_submit_relocation relocs[];
	} gather_uptr;
};

struct drm_tegra_channel_submit {
	__u32 num_syncpt_incrs;
	__u32 syncpt_idx;

	__u64 commands_ptr;
	__u32 commands_size;
	...
};

struct drm_tegra_submit_command example[] = {
	cmd.gather_uptr{},
	...
	gather_data[],
	gather_relocs[],
	cmd.wait_syncpt{},
	...
};
```
This way we will have only a single copy_from_user() for the whole cmdstream, which should be more efficient and nicer from both the userspace and kernel perspectives with regard to forming and parsing the commands.
On 6/28/20 1:32 AM, Dmitry Osipenko wrote:
Shouldn't the "packed" attribute be needed if a non-32bit aligned fields are used?
I guess we should change these to u32 for consistency.
64 bits for a gather offset is a bit too much; in most cases gathers are under 4K. u32 should be more than enough (maybe even u16 if the offset is given at dword granularity).
I guess this can be changed to u32, though I don't think there is any particularly pressing reason not to use u64.
In any case, I think we concluded that we'll drop the GEM gather command for now.
I'm not sure I agree that it's nicer with regard to forming and parsing:
- Can you actually place multiple unsized arrays in a struct without pointers?
- gather_data's length is a multiple of 4, and since we have u64s in drm_tegra_submit_relocation, alignment needs to be taken care of manually: userspace has to insert it when forming, and the kernel needs to validate it while parsing. Depending on the number of words in the gather, padding would need to be inserted. We could swap the two around, but it still feels more brittle.
Also, if we read the gather from a separate address, userspace doesn't necessarily need to know the length of the gather (and number of relocs) upfront, so it's easier to have a builder pattern without needing to copy things later.
If we allow direct page mappings for gathers later, a separate address would make things also a little bit easier. For direct page mappings userspace would need to keep the opcode data unchanged while the job is being executed, so userspace could keep an arena where the gathers are placed, and refer to segments of that, instead of having to keep the drm_tegra_submit_commands alive.
Mikko
28.06.2020 13:28, Mikko Perttunen wrote:
In any case, I think we concluded that we'll drop the GEM gather command for now.
Sure, I commented on this just in case, for the future :)
- Can you actually place multiple unsized arrays in a struct without pointers?
No :) But there are no unsized arrays here. The parser will read the first command and then move the pointer to the next command based on the size of the parsed command.
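A minimal, userspace-testable sketch of such a sequential parser (the header layout, field names, and type codes here are illustrative, not part of the proposed UAPI; it assumes each command is a 4-byte header immediately followed by num_words of inline payload):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical inline layout: a 4-byte header, then num_words of data. */
struct cmd_header {
	uint16_t type;      /* e.g. 1 = gather_uptr */
	uint16_t num_words; /* payload length in 32-bit words */
};

/* Walk the command stream: read one header, skip its payload, repeat.
 * Returns the number of commands parsed, or -1 if a command runs past
 * the end of the buffer. */
static int parse_stream(const uint8_t *buf, size_t size)
{
	size_t pos = 0;
	int count = 0;

	while (pos < size) {
		struct cmd_header hdr;

		if (size - pos < sizeof(hdr))
			return -1;
		memcpy(&hdr, buf + pos, sizeof(hdr));
		pos += sizeof(hdr);

		if ((size - pos) / 4 < hdr.num_words)
			return -1;
		/* ...validate/copy hdr.num_words words of payload here... */
		pos += (size_t)hdr.num_words * 4;
		count++;
	}
	return count;
}
```

The kernel-side version would additionally dispatch on hdr.type and validate the payload contents, but the pointer-advancing loop is the same shape.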
- gather_data's length is a multiple of 4, and since we have u64s in drm_tegra_submit_relocation, alignment needs to be taken care of manually, both when forming and when the kernel validates it while parsing. Depending on the number of words in the gather, padding would need to be inserted. We could swap the two around, but it still feels more brittle.
Yes, there will be unaligned reads on ARM64, but I don't think that's a problem, is it?
Also, if we read the gather from a separate address, userspace doesn't necessarily need to know the length of the gather (and number of relocs) upfront, so it's easier to have a builder pattern without needing to copy things later.
Usually there are 2-3 relocs per gather, so userspace could simply maintain a fixed-size static buffer for the relocs (say, 32 relocs). Once a gather is sliced by userspace, the stashed relocs are appended to the commands buffer and the next gather is formed.
Copying the relocs doesn't really cost userspace anything; the price is incomparable with the cost of the per-gather UPTR copying done by the kernel.
Well, the UPTR copying isn't that expensive, but if there is a single copy for the whole IOCTL, that's even better!
If we allow direct page mappings for gathers later, a separate address would make things also a little bit easier. For direct page mappings userspace would need to keep the opcode data unchanged while the job is being executed, so userspace could keep an arena where the gathers are placed, and refer to segments of that, instead of having to keep the drm_tegra_submit_commands alive.
Okay, what about having a single UPTR buffer for all gathers of a job?
```c
struct drm_tegra_channel_submit {
	u64 commands_ptr;
	u64 gathers_ptr;

	u32 commands_size;
	u32 gathers_size;
	...
};

struct drm_tegra_submit_command {
	...
	union {
		struct {
			/*
			 * Length of gather in bytes.
			 * Must be aligned by 4.
			 */
			__u32 length;
		} gather_uptr;
	};
};
```
Now we have a single UPTR gather that can be sliced into sub-gathers during job submission, and the whole data can be copied at once by the kernel driver for parsing purposes.
Userspace will have to preallocate a large-enough buffer for the gathers upfront. If it runs out of space in the buffer, it should reallocate a larger buffer. Nice and simple :)
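A sketch of the kernel-side check this scheme would imply (struct and function names are hypothetical, not the proposed UAPI; it assumes each gather_uptr command consumes the next length bytes of the single gathers buffer, so the commands must slice it exactly):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trimmed-down command: only the gather_uptr length. */
struct submit_command {
	uint32_t type;   /* 1 = gather_uptr (illustrative type code) */
	uint32_t length; /* bytes consumed from the gathers buffer */
};

/* Verify that the gather_uptr commands slice the single gathers buffer
 * exactly: every length non-zero and 4-byte aligned, and the lengths
 * summing to gathers_size. */
static bool validate_gather_slices(const struct submit_command *cmds,
				   size_t num_cmds, uint32_t gathers_size)
{
	uint64_t total = 0; /* 64-bit so the sum cannot overflow */
	size_t i;

	for (i = 0; i < num_cmds; i++) {
		if (cmds[i].type != 1)
			continue;
		if (cmds[i].length == 0 || (cmds[i].length & 3))
			return false;
		total += cmds[i].length;
	}
	return total == gathers_size;
}
```

After this check the kernel can do a single copy_from_user() of gathers_size bytes and hand each command its slice by walking an offset through the copied buffer.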
On 6/29/20 3:00 AM, Dmitry Osipenko wrote:
Sure, I commented on this just in case, for the future :)
Yep, OK :)
Yes, there will be unaligned reads on ARM64, but I don't think that's a problem, is it?
It's not a big problem, but I think it would be less error-prone to have naturally aligned data.
Okay, what about having a single UPTR buffer for all gathers of a job?
I think that would work fine for the use cases that I can think of.
Mikko
23.06.2020 15:09, Mikko Perttunen wrote:
### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
Allocates a free syncpoint, returning a file descriptor representing it. Only the owner of the file descriptor is allowed to mutate the value of the syncpoint.
```c
struct host1x_ctrl_allocate_syncpoint {
	/**
	 * @fd:
	 *
	 * [out] New file descriptor representing the allocated syncpoint.
	 */
	__s32 fd;

	__u32 reserved[3];
};
```
We need at least these basic things from the sync point API:
- An execution context shouldn't be able to tamper with the sync points of other contexts.
- A sync point can be shared with other contexts for explicit fencing.
- Sync points should work reliably.
Some problems of the current Host1x driver, like where it falls over if a sync point value is out of sync, plus all the hang-job recovery labor, could be easily reduced if sync point health were protected by extra UAPI constraints.
So I think we may want the following:
1. We still need to assign a sync point ID to a DRM channel's context. This sync point ID will be used for command stream forming, like it is done by the current staging UAPI.
So we should retain the DRM_TEGRA_GET_SYNCPT IOCTL, but improve it.
2. An allocated sync point must have a clean hardware state.
3. Sync points should be properly refcounted. A job's sync points shouldn't be reused while the job is alive.
4. A job's sync point can't be reused after the job's submission (UAPI constraint!). Userspace must free the sync point and allocate a new one for the next job submission. And now:
- We know that the job's sync point is always in a healthy state!
- We're not limited by the number of physically available hardware sync points! Allocation should block until a free sync point is available.
- The logical number of the job's sync point increments matches the sync point's hardware state! Which is handy for debugging a job.
Optionally, the job's sync point could be auto-removed from the DRM context after the job's submission, avoiding the need for an extra SYNCPT_PUT IOCTL invocation by userspace after the job's submission. This could be a job flag.
We could avoid the need for statically allocated sync points entirely for patched cmdstreams! The sync point could be dynamically allocated at job submission time by the kernel driver, and the cmdstream would then be patched with this sync point.
On 6/28/20 2:27 AM, Dmitry Osipenko wrote:
- An execution context shouldn't be able to tamper with the sync points of other contexts.
This is covered by this UAPI - when submitting, as part of the syncpt_incr struct you pass the syncpoint FD. This way the driver can check the syncpoints used are correct, or program HW protection.
- Sync point could be shared with other contexts for explicit fencing.
Not sure what you specifically mean; you can get the ID out of the syncpoint fd and share the ID for read-only access. (Or the FD for read-write access)
- Allocated sync point must have a clean hardware state.
What do you mean by clean hardware state?
Optionally, the job's sync point could be auto-removed from the DRM's context after job's submission, avoiding a need for an extra SYNCPT_PUT IOCTL invocation to be done by userspace after the job's submission. Could be a job's flag.
I think this would cause problems where, after a job completes but before its fence has been waited on, the syncpoint is already recycled (especially if the syncpoint is reset into some clean state).
I would prefer having a syncpoint for each userspace channel context (several of which could share a hardware channel if MLOCKing is not used).
In my experience it's then not difficult to pinpoint which job has failed, and if each userspace channel context uses a separate syncpoint, a hanging job wouldn't mess with other application's jobs, either.
Mikko
We could avoid the need for statically allocated sync points entirely for patched cmdstreams! The sync point could be dynamically allocated by the kernel driver at submission time, and the cmdstream would then be patched with this sync point.
28.06.2020 12:44, Mikko Perttunen wrote:
On 6/28/20 2:27 AM, Dmitry Osipenko wrote:
23.06.2020 15:09, Mikko Perttunen wrote:
### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
Allocates a free syncpoint, returning a file descriptor representing it. Only the owner of the file descriptor is allowed to mutate the value of the syncpoint.
```c
struct host1x_ctrl_allocate_syncpoint {
        /**
         * @fd:
         *
         * [out] New file descriptor representing the allocated syncpoint.
         */
        __s32 fd;

        __u32 reserved[3];
};
```
We need at least these basic things from the sync point API:
- An execution context shouldn't be able to tamper with other contexts' sync points.
This is covered by this UAPI - when submitting, as part of the syncpt_incr struct you pass the syncpoint FD. This way the driver can check the syncpoints used are correct, or program HW protection.
- Sync point could be shared with other contexts for explicit fencing.
Not sure what you specifically mean; you can get the ID out of the syncpoint fd and share the ID for read-only access. (Or the FD for read-write access)
I enumerated the overall points that the UAPI should provide to us, just for clarity. It's not that you haven't covered them; sorry for the confusion! :)
Please see more comments below!
- Sync points should work reliably.
Some problems of the current Host1x driver, like falling over when a sync point value is out of sync, plus all the hung-job recovery labor, could easily be reduced if sync point health is protected by extra UAPI constraints. So I think we may want the following:
- We still need to assign a sync point ID to a DRM channel's context. This sync point ID will be used for command stream forming, like it is done by the current staging UAPI.

So we need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but improve it.
My point here is that userspace shouldn't be able to increment the job's sync point using the SYNCPOINT_INCREMENT IOCTL, which is another UAPI constraint.
I'm suggesting that we should have two methods of sync point allocation:
1) Sync point that could be used only by a submitted job.
2) Sync point that could be incremented by CPU.
The first method will allocate a raw sync point ID that is assigned to the channel's context. This ID will be used for tracking the job's completion. Perhaps this method could also optionally return a sync point FD, in case you want another job to wait on this sync point.
We don't need a dedicated sync point FD for all kinds of jobs, do we? For example, I don't see why a sync point FD would be needed in the case of Opentegra jobs.
- Allocated sync point must have a clean hardware state.
What do you mean by clean hardware state?
I mean that a sync point should have a predictable state [1]; it shouldn't accidentally fire during hardware programming, for example.
[1] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syn...
For a submitted job, the job's sync point state could be reset at submission time, for example like I did in the grate-kernel's experimental driver [2].
[2] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/cha...
- Sync points should be properly refcounted. A job's sync points shouldn't be re-used while the job is alive.
- A job's sync point can't be re-used after the job's submission (UAPI constraint!). Userspace must free the sync point and allocate a new one for the next job submission. And now we:
- Know that the job's sync point is always in a healthy state!
- We're not limited by the number of physically available hardware sync points! Allocation should block until a free sync point is available.
- The logical number of the job's sync point increments matches the SP hardware state! Which is handy for debugging a job.

Optionally, the job's sync point could be auto-removed from the DRM context after the job's submission, avoiding the need for an extra SYNCPT_PUT IOCTL invocation by userspace after submission. This could be a job flag.
I think this would cause problems where, after a job completes but before its fence has been waited on, the syncpoint is already recycled (especially if the syncpoint is reset into some clean state).
Exactly, good point! The dma-fence shouldn't be hardwired to the sync point in order to avoid this situation :)
Please take a look at the fence implementation that I made for the grate-driver [3]. The host1x-fence is a dma-fence [4] that is attached to a sync point by host1x_fence_create(). Once the job is completed, the host1x-fence is detached from the sync point [5][6] and the sync point can be recycled safely!
[3] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/fence.c
[4] https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L45...
[5] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syn...
[6] https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi...
Please try to take a closer look at the grate-driver's implementation if you haven't yet. I think we should be able to reuse or improve some of the ideas. That implementation isn't 100% complete; it doesn't cover things like CPU-incremented or exported sync points, for example, but the basics are there.
I would prefer having a syncpoint for each userspace channel context (several of which could share a hardware channel if MLOCKing is not used).
In my experience it's then not difficult to pinpoint which job has failed, and if each userspace channel context uses a separate syncpoint, a hanging job wouldn't mess with other application's jobs, either.
I agree that there shouldn't be any problems with finding which job has hung. The timed-out job is always the hung job, no? :)
Also, please take a look at the DRM scheduler. Once I started to wire up DRM scheduler support in the grate-driver, I realized that there is no real need to try to recover the sync point counter etc., like the current upstream Host1x driver does for a hung job. When a job hangs, the whole channel should be torn down and reset, the job's sync point state reset, the client's HW engine reset, and all the pre-queued jobs re-submitted. And the generic DRM job scheduler helps us with that! It also has other neat features which I haven't tried yet, like job priorities for example.
On 6/29/20 5:36 AM, Dmitry Osipenko wrote:
28.06.2020 12:44, Mikko Perttunen wrote:
On 6/28/20 2:27 AM, Dmitry Osipenko wrote:
23.06.2020 15:09, Mikko Perttunen wrote:
### IOCTL HOST1X_ALLOCATE_SYNCPOINT (on /dev/host1x)
Allocates a free syncpoint, returning a file descriptor representing it. Only the owner of the file descriptor is allowed to mutate the value of the syncpoint.
```c
struct host1x_ctrl_allocate_syncpoint {
        /**
         * @fd:
         *
         * [out] New file descriptor representing the allocated syncpoint.
         */
        __s32 fd;

        __u32 reserved[3];
};
```
We need at least these basic things from the sync point API:
- An execution context shouldn't be able to tamper with other contexts' sync points.
This is covered by this UAPI - when submitting, as part of the syncpt_incr struct you pass the syncpoint FD. This way the driver can check the syncpoints used are correct, or program HW protection.
- Sync point could be shared with other contexts for explicit fencing.
Not sure what you specifically mean; you can get the ID out of the syncpoint fd and share the ID for read-only access. (Or the FD for read-write access)
I enumerated the overall points that the UAPI should provide to us, just for clarity. It's not that you haven't covered them; sorry for the confusion! :)
Please see more comments below!
Ok, good :)
- Sync points should work reliably.
Some problems of the current Host1x driver, like falling over when a sync point value is out of sync, plus all the hung-job recovery labor, could easily be reduced if sync point health is protected by extra UAPI constraints. So I think we may want the following:
- We still need to assign a sync point ID to a DRM channel's context. This sync point ID will be used for command stream forming, like it is done by the current staging UAPI.

So we need to retain the DRM_TEGRA_GET_SYNCPT IOCTL, but improve it.
My point here is that userspace shouldn't be able to increment the job's sync point using the SYNCPOINT_INCREMENT IOCTL, which is another UAPI constraint.
I'm suggesting that we should have two methods of sync point allocation:
1) Sync point that could be used only by a submitted job.
2) Sync point that could be incremented by CPU.
The first method will allocate a raw sync point ID that is assigned to the channel's context. This ID will be used for tracking the job's completion. Perhaps this method could also optionally return a sync point FD, in case you want another job to wait on this sync point.
We don't need a dedicated sync point FD for all kinds of jobs, do we? For example, I don't see why a sync point FD would be needed in the case of Opentegra jobs.
I think it's cleaner if we have just one way to allocate syncpoints, and then those syncpoints can be passed to different things depending on the situation.
If we want to protect direct incrementing while a job is submitted, we could have a locking API where an ongoing job can take a lock refcount in the syncpoint, and incrementing would return -EBUSY.
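The locking idea above could be sketched as follows. This is a minimal userspace-runnable toy, not the proposed implementation: the helper names, the single counter, and the `job_locks` field are all illustrative.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative in-kernel state for one syncpoint. */
struct syncpoint {
        unsigned int value;     /* the 32-bit counter */
        unsigned int job_locks; /* in-flight jobs using this syncpoint */
};

/* A submitted job takes a lock reference; dropped on job completion. */
static void syncpoint_job_lock(struct syncpoint *sp)   { sp->job_locks++; }
static void syncpoint_job_unlock(struct syncpoint *sp) { sp->job_locks--; }

/* CPU-side increment (the SYNCPOINT_INCREMENT IOCTL path): refused while
 * any job holds the syncpoint, so jobs see a predictable counter. */
static int syncpoint_cpu_increment(struct syncpoint *sp)
{
        if (sp->job_locks > 0)
                return -EBUSY;
        sp->value++;
        return 0;
}

/* Scenario: increment once, fail while a job holds the lock, then
 * increment again after the job releases it. */
static unsigned int locking_demo(void)
{
        struct syncpoint sp = { 0, 0 };
        int busy;

        syncpoint_cpu_increment(&sp);        /* ok: value = 1 */
        syncpoint_job_lock(&sp);
        busy = syncpoint_cpu_increment(&sp); /* refused, value unchanged */
        syncpoint_job_unlock(&sp);
        syncpoint_cpu_increment(&sp);        /* ok: value = 2 */

        return busy == -EBUSY ? sp.value : 0;
}
```

The point of the sketch is only the -EBUSY gate: CPU increments and job-driven increments can never interleave on the same syncpoint.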
- Allocated sync point must have a clean hardware state.
What do you mean by clean hardware state?
I mean that a sync point should have a predictable state [1]; it shouldn't accidentally fire during hardware programming, for example.
[1] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syn...
For a submitted job, the job's sync point state could be reset at submission time, for example like I did in the grate-kernel's experimental driver [2].
[2] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/cha...
- Sync points should be properly refcounted. A job's sync points shouldn't be re-used while the job is alive.
- A job's sync point can't be re-used after the job's submission (UAPI constraint!). Userspace must free the sync point and allocate a new one for the next job submission. And now we:
- Know that the job's sync point is always in a healthy state!
- We're not limited by the number of physically available hardware sync points! Allocation should block until a free sync point is available.
- The logical number of the job's sync point increments matches the SP hardware state! Which is handy for debugging a job.

Optionally, the job's sync point could be auto-removed from the DRM context after the job's submission, avoiding the need for an extra SYNCPT_PUT IOCTL invocation by userspace after submission. This could be a job flag.
I think this would cause problems where, after a job completes but before its fence has been waited on, the syncpoint is already recycled (especially if the syncpoint is reset into some clean state).
Exactly, good point! The dma-fence shouldn't be hardwired to the sync point in order to avoid this situation :)
Please take a look at the fence implementation that I made for the grate-driver [3]. The host1x-fence is a dma-fence [4] that is attached to a sync point by host1x_fence_create(). Once the job is completed, the host1x-fence is detached from the sync point [5][6] and the sync point can be recycled safely!
What if the fence has been programmed as a prefence to another channel (that is getting delayed), or to the GPU, or some other accelerator like DLA, or maybe some other VM? Those don't know the dma_fence has been signaled; they can only rely on the syncpoint ID/threshold pair.
[3] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/fence.c
[4] https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L45...
[5] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syn...
[6] https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi...
Please try to take a closer look at the grate-driver's implementation if you haven't yet. I think we should be able to reuse or improve some of the ideas. That implementation isn't 100% complete; it doesn't cover things like CPU-incremented or exported sync points, for example, but the basics are there.
Sure, I'll take a look.
I would prefer having a syncpoint for each userspace channel context (several of which could share a hardware channel if MLOCKing is not used).
In my experience it's then not difficult to pinpoint which job has failed, and if each userspace channel context uses a separate syncpoint, a hanging job wouldn't mess with other application's jobs, either.
I agree that there shouldn't be any problems with finding which job has hung. The timed-out job is always the hung job, no? :)
Also, please take a look at the DRM scheduler. Once I started to wire up DRM scheduler support in the grate-driver, I realized that there is no real need to try to recover the sync point counter etc., like the current upstream Host1x driver does for a hung job. When a job hangs, the whole channel should be torn down and reset, the job's sync point state reset, the client's HW engine reset, and all the pre-queued jobs re-submitted. And the generic DRM job scheduler helps us with that! It also has other neat features which I haven't tried yet, like job priorities for example.
Thanks, I'll take a look at this as well.
Mikko
29.06.2020 13:27, Mikko Perttunen wrote: ...
We don't need a dedicated sync point FD for all kinds of jobs, do we? For example, I don't see why a sync point FD would be needed in the case of Opentegra jobs.
I think it's cleaner if we have just one way to allocate syncpoints, and then those syncpoints can be passed to different things depending on the situation.
If we want to protect direct incrementing while a job is submitted, we could have a locking API where an ongoing job can take a lock refcount in the syncpoint, and incrementing would return -EBUSY.
Okay, let's go with this for now.
29.06.2020 13:27, Mikko Perttunen wrote: ...
- Sync points should be properly refcounted. A job's sync points shouldn't be re-used while the job is alive.
- A job's sync point can't be re-used after the job's submission (UAPI constraint!). Userspace must free the sync point and allocate a new one for the next job submission. And now we:
- Know that the job's sync point is always in a healthy state!
- We're not limited by the number of physically available hardware sync points! Allocation should block until a free sync point is available.
It also occurred to me that if allocation is blocking and there is a need to allocate multiple sync points for a job, then the IOCTL should be able to allocate multiple sync points at once. This prevents a deadlock situation where two contexts could block on each other.
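To illustrate the deadlock-avoidance idea, here is a hedged sketch of an all-or-nothing multi-sync-point allocator. The pool size and all names are made up; a real blocking IOCTL would sleep and retry the whole request as one unit rather than returning -EBUSY to userspace.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define POOL_SIZE 8 /* illustrative number of hardware sync points */

static bool sp_busy[POOL_SIZE];

/* Reserve 'count' free sync points in one atomic step, or reserve none.
 * Two contexts can then never hold partial sets while waiting on each
 * other's leftovers. */
static int syncpt_alloc_many(unsigned int count, unsigned int *ids)
{
        unsigned int found = 0, i;

        for (i = 0; i < POOL_SIZE && found < count; i++)
                if (!sp_busy[i])
                        ids[found++] = i;

        if (found < count)
                return -EBUSY; /* nothing taken; caller may block and retry */

        for (i = 0; i < count; i++)
                sp_busy[ids[i]] = true;
        return 0;
}

/* Scenario: a first context takes 6 of 8 sync points; a request for 4
 * more must fail as a unit instead of grabbing the last 2 and blocking. */
static int alloc_demo(void)
{
        unsigned int a[6], b[4];

        if (syncpt_alloc_many(6, a))
                return 1;
        return syncpt_alloc_many(4, b);
}
```

The design point is that partial grabs never happen, so the classic two-context interlock (each holding what the other needs) cannot occur.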
29.06.2020 13:27, Mikko Perttunen wrote: ...
- A job's sync point can't be re-used after the job's submission (UAPI constraint!). Userspace must free the sync point and allocate a new one for the next job submission. And now we:
- Know that the job's sync point is always in a healthy state!
- We're not limited by the number of physically available hardware sync points! Allocation should block until a free sync point is available.
- The logical number of the job's sync point increments matches the SP hardware state! Which is handy for debugging a job.

Optionally, the job's sync point could be auto-removed from the DRM context after the job's submission, avoiding the need for an extra SYNCPT_PUT IOCTL invocation by userspace after submission. This could be a job flag.
I think this would cause problems where, after a job completes but before its fence has been waited on, the syncpoint is already recycled (especially if the syncpoint is reset into some clean state).
Exactly, good point! The dma-fence shouldn't be hardwired to the sync point in order to avoid this situation :)
Please take a look at the fence implementation that I made for the grate-driver [3]. The host1x-fence is a dma-fence [4] that is attached to a sync point by host1x_fence_create(). Once the job is completed, the host1x-fence is detached from the sync point [5][6] and the sync point can be recycled safely!
What if the fence has been programmed as a prefence to another channel (that is getting delayed), or to the GPU, or some other accelerator like DLA, or maybe some other VM? Those don't know the dma_fence has been signaled; they can only rely on the syncpoint ID/threshold pair.
The job's explicit fence is always just a dma-fence; it's not tied to a host1x-fence, and it should be waited on for a signal on the CPU.
If you want to make job to wait for a sync point on hardware, then you should use the drm_tegra_submit_command wait-command.
Again, please notice that the DRM scheduler supports a job-submitted fence! This dma-fence will signal once the job is pushed to hardware for execution, so it shouldn't be a problem to maintain job ordering for complex jobs without much hassle. We'll need to write some userspace to check how it works in practice :) For now I've really only had experience with simple jobs.
Secondly, I suppose neither GPU, nor DLA could wait on a host1x sync point, correct? Or are they integrated with Host1x HW?
Anyways, it shouldn't be difficult to resolve dma-fence into host1x-fence, get SP ID and maintain the SP's liveness. Please see more below.
In the grate-driver I made all sync points refcounted, so a sync point won't be recycled while it has active users [1][2][3] in the kernel (or userspace).
[1] https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L42...
[2] https://github.com/grate-driver/linux/blob/master/include/linux/host1x.h#L12...
[3] https://github.com/grate-driver/linux/blob/master/drivers/gpu/host1x/soc/syn...
Now, grate-kernel isn't a 100% complete implementation, as I already mentioned before. The host1x-fence doesn't have a reference to a sync point, as you may see in the code; this is because userspace sync points are not implemented in the grate-driver.
But nothing stops us from adding that SP reference, and then we could simply do the following in the code:
```c
struct dma_fence *host1x_fence_create(syncpt, ...)
{
        ...
        fence->sp = syncpt;
        ...
        return &fence->base;
}

void host1x_syncpt_signal_fence(struct host1x_fence *fence)
{
        ...
        fence->sp = NULL;
}

irqreturn_t host1x_hw_syncpt_isr()
{
        spin_lock(&host1x_syncpts_lock);
        ...
        host1x_syncpt_signal_fence(sp->fence);
        ...
        spin_unlock(&host1x_syncpts_lock);
}

void host1x_submit_job(job)
{
        ...
        spin_lock_irqsave(&host1x_syncpts_lock);
        sp = host1x_syncpt_get(host1x_fence->sp);
        spin_unlock_irqrestore(&host1x_syncpts_lock);
        ...
        if (sp) {
                push(WAIT(sp->id, host1x_fence->threshold));
                job->sync_points = sp;
        }
}

void host1x_free_job(job)
{
        host1x_syncpt_put(job->sync_points);
        ...
}
```
For example: if you share a host1x-fence (dma-fence) with a GPU context, then the fence's SP won't be released until the GPU's context is done with the SP. The GPU's job will time out if the shared SP doesn't get incremented, and that is a totally okay situation.
Does this answer your question?
===
I'm not familiar with the Host1x VMs, so please let me use my imagination here:
In the case of a VM we could have a special VM-shared sync point type. Userspace would allocate this special VM SP using ALLOCATE_SYNCPOINT, and this SP wouldn't be used for a job(!). This is the case where a job needs to increment multiple sync points: its own SP + the VM's SP. If the job hangs, then there should be a way to tell the VM to release the SP and try again next time with a freshly allocated SP. The shared SP should stay alive as long as the VM uses it, so there should be a way for the VM to tell that it's done with the SP.
Alternatively, we could add SP recovery (or whatever is needed) for the VM, but this should be kept specific to T194+. Older Tegras shouldn't ever need this complexity, if I'm not missing anything.
Please provide detailed information about the VM workflow if the above doesn't sound good.
Perhaps we shouldn't focus on VM support for now, but we may leave some room for potential future expansion if necessary.
On 6/29/20 10:42 PM, Dmitry Osipenko wrote:
Secondly, I suppose neither GPU, nor DLA could wait on a host1x sync point, correct? Or are they integrated with Host1x HW?
They can access syncpoints directly. (That's what I alluded to in the "Introduction to the hardware" section :) all those things have hardware access to syncpoints)
.. rest ..
Let me try to summarize once more for my own understanding:
* When submitting a job, you would allocate new syncpoints for the job
* After submitting the job, those syncpoints are not usable anymore
* Postfences of that job would keep references to those syncpoints so they aren't freed and cleared before the fences have been released
* Once postfences have been released, syncpoints would be returned to the pool and reset to zero
The advantage of this would be that at any point in time, there would be a 1:1 correspondence between allocated syncpoints and jobs; so you could shuffle the jobs around channels or reorder them.
Please correct if I got that wrong :)
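That lifecycle can be sketched as a userspace-runnable toy. A single refcount stands in for the job reference plus the postfence reference, and all names are illustrative rather than proposed UAPI.

```c
#include <assert.h>

struct syncpoint {
        unsigned int value; /* 32-bit counter */
        unsigned int refs;  /* job + postfence references */
        int free;           /* back in the allocation pool? */
};

static void syncpt_get(struct syncpoint *sp) { sp->refs++; }

/* Dropping the last reference returns the syncpoint to the pool and
 * resets it to zero, matching the summary above. */
static void syncpt_put(struct syncpoint *sp)
{
        if (--sp->refs == 0) {
                sp->value = 0;
                sp->free = 1;
        }
}

/* Scenario: the job completes before its postfence is released. */
static int lifecycle_demo(void)
{
        struct syncpoint sp = { 0, 0, 0 };

        syncpt_get(&sp); /* taken at job submission */
        syncpt_get(&sp); /* taken by the job's postfence */

        sp.value = 3;    /* job runs and increments the counter */
        syncpt_put(&sp); /* job completes, drops its reference */

        if (sp.free || sp.value != 3)
                return 0; /* must not be recycled while the fence lives */

        syncpt_put(&sp); /* postfence released */
        return sp.free && sp.value == 0;
}
```

The invariant being demonstrated is the 1:1 job/syncpoint correspondence: the counter survives untouched exactly as long as anything still referencing the job's fences might read it.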
---
I have two concerns:
* A lot of churn on syncpoints - any time you submit a job you might not get a syncpoint for an indefinite time. If we allocate syncpoints up-front, at least you know beforehand, and then you have the syncpoint as long as you need it.
* Plumbing the dma-fence/sync_file everywhere, and keeping it alive until waits on it have completed, is more work than just having the ID/threshold. This is probably mainly a problem for downstream, where updating code for this would be difficult. I know that's not a proper argument, but I hope we can reach something that works for both worlds.
Here's a proposal in between:
* Keep syncpoint allocation and submission in jobs as in my original proposal
* Don't attempt to recover user channel contexts. What this means:
  * If we have a hardware channel per context (MLOCKing), just tear down the channel
  * Otherwise, we can just remove (either by patching or by full teardown/resubmit of the channel) all jobs submitted by the user channel context that submitted the hanging job. Jobs of other contexts would be undisturbed (though potentially delayed, which could be taken into account and timeouts adjusted)
* If this happens, we can set removed jobs' post-fences to error status and the user will have to resubmit them.
* We should be able to keep the syncpoint refcounting based on fences.
This can be made more fine-grained by not caring about the user channel context, but tearing down all jobs with the same syncpoint. I think the result would be that we can get either what you described (or how I understood it in the summary in the beginning of the message), or a more traditional syncpoint-per-userctx workflow, depending on how the userspace decides to allocate syncpoints.
If needed, the kernel can still do e.g. reordering (you mentioned job priorities) at syncpoint granularity, which, if the userspace followed the model you described, would be the same thing as job granularity.
(Maybe it would be more difficult with the current drm_scheduler; sorry, I haven't had the time yet to read up on that. Dealing with clearing up work stuff before summer vacation.)
Mikko
30.06.2020 13:26, Mikko Perttunen wrote:
On 6/29/20 10:42 PM, Dmitry Osipenko wrote:
Secondly, I suppose neither GPU, nor DLA could wait on a host1x sync point, correct? Or are they integrated with Host1x HW?
They can access syncpoints directly. (That's what I alluded to in the "Introduction to the hardware" section :) all those things have hardware access to syncpoints)
Should we CC all the Nouveau developers then, or is it a bit too early? :)
.. rest ..
Let me try to summarize once more for my own understanding:
- When submitting a job, you would allocate new syncpoints for the job
- Yes
- After submitting the job, those syncpoints are not usable anymore
- Yes
Although, thinking a bit more about it, this needs to be relaxed.
It should be a userspace agreement/policy on how to utilize sync points.

For example, if we know that userspace will have multiple application instances all using the Tegra DRM UAPI, like Mesa or VDPAU drivers, then that userspace should consider returning sync points to the pool to share them with others. Meanwhile, something like the Opentegra Xorg driver, which usually has a single instance, could keep sync points pre-allocated.

The job's sync point counter will be reset to 0 by the kernel driver during the submission process for each job, so we won't have the sync point recovery problem.
- Postfences of that job would keep references to those syncpoints so they aren't freed and cleared before the fences have been released
- No
I suggested that the fence shouldn't refcount the sync point and should *only* have a reference to it; this reference is invalidated once the fence is signaled by the sync point reaching the threshold, or once the sync point is released.
The sync point will have a reference to every active fence (waiting for the signal) that is using this sync point until the threshold is reached.
So fence could detach itself from the sync point + sync point could detach all the fences from itself.
There is more about this below; please see the example with a dead process at the end of this email.
- Once postfences have been released, syncpoints would be returned to the pool and reset to zero
- No
I'm suggesting that the sync point should be returned to the pool once its usage refcount reaches 0. This means that both the userspace that created the sync point and the executed job keep the sync point alive, until it is closed by userspace and the job is completed.
The advantage of this would be that at any point in time, there would be a 1:1 correspondence between allocated syncpoints and jobs; so you could shuffle the jobs around channels or reorder them.
- Yes
Please correct if I got that wrong :)
I have two concerns:
- A lot of churn on syncpoints - any time you submit a job you might not get a syncpoint for an indefinite time. If we allocate syncpoints up-front at least you know beforehand, and then you have the syncpoint as long as you need it.
If you have a lot of active application instances all allocating sync points, then inevitably the sync point pool will be exhausted.

But my proposal doesn't differ from yours in this regard, correct?
And maybe there is a nice solution, please see more below!
- Plumbing the dma-fence/sync_file everywhere, and keeping it alive until waits on it have completed, is more work than just having the ID/threshold. This is probably mainly a problem for downstream, where updating code for this would be difficult. I know that's not a proper argument but I hope we can reach something that works for both worlds.
You could have ID/threshold! :)
But you can't use the *job's* ID/threshold, because you won't know them until the kernel driver's scheduler has *completed(!)* the job's execution! The job may be re-pushed to a recovered channel multiple times by the scheduler if a previous job hangs!
Now, you could allocate *two* sync points:
1. For the job itself (job's sync point).
2. For the userspace to wait (user's sync point).
The job will have to increment both of these sync points (an example of multiple sync point usage), and you know the user's sync point ID/threshold!

If the job times out, you *could* increment the user's sync point from userspace on the CPU!

The counter of the user's sync point won't be touched by the kernel driver if the job hangs!
Here's a proposal in between:
- Keep syncpoint allocation and submission in jobs as in my original proposal
Yes, we could keep it.
But, as I suggested in my other email, we may want to extend the allocation IOCTL to support multi-sync-point allocation.

Secondly, if we want multi-sync-point support for jobs, then we may want to improve the SUBMIT IOCTL like this:
```c
struct drm_tegra_channel_submit {
        __u32 num_usr_syncpt_incrs;
        __u64 usr_sync_points_ptr;

        __u32 num_job_syncpt_incrs;
        __u32 job_syncpt_handle;
};
```
If a job doesn't need to increment user sync points, then there is no need to copy them from userspace; num_usr_syncpt_incrs is simply 0, i.e. one less copy_from_user() operation.
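The zero-count fast path could look like the sketch below. It runs in userspace, so fake_copy_from_user() stands in for the real copy_from_user(), and struct syncpt_incr is a hypothetical element type for the usr_sync_points_ptr array; none of the helper names are proposed UAPI.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical element of the usr_sync_points_ptr array. */
struct syncpt_incr { uint32_t id; uint32_t num_incrs; };

static int copies_done; /* counts simulated copy_from_user() calls */

/* Userspace stand-in for copy_from_user(); returns 0 on success. */
static int fake_copy_from_user(void *dst, const void *src, size_t n)
{
        copies_done++;
        memcpy(dst, src, n);
        return 0;
}

/* Only touch the user pointer when num_usr_syncpt_incrs is nonzero, so
 * the common case pays for no copy at all. */
static int copy_user_incrs(uint32_t num, const struct syncpt_incr *uptr,
                           struct syncpt_incr *out)
{
        if (num == 0)
                return 0;
        if (fake_copy_from_user(out, uptr, num * sizeof(*out)))
                return -EFAULT;
        return 0;
}

/* Scenario: a submit with no user increments does no copy; a submit
 * with two increments does exactly one. */
static int copy_demo(void)
{
        struct syncpt_incr in[2] = { { 1, 2 }, { 3, 4 } }, out[2];

        copy_user_incrs(0, NULL, out);
        if (copies_done != 0)
                return -1;
        copy_user_incrs(2, in, out);
        return copies_done;
}
```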
- Don't attempt to recover user channel contexts. What this means:
* If we have a hardware channel per context (MLOCKing), just tear down the channel
!!!
Hmm, we actually should be able to have one sync point per channel for job submission, similarly to what the current driver does!

I keep forgetting that waitbases exist!
Please read more below.
* Otherwise, we can just remove (either by patching or by full teardown/resubmit of the channel) all jobs submitted by the user channel context that submitted the hanging job. Jobs of other contexts would be undisturbed (though potentially delayed, which could be taken into account and timeouts adjusted)
The DRM scheduler itself has an assumption/requirement that when a channel hangs, it must be fully reset. The hung job will be killed by the scheduler (maybe dependent jobs will be killed too, but I don't remember the details right now) and the scheduler will then re-submit jobs to the recovered channel [1].
[1] https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi...
Hence, if we could assign a sync point per channel, then during the channel's recovery the channel's sync point will be reset as well! Only the waitbases of the re-submitted jobs will differ!

It also means that userspace won't need to allocate a sync point for each job!
So far it sounds great! I'll try to think more thoroughly about this.
- If this happens, we can set removed jobs' post-fences to error status and the user will have to resubmit them.
- We should be able to keep the syncpoint refcounting based on fences.
The fence doesn't need the sync point itself; it only needs to get a signal when the threshold is reached or when the sync point ceases to exist.
Imagine:
- Process A creates a sync point
- Process A creates a dma-fence from this sync point
- Process A exports the dma-fence to process B
- Process A dies
What should happen to process B?
- Should process B's dma-fence get an error signal when process A dies?
- Should process B get stuck waiting endlessly for the dma-fence?
This is one example of why I'm proposing that the fence shouldn't be tightly coupled to a sync point.
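One possible answer to the dead-process scenario can be sketched as a toy: when the sync point goes away, any still-attached fence is detached and signaled with an error, so process B neither blocks forever nor pins the sync point. All names are invented, and a single pending fence per sync point is assumed for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct syncpoint;

/* A fence only back-references its sync point; the link is severed when
 * the fence signals, so the sync point can be recycled afterwards. */
struct fence {
        int signaled;
        int error;
        struct syncpoint *sp;
};

struct syncpoint {
        struct fence *pending; /* one pending waiter, for brevity */
};

static void fence_signal(struct fence *f, int error)
{
        f->signaled = 1;
        f->error = error;
        f->sp = NULL; /* detach: syncpoint may now be recycled */
}

/* Called when the owner (process A) closes the sync point FD or dies. */
static void syncpoint_release(struct syncpoint *sp)
{
        if (sp->pending)
                fence_signal(sp->pending, -ECANCELED);
        sp->pending = NULL;
}

/* Scenario: A exports a fence to B, then dies before incrementing. */
static int release_demo(void)
{
        struct syncpoint sp = { NULL };
        struct fence f = { 0, 0, &sp };

        sp.pending = &f;
        syncpoint_release(&sp);
        return f.signaled && f.error == -ECANCELED && f.sp == NULL;
}
```

The error choice (-ECANCELED) is purely illustrative; the point is only that the fence gets *some* terminal signal instead of waiting forever.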
This can be made more fine-grained by not caring about the user channel context, but tearing down all jobs with the same syncpoint. I think the result would be that we can get either what you described (or how I understood it in the summary in the beginning of the message), or a more traditional syncpoint-per-userctx workflow, depending on how the userspace decides to allocate syncpoints.
If needed, the kernel can still do e.g. reordering (you mentioned job priorities) at syncpoint granularity, which, if the userspace followed the model you described, would be the same thing as job granularity.
(Maybe it would be more difficult with current drm_scheduler, sorry, haven't had the time yet to read up on that. Dealing with clearing work stuff up before summer vacation)
Please take yours time! You definitely will need take a closer look at the DRM scheduler.
On 7/1/20 3:22 AM, Dmitry Osipenko wrote:
30.06.2020 13:26, Mikko Perttunen пишет:
On 6/29/20 10:42 PM, Dmitry Osipenko wrote:
Secondly, I suppose that neither the GPU nor the DLA could wait on a Host1x sync point, correct? Or are they integrated with the Host1x HW?
They can access syncpoints directly. (That's what I alluded to in the "Introduction to the hardware" section :) all those things have hardware access to syncpoints)
Should we CC all the Nouveau developers then, or is it a bit too early? :)
I think we have a few other issues still to resolve before that :)
.. rest ..
Let me try to summarize once more for my own understanding:
- When submitting a job, you would allocate new syncpoints for the job
- Yes
- After submitting the job, those syncpoints are not usable anymore
- Yes
Although, thinking a bit more about it, this needs to be relaxed.
How sync points are utilized should be a userspace agreement/policy.
For example, if we know that userspace will have multiple application instances all using the Tegra DRM UAPI, like the Mesa or VDPAU drivers, then such userspace should consider returning sync points to the pool in order to share them with others. Meanwhile, something like the Opentegra Xorg driver, which usually has a single instance, could keep sync points pre-allocated.
The job's sync point counter will be reset to 0 by the kernel driver during the submission process for each job, so we won't have the sync point recovery problem.
- Postfences of that job would keep references to those syncpoints so
they aren't freed and cleared before the fences have been released
- No
I suggested that the fence shouldn't refcount the sync point and should *only* hold a reference to it; this reference will be invalidated once the fence is signaled by the sync point reaching the threshold, or once the sync point is released.
The sync point will have a reference to every active fence (waiting for the signal) that is using this sync point until the threshold is reached.
So a fence could detach itself from the sync point, and a sync point could detach all of its fences from itself.
There will be more about this below; please see the example with a dead process at the end of this email.
- Once postfences have been released, syncpoints would be returned to
the pool and reset to zero
- No
I'm suggesting that the sync point should be returned to the pool once its usage refcount reaches 0. This means that both the userspace that created the sync point and the executed job will keep it alive, until it is closed by userspace and the job is completed.
The advantage of this would be that at any point in time, there would be a 1:1 correspondence between allocated syncpoints and jobs; so you could shuffle the jobs around channels or reorder them.
- Yes
Please correct if I got that wrong :)
I have two concerns:
- A lot of churn on syncpoints - any time you submit a job you might not
get a syncpoint for an indefinite time. If we allocate syncpoints up-front at least you know beforehand, and then you have the syncpoint as long as you need it.
If you have a lot of active application instances all allocating sync points, then the sync point pool will inevitably be exhausted.
But my proposal doesn't differ from yours in this regard, correct?
And maybe there is a nice solution; please see more below!
- Plumbing the dma-fence/sync_file everywhere, and keeping it alive
until waits on it have completed, is more work than just having the ID/threshold. This is probably mainly a problem for downstream, where updating code for this would be difficult. I know that's not a proper argument but I hope we can reach something that works for both worlds.
You could have ID/threshold! :)
But you can't use the *job's* ID/threshold, because you won't know them until the kernel driver's scheduler *completes(!)* the job's execution! The job may be re-pushed to the recovered channel multiple times by the scheduler if a previous job hangs!
Now, you could allocate *two* sync points:
1. One for the job itself (the job's sync point).
2. One for userspace to wait on (the user's sync point).
The job will have to increment both of these sync points (an example of multiple sync point usage), and you will know the user's sync point ID/threshold!
If the job times out, you *could* increment the user's sync point from userspace on the CPU!
The counter of the user's sync point won't be touched by the kernel driver if the job hangs!
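As a rough model of this two-sync-point scheme (a minimal sketch; all structures and function names here are illustrative, not part of any real UAPI):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model: the job increments both its own tracking sync
 * point and a user sync point; on a timeout, userspace increments the
 * user sync point from the CPU so waiters still make progress. */
struct syncpt { uint32_t value; };

static void syncpt_incr(struct syncpt *sp) { sp->value++; }

/* Run a job that does one increment on each sync point when it
 * completes; a hung job touches neither counter. */
static void run_job(struct syncpt *job_sp, struct syncpt *user_sp, bool hangs)
{
    if (!hangs) {
        syncpt_incr(job_sp);   /* kernel job-tracking increment */
        syncpt_incr(user_sp);  /* user-visible increment */
    }
}

/* Recovery path: CPU-side increment so the user's ID/threshold pair
 * still signals even though the engine never ran the job. */
static void recover_timed_out_job(struct syncpt *user_sp)
{
    syncpt_incr(user_sp);
}
```

The user's threshold stays valid in both the success and the timeout path, which is the point of splitting the two roles.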
Ok, so we would have two kinds of syncpoints for the job; one for kernel job tracking; and one that userspace can manipulate as it wants to.
Could we handle the job tracking syncpoint completely inside the kernel, i.e. allocate it in kernel during job submission, and add an increment for it at the end of the job (with condition OP_DONE)? For MLOCKing, the kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT + MLOCK_RELEASE sequence at the end of each job.
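A rough sketch of the kernel appending such a tail sequence to a pushbuffer (the opcode encodings below are placeholders for illustration, not the real Host1x opcode formats):

```c
#include <stdint.h>

/* Placeholder opcode tags; real Host1x encodings differ per SoC. */
enum { OP_SYNCPT_INCR_DONE = 1, OP_WAIT = 2, OP_MLOCK_RELEASE = 3 };

struct pushbuf { uint32_t words[64]; unsigned len; };

static void pb_push(struct pushbuf *pb, uint32_t word)
{
    pb->words[pb->len++] = word;
}

/* Append the job-tracking increment (with OP_DONE condition), a wait
 * for it, and the MLOCK release at the end of each job. */
static void pb_append_job_tail(struct pushbuf *pb, uint32_t syncpt_id,
                               uint32_t threshold)
{
    pb_push(pb, (uint32_t)OP_SYNCPT_INCR_DONE << 28 | syncpt_id);
    pb_push(pb, (uint32_t)OP_WAIT << 28 | syncpt_id);
    pb_push(pb, threshold);
    pb_push(pb, (uint32_t)OP_MLOCK_RELEASE << 28);
}
```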
Here's a proposal in between:
- Keep syncpoint allocation and submission in jobs as in my original
proposal
Yes, we could keep it.
But, as I suggested in my other email, we may want to extend the allocation IOCTL to support multi-syncpoint allocation.
Secondly, if we want multi-syncpoint support for jobs, then we may want to improve the SUBMIT IOCTL like this:
struct drm_tegra_channel_submit {
        __u32 num_usr_syncpt_incrs;
        __u64 usr_sync_points_ptr;

        __u32 num_job_syncpt_incrs;
        __u32 job_syncpt_handle;
};
If a job doesn't need to increment the user's sync points, then there is no need to copy them from userspace, hence num_usr_syncpt_incrs should be 0, i.e. one less copy_from_user() operation.
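A minimal sketch of that conditional copy on the kernel side (using a plain memcpy as a stand-in for copy_from_user(), with illustrative struct and function names):

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-syncpoint increment descriptor passed via
 * usr_sync_points_ptr; the layout is illustrative only. */
struct usr_syncpt_incr { uint32_t syncpt_handle; uint32_t num_incrs; };

/* Skip the copy entirely when there are no user syncpoint increments,
 * saving one copy_from_user() on the submit fast path. */
static int copy_usr_incrs(struct usr_syncpt_incr *dst,
                          const struct usr_syncpt_incr *user_ptr,
                          uint32_t num_usr_syncpt_incrs)
{
    if (num_usr_syncpt_incrs == 0)
        return 0; /* nothing to copy */
    if (!user_ptr)
        return -EFAULT;
    memcpy(dst, user_ptr, num_usr_syncpt_incrs * sizeof(*dst));
    return 0;
}
```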
- Don't attempt to recover user channel contexts. What this means:
* If we have a hardware channel per context (MLOCKing), just tear down the channel
!!!
Hmm, we actually should be able to have one sync point per channel for job submission, similarly to what the current driver does!
I keep forgetting that waitbases exist!
Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs anyway, can't we just recalculate wait thresholds at that time?
Maybe a more detailed sequence list or diagram of what happens during submission and recovery would be useful.
Please read more below.
* Otherwise, we can just remove (either by patching or by full teardown/resubmit of the channel) all jobs submitted by the user channel context that submitted the hanging job. Jobs of other contexts would be undisturbed (though potentially delayed, which could be taken into account and timeouts adjusted)
The DRM scheduler itself has an assumption/requirement that when a channel hangs, it must be fully reset. The hung job will be killed by the scheduler (maybe dependent jobs will be killed too, but I don't remember the details right now) and then the scheduler will re-submit jobs to the recovered channel [1].
[1] https://github.com/grate-driver/linux/blob/master/drivers/gpu/drm/tegra/uapi...
Hence, if we could assign a sync point per channel, then during the channel's recovery the channel's sync point would be reset as well! Only the waitbases of the re-submitted jobs would differ!
It also means that userspace won't need to allocate a sync point for each job!
So far it sounds great! I'll try to think more thoroughly about this.
- If this happens, we can set removed jobs' post-fences to error status
and the user will have to resubmit them.
- We should be able to keep the syncpoint refcounting based on fences.
The fence doesn't need the sync point itself; it only needs to get a signal when the threshold is reached or when the sync point is released.
Imagine:
- Process A creates sync point
- Process A creates dma-fence from this sync point
- Process A exports dma-fence to process B
- Process A dies
What should happen to process B?
- Should process B's dma-fence get an error signal when process A dies?
- Should process B get stuck waiting endlessly for the dma-fence?
This is one example of why I'm proposing that the fence shouldn't be tightly coupled to a sync point.
As a baseline, we should consider process B to get stuck endlessly (until a timeout of its choosing) for the fence. In this case it is avoidable, but if the ID/threshold pair is exported out of the fence and is waited for otherwise, it is unavoidable. I.e. once the ID/threshold are exported out of a fence, the waiter can only see the fence being signaled by the threshold being reached, not by the syncpoint getting freed.
This can be made more fine-grained by not caring about the user channel context, but tearing down all jobs with the same syncpoint. I think the result would be that we can get either what you described (or how I understood it in the summary in the beginning of the message), or a more traditional syncpoint-per-userctx workflow, depending on how the userspace decides to allocate syncpoints.
If needed, the kernel can still do e.g. reordering (you mentioned job priorities) at syncpoint granularity, which, if the userspace followed the model you described, would be the same thing as job granularity.
(Maybe it would be more difficult with current drm_scheduler, sorry, haven't had the time yet to read up on that. Dealing with clearing work stuff up before summer vacation)
Please take your time! You will definitely need to take a closer look at the DRM scheduler.
02.07.2020 15:10, Mikko Perttunen пишет:
Ok, so we would have two kinds of syncpoints for the job; one for kernel job tracking; and one that userspace can manipulate as it wants to.
Could we handle the job tracking syncpoint completely inside the kernel, i.e. allocate it in kernel during job submission, and add an increment for it at the end of the job (with condition OP_DONE)? For MLOCKing, the kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT + MLOCK_RELEASE sequence at the end of each job.
If the sync point is allocated within the kernel, then we'll always need to patch all of the job's sync point increments with the ID of the allocated sync point, regardless of whether the firewall is enabled or not.
Secondly, I now recall that only one sync point can be assigned to a channel at a time on newer Tegras that support sync point protection. So it sounds like we don't really have any option other than allocating one sync point per channel for job usage, if we want to be able to push multiple jobs into a channel's pushbuffer, correct?
...
Hmm, we actually should be able to have one sync point per channel for job submission, similarly to what the current driver does!
I keep forgetting that waitbases exist!
Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs anyway, can't we just recalculate wait thresholds at that time?
Yes, thresholds could be recalculated, and the job should then be re-formed at push time.
It also means that jobs should always be formed only at push time if a wait command is used in the cmdstream, since the waits always need to be patched because we won't know the thresholds until the scheduler actually runs the job.
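The threshold recalculation itself is simple with unsigned modulo-2^32 arithmetic, matching the syncpoint wait semantics; a sketch (the function name is illustrative):

```c
#include <stdint.h>

/* Re-base a job's wait threshold at push time: the job's relative
 * increment count (old_threshold - old_base) is preserved, but it is
 * applied on top of the sync point's current value. Unsigned wrap
 * makes this correct across the 2^32 boundary. */
static uint32_t rebase_threshold(uint32_t old_base, uint32_t old_threshold,
                                 uint32_t new_base)
{
    return new_base + (old_threshold - old_base);
}
```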
Maybe a more detailed sequence list or diagram of what happens during submission and recovery would be useful.
The textual form + code is already good enough for me. A diagram would be nice to have, although it may take a bit too much effort to create and maintain. But I don't mind at all if you'd like to make one :)
...
- We should be able to keep the syncpoint refcounting based on fences.
The fence doesn't need the sync point itself; it only needs to get a signal when the threshold is reached or when the sync point is released.
Imagine:
- Process A creates sync point
- Process A creates dma-fence from this sync point
- Process A exports dma-fence to process B
- Process A dies
What should happen to process B?
- Should process B's dma-fence get an error signal when process A dies?
- Should process B get stuck waiting endlessly for the dma-fence?
This is one example of why I'm proposing that the fence shouldn't be tightly coupled to a sync point.
As a baseline, we should consider process B to get stuck endlessly (until a timeout of its choosing) for the fence. In this case it is avoidable, but if the ID/threshold pair is exported out of the fence and is waited for otherwise, it is unavoidable. I.e. once the ID/threshold are exported out of a fence, the waiter can only see the fence being signaled by the threshold being reached, not by the syncpoint getting freed.
This is correct. If the sync point's FD is exported, or once the sync point is resolved from a dma-fence, the sync point will stay alive until the last reference to it is dropped. I.e. if process A dies *after* process B has started to wait on the sync point, then B will get stuck.
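For reference, the wraparound-aware "has the threshold been reached" check implied by this baseline, matching the modulo-2^32 wait semantics described for syncpoints, could look like this sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Once only an ID/threshold pair is known, "signaled" can only mean
 * the counter is at or past the threshold, interpreted within a 2^31
 * window; the waiter cannot observe the sync point being freed and
 * must rely on its own timeout. */
static bool syncpt_reached(uint32_t value, uint32_t threshold)
{
    return (int32_t)(value - threshold) >= 0;
}
```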
On 7/7/20 2:06 PM, Dmitry Osipenko wrote:
02.07.2020 15:10, Mikko Perttunen пишет:
Ok, so we would have two kinds of syncpoints for the job; one for kernel job tracking; and one that userspace can manipulate as it wants to.
Could we handle the job tracking syncpoint completely inside the kernel, i.e. allocate it in kernel during job submission, and add an increment for it at the end of the job (with condition OP_DONE)? For MLOCKing, the kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT + MLOCK_RELEASE sequence at the end of each job.
If the sync point is allocated within the kernel, then we'll always need to patch all of the job's sync point increments with the ID of the allocated sync point, regardless of whether the firewall is enabled or not.
The idea was that the job tracking increment would also be added to the pushbuffer in the kernel, so gathers would only have increments for the "user syncpoints", if any. I think this should work for THI-based engines (e.g. VIC), you probably have better information about GR2D/GR3D. On newer Tegras we could use CHANNEL/APPID protection to prevent the gathers from incrementing these job tracking syncpoints.
Secondly, I now recall that only one sync point can be assigned to a channel at a time on newer Tegras that support sync point protection. So it sounds like we don't really have any option other than allocating one sync point per channel for job usage, if we want to be able to push multiple jobs into a channel's pushbuffer, correct?
The other way around; each syncpoint can be assigned to one channel. One channel may have multiple syncpoints.
...
Hmm, we actually should be able to have one sync point per channel for job submission, similarly to what the current driver does!
I keep forgetting that waitbases exist!
Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs anyway, can't we just recalculate wait thresholds at that time?
Yes, thresholds could be recalculated, and the job should then be re-formed at push time.
It also means that jobs should always be formed only at push time if a wait command is used in the cmdstream, since the waits always need to be patched because we won't know the thresholds until the scheduler actually runs the job.
Could we keep the job tracking syncpoints entirely within the kernel, and have all wait commands and other stuff that userspace does use the user syncpoints? Then kernel job tracking and userspace activity would be separate from each other.
Alternatively, if we know that jobs can only be removed from the middle of pushbuffers, and not added, we could replace any removed jobs with syncpoint increments in the pushbuffer and any thresholds would stay intact.
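The invariant behind this alternative can be modeled as follows (a toy sketch with illustrative names: a removed job keeps contributing the same number of syncpoint increments, so later jobs' thresholds stay valid):

```c
#include <stdint.h>

struct sim_job { uint32_t num_work_cmds; uint32_t num_incrs; };

/* Tear a job out of the middle of the pushbuffer: drop its work
 * commands but replace them so the same increments still occur. */
static void remove_job(struct sim_job *job)
{
    job->num_work_cmds = 0; /* increments are kept as-is */
}

/* The final sync point value depends only on the increments, so it is
 * unchanged by removals, and downstream thresholds don't shift. */
static uint32_t final_syncpt_value(const struct sim_job *jobs, unsigned count)
{
    uint32_t value = 0;
    for (unsigned i = 0; i < count; i++)
        value += jobs[i].num_incrs;
    return value;
}
```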
Maybe a more detailed sequence list or diagram of what happens during submission and recovery would be useful.
The textual form + code is already good enough to me. A diagram could be nice to have, although it may take a bit too much effort to create + maintain it. But I don't mind at all if you'd want to make one :)
...
- We should be able to keep the syncpoint refcounting based on fences.
The fence doesn't need the sync point itself; it only needs to get a signal when the threshold is reached or when the sync point is released.
Imagine:
- Process A creates sync point
- Process A creates dma-fence from this sync point
- Process A exports dma-fence to process B
- Process A dies
What should happen to process B?
- Should process B's dma-fence get an error signal when process A dies?
- Should process B get stuck waiting endlessly for the dma-fence?
This is one example of why I'm proposing that the fence shouldn't be tightly coupled to a sync point.
As a baseline, we should consider process B to get stuck endlessly (until a timeout of its choosing) for the fence. In this case it is avoidable, but if the ID/threshold pair is exported out of the fence and is waited for otherwise, it is unavoidable. I.e. once the ID/threshold are exported out of a fence, the waiter can only see the fence being signaled by the threshold being reached, not by the syncpoint getting freed.
This is correct. If the sync point's FD is exported, or once the sync point is resolved from a dma-fence, the sync point will stay alive until the last reference to it is dropped. I.e. if process A dies *after* process B has started to wait on the sync point, then B will get stuck.
08.07.2020 13:06, Mikko Perttunen пишет:
On 7/7/20 2:06 PM, Dmitry Osipenko wrote:
02.07.2020 15:10, Mikko Perttunen пишет:
Ok, so we would have two kinds of syncpoints for the job; one for kernel job tracking; and one that userspace can manipulate as it wants to.
Could we handle the job tracking syncpoint completely inside the kernel, i.e. allocate it in kernel during job submission, and add an increment for it at the end of the job (with condition OP_DONE)? For MLOCKing, the kernel already needs to insert a SYNCPT_INCR(OP_DONE) + WAIT + MLOCK_RELEASE sequence at the end of each job.
If the sync point is allocated within the kernel, then we'll always need to patch all of the job's sync point increments with the ID of the allocated sync point, regardless of whether the firewall is enabled or not.
The idea was that the job tracking increment would also be added to the pushbuffer in the kernel, so gathers would only have increments for the "user syncpoints", if any. I think this should work for THI-based engines (e.g. VIC), you probably have better information about GR2D/GR3D. On newer Tegras we could use CHANNEL/APPID protection to prevent the gathers from incrementing these job tracking syncpoints.
Could you please clarify what THI is?
Secondly, I now recall that only one sync point can be assigned to a channel at a time on newer Tegras that support sync point protection. So it sounds like we don't really have any option other than allocating one sync point per channel for job usage, if we want to be able to push multiple jobs into a channel's pushbuffer, correct?
The other way around; each syncpoint can be assigned to one channel. One channel may have multiple syncpoints.
Okay! So we could actually use a single sync point per job for both user increments and job tracking, correct?
...
Hmm, we actually should be able to have one sync point per channel for job submission, similarly to what the current driver does!
I keep forgetting that waitbases exist!
Tegra194 doesn't have waitbases, but if we are resubmitting all the jobs anyway, can't we just recalculate wait thresholds at that time?
Yes, thresholds could be recalculated, and the job should then be re-formed at push time.
It also means that jobs should always be formed only at push time if a wait command is used in the cmdstream, since the waits always need to be patched because we won't know the thresholds until the scheduler actually runs the job.
Could we keep the job tracking syncpoints entirely within the kernel, and have all wait commands and other stuff that userspace does use the user syncpoints? Then kernel job tracking and userspace activity would be separate from each other.
I think this should work, but it's not clear to me what benefit this brings if we can re-use the same user-allocated sync point for both user increments and kernel job tracking.
Is the idea to protect against userspace incrementing the sync point too much? I.e. the job could be treated as completed before CDMA is actually done with it.
Alternatively, if we know that jobs can only be removed from the middle of pushbuffers, and not added, we could replace any removed jobs with syncpoint increments in the pushbuffer and any thresholds would stay intact.
It's a bit unlikely that jobs could ever be removed from the middle of the hardware queue by the DRM scheduler.
Anyway, it should be nicer to shoot down everything and restart with a clean slate when necessary, as the scheduler is already supposed to do.