Hello everyone,
we have a requirement for a bit different kind of fence handling. Currently we handle fences completely inside the kernel, but in the future we would like to emit multiple fences inside the same IB as well.
This works by adding multiple fence commands into an IB which just write their value to a specific location inside a BO and trigger the appropriate hardware interrupt.
The user part of the driver stack should then be able to call an IOCTL to wait for the interrupt and block for the value (or something larger) to be written to the specific location.
This has the advantage that you can have multiple synchronization points in the same IB and don't need to split up your draw commands over several IBs so that the kernel can insert kernel fences in between.
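To make the intended flow concrete, here is a rough outline of how userspace might use this (not runnable code; every helper name below is made up, and the wait call stands in for the IOCTL proposed by the patches below):

	/* Hypothetical flow: one IB, several sync points, a single submission. */
	emit_draw_commands(ib, first_batch);
	emit_user_fence(ib, fence_bo_va + 0x00, 1);	/* write 1, raise IRQ */
	emit_draw_commands(ib, second_batch);
	emit_user_fence(ib, fence_bo_va + 0x08, 2);	/* write 2, raise IRQ */
	submit_ib(fd, ib);				/* single CS submission */

	wait_user_fence(fd, fence_bo_handle, 0x00, 1);	/* proposed wait IOCTL */
	consume_results_of(first_batch);
	wait_user_fence(fd, fence_bo_handle, 0x08, 2);
	consume_results_of(second_batch);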
The following set of patches tries to implement exactly this IOCTL. The big problem with that IOCTL is that TTM needs the BO to be kept in the same place while it is mapped inside the kernel page table. So this requires that we pin down the BO for the duration of the wait IOCTL.
This practically gives userspace a way of pinning down BOs for as long as it wants, without the kernel being able to intervene.
Any ideas how to avoid those problems? Or better ideas how to handle the new requirements?
Please note that the patches are only hacked together quick&dirty to demonstrate the problem and not very well tested.
Best regards, Christian.
From: Christian König christian.koenig@amd.com
After resume, buffers need to be moved back into VRAM. We already had this problem with UVD, but solve it in a more general way to be on the safe side.
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/gpu/drm/radeon/radeon.h | 2 ++ drivers/gpu/drm/radeon/radeon_object.c | 11 +++++++++++ drivers/gpu/drm/radeon/radeon_ttm.c | 3 +++ drivers/gpu/drm/radeon/radeon_uvd.c | 10 ---------- 4 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index 33d5a4f..6e6b49a 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -506,6 +506,8 @@ struct radeon_bo {
struct radeon_mn *mn; struct interval_tree_node mn_it; + + struct radeon_fence *last_move; }; #define gem_to_radeon_bo(gobj) container_of((gobj), struct radeon_bo, gem_base)
diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c index 318165d..14a0f87 100644 --- a/drivers/gpu/drm/radeon/radeon_object.c +++ b/drivers/gpu/drm/radeon/radeon_object.c @@ -263,6 +263,17 @@ int radeon_bo_kmap(struct radeon_bo *bo, void **ptr) bool is_iomem; int r;
+ if (bo->last_move) { + r = radeon_fence_wait(bo->last_move, false); + if (r) { + radeon_bo_kunmap(bo); + DRM_ERROR("Failed waiting BO move (%d)!\n", r); + return r; + } + + radeon_fence_unref(&bo->last_move); + } + if (bo->kptr) { if (ptr) { *ptr = bo->kptr; diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c index b292aca..1fa4f2d 100644 --- a/drivers/gpu/drm/radeon/radeon_ttm.c +++ b/drivers/gpu/drm/radeon/radeon_ttm.c @@ -253,6 +253,7 @@ static int radeon_move_blit(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem, struct ttm_mem_reg *old_mem) { + struct radeon_bo *rbo = container_of(bo, struct radeon_bo, tbo); struct radeon_device *rdev; uint64_t old_start, new_start; struct radeon_fence *fence; @@ -300,6 +301,8 @@ static int radeon_move_blit(struct ttm_buffer_object *bo,
r = ttm_bo_move_accel_cleanup(bo, &fence->base, evict, no_wait_gpu, new_mem); + radeon_fence_unref(&rbo->last_move); + rbo->last_move = radeon_fence_ref(fence); radeon_fence_unref(&fence); return r; } diff --git a/drivers/gpu/drm/radeon/radeon_uvd.c b/drivers/gpu/drm/radeon/radeon_uvd.c index c10b2ae..60f96a3 100644 --- a/drivers/gpu/drm/radeon/radeon_uvd.c +++ b/drivers/gpu/drm/radeon/radeon_uvd.c @@ -401,7 +401,6 @@ static int radeon_uvd_cs_msg(struct radeon_cs_parser *p, struct radeon_bo *bo, { int32_t *msg, msg_type, handle; unsigned img_size = 0; - struct fence *f; void *ptr;
int i, r; @@ -411,15 +410,6 @@ static int radeon_uvd_cs_msg(struct radeon_cs_parser *p, struct radeon_bo *bo, return -EINVAL; }
- f = reservation_object_get_excl(bo->tbo.resv); - if (f) { - r = radeon_fence_wait((struct radeon_fence *)f, false); - if (r) { - DRM_ERROR("Failed waiting for UVD message (%d)!\n", r); - return r; - } - } - r = radeon_bo_kmap(bo, &ptr); if (r) { DRM_ERROR("Failed mapping the UVD message (%d)!\n", r);
From: Christian König christian.koenig@amd.com
That makes it possible to call it from elsewhere as well.
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/gpu/drm/radeon/radeon.h | 2 +- drivers/gpu/drm/radeon/radeon_cs.c | 55 ++++++++++++++++++++------------------ 2 files changed, 30 insertions(+), 27 deletions(-)
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index 6e6b49a..57d63c4 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -1093,7 +1093,6 @@ struct radeon_cs_parser { int parser_error; u32 cs_flags; u32 ring; - s32 priority; struct ww_acquire_ctx ticket; };
@@ -1122,6 +1121,7 @@ typedef int (*radeon_packet0_check_t)(struct radeon_cs_parser *p, typedef int (*radeon_packet3_check_t)(struct radeon_cs_parser *p, struct radeon_cs_packet *pkt);
+int radeon_cs_get_ring(struct radeon_device *rdev, u32 ring, s32 priority);
/* * AGP diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c index 4d0f96c..4b92762 100644 --- a/drivers/gpu/drm/radeon/radeon_cs.c +++ b/drivers/gpu/drm/radeon/radeon_cs.c @@ -187,47 +187,46 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p) return r; }
-static int radeon_cs_get_ring(struct radeon_cs_parser *p, u32 ring, s32 priority) +int radeon_cs_get_ring(struct radeon_device *rdev, u32 ring, s32 priority) { - p->priority = priority; - switch (ring) { default: DRM_ERROR("unknown ring id: %d\n", ring); - return -EINVAL; - case RADEON_CS_RING_GFX: - p->ring = RADEON_RING_TYPE_GFX_INDEX; break; + + case RADEON_CS_RING_GFX: + return RADEON_RING_TYPE_GFX_INDEX; + case RADEON_CS_RING_COMPUTE: - if (p->rdev->family >= CHIP_TAHITI) { - if (p->priority > 0) - p->ring = CAYMAN_RING_TYPE_CP1_INDEX; + if (rdev->family >= CHIP_TAHITI) { + if (priority > 0) + return CAYMAN_RING_TYPE_CP1_INDEX; else - p->ring = CAYMAN_RING_TYPE_CP2_INDEX; + return CAYMAN_RING_TYPE_CP2_INDEX; } else - p->ring = RADEON_RING_TYPE_GFX_INDEX; - break; + return RADEON_RING_TYPE_GFX_INDEX; + case RADEON_CS_RING_DMA: - if (p->rdev->family >= CHIP_CAYMAN) { - if (p->priority > 0) - p->ring = R600_RING_TYPE_DMA_INDEX; + if (rdev->family >= CHIP_CAYMAN) { + if (priority > 0) + return R600_RING_TYPE_DMA_INDEX; else - p->ring = CAYMAN_RING_TYPE_DMA1_INDEX; - } else if (p->rdev->family >= CHIP_RV770) { - p->ring = R600_RING_TYPE_DMA_INDEX; + return CAYMAN_RING_TYPE_DMA1_INDEX; + } else if (rdev->family >= CHIP_RV770) { + return R600_RING_TYPE_DMA_INDEX; } else { return -EINVAL; } - break; + case RADEON_CS_RING_UVD: - p->ring = R600_RING_TYPE_UVD_INDEX; - break; + return R600_RING_TYPE_UVD_INDEX; + case RADEON_CS_RING_VCE: /* TODO: only use the low priority ring for now */ - p->ring = TN_RING_TYPE_VCE1_INDEX; - break; + return TN_RING_TYPE_VCE1_INDEX; + } - return 0; + return -EINVAL; }
static int radeon_cs_sync_rings(struct radeon_cs_parser *p) @@ -348,14 +347,18 @@ int radeon_cs_parser_init(struct radeon_cs_parser *p, void *data)
/* these are KMS only */ if (p->rdev) { + int r; + if ((p->cs_flags & RADEON_CS_USE_VM) && !p->rdev->vm_manager.enabled) { DRM_ERROR("VM not active on asic!\n"); return -EINVAL; }
- if (radeon_cs_get_ring(p, ring, priority)) - return -EINVAL; + r = radeon_cs_get_ring(p->rdev, ring, priority); + if (r < 0) + return r; + p->ring = r;
/* we only support VM on some SI+ rings */ if ((p->cs_flags & RADEON_CS_USE_VM) == 0) {
From: Christian König christian.koenig@amd.com
WIP patch which adds a user fence IOCTL.
Signed-off-by: Christian König christian.koenig@amd.com --- drivers/gpu/drm/radeon/radeon.h | 2 ++ drivers/gpu/drm/radeon/radeon_gem.c | 71 +++++++++++++++++++++++++++++++++++++ drivers/gpu/drm/radeon/radeon_kms.c | 1 + include/uapi/drm/radeon_drm.h | 41 +++++++++++++-------- 4 files changed, 100 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index 57d63c4..110baae 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -2215,6 +2215,8 @@ int radeon_gem_create_ioctl(struct drm_device *dev, void *data, struct drm_file *filp); int radeon_gem_userptr_ioctl(struct drm_device *dev, void *data, struct drm_file *filp); +int radeon_gem_wait_userfence_ioctl(struct drm_device *dev, void *data, + struct drm_file *filp); int radeon_gem_pin_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv); int radeon_gem_unpin_ioctl(struct drm_device *dev, void *data, diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c index ac3c131..094b3d5 100644 --- a/drivers/gpu/drm/radeon/radeon_gem.c +++ b/drivers/gpu/drm/radeon/radeon_gem.c @@ -365,6 +365,77 @@ handle_lockup: return r; }
+int radeon_gem_wait_userfence_ioctl(struct drm_device *dev, void *data, + struct drm_file *filp) +{ + struct radeon_device *rdev = dev->dev_private; + struct drm_radeon_gem_wait_userfence *args = data; + struct drm_gem_object *gobj; + struct radeon_bo *robj; + uint64_t *fence_addr; + void *cpu_addr; + int r, ring; + + down_read(&rdev->exclusive_lock); + + ring = radeon_cs_get_ring(rdev, args->ring, args->priority); + if (ring < 0) + return ring; + + gobj = drm_gem_object_lookup(dev, filp, args->handle); + if (gobj == NULL) { + r = -ENOENT; + goto error_unref; + } + robj = gem_to_radeon_bo(gobj); + + if (args->offset & 0x7 || args->offset + 8 > radeon_bo_size(robj)) { + r = -EINVAL; + goto error_unref; + } + + r = radeon_bo_reserve(robj, false); + if (r) + goto error_unref; + + r = radeon_bo_pin(rdev->uvd.vcpu_bo, RADEON_GEM_DOMAIN_VRAM | + RADEON_GEM_DOMAIN_GTT, NULL); + if (r) + goto error_unreserve; + + r = radeon_bo_kmap(robj, &cpu_addr); + if (r) + goto error_unpin; + + radeon_bo_unreserve(robj); + + fence_addr = (uint64_t *)(((uint8_t*)cpu_addr) + args->offset); + + radeon_irq_kms_sw_irq_get(rdev, ring); + r = wait_event_interruptible(rdev->fence_queue, ( + *fence_addr >= args->fence || rdev->needs_reset + )); + radeon_irq_kms_sw_irq_put(rdev, ring); + + if (rdev->needs_reset) + r = -EDEADLK; + + radeon_bo_reserve(robj, false); + radeon_bo_kunmap(robj); + +error_unpin: + radeon_bo_unpin(robj); + +error_unreserve: + radeon_bo_unreserve(robj); + +error_unref: + drm_gem_object_unreference_unlocked(gobj); + up_read(&rdev->exclusive_lock); + r = radeon_gem_handle_lockup(robj->rdev, r); + return r; +} + int radeon_gem_set_domain_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) { diff --git a/drivers/gpu/drm/radeon/radeon_kms.c b/drivers/gpu/drm/radeon/radeon_kms.c index 686411e..4757f25 100644 --- a/drivers/gpu/drm/radeon/radeon_kms.c +++ b/drivers/gpu/drm/radeon/radeon_kms.c @@ -893,5 +893,6 @@ const struct drm_ioctl_desc radeon_ioctls_kms[] = { DRM_IOCTL_DEF_DRV(RADEON_GEM_VA, radeon_gem_va_ioctl, DRM_AUTH|DRM_UNLOCKED|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(RADEON_GEM_OP, radeon_gem_op_ioctl, DRM_AUTH|DRM_UNLOCKED|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(RADEON_GEM_USERPTR, radeon_gem_userptr_ioctl, DRM_AUTH|DRM_UNLOCKED|DRM_RENDER_ALLOW), + DRM_IOCTL_DEF_DRV(RADEON_GEM_WAIT_USERFENCE, radeon_gem_wait_userfence_ioctl, DRM_AUTH|DRM_UNLOCKED|DRM_RENDER_ALLOW), }; int radeon_max_kms_ioctl = ARRAY_SIZE(radeon_ioctls_kms); diff --git a/include/uapi/drm/radeon_drm.h b/include/uapi/drm/radeon_drm.h index 50d0fb4..43fe660 100644 --- a/include/uapi/drm/radeon_drm.h +++ b/include/uapi/drm/radeon_drm.h @@ -512,6 +512,7 @@ typedef struct { #define DRM_RADEON_GEM_VA 0x2b #define DRM_RADEON_GEM_OP 0x2c #define DRM_RADEON_GEM_USERPTR 0x2d +#define DRM_RADEON_GEM_WAIT_USERFENCE 0x2e
#define DRM_IOCTL_RADEON_CP_INIT DRM_IOW( DRM_COMMAND_BASE + DRM_RADEON_CP_INIT, drm_radeon_init_t) #define DRM_IOCTL_RADEON_CP_START DRM_IO( DRM_COMMAND_BASE + DRM_RADEON_CP_START) @@ -541,21 +542,22 @@ typedef struct { #define DRM_IOCTL_RADEON_SURF_ALLOC DRM_IOW( DRM_COMMAND_BASE + DRM_RADEON_SURF_ALLOC, drm_radeon_surface_alloc_t) #define DRM_IOCTL_RADEON_SURF_FREE DRM_IOW( DRM_COMMAND_BASE + DRM_RADEON_SURF_FREE, drm_radeon_surface_free_t) /* KMS */ -#define DRM_IOCTL_RADEON_GEM_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_INFO, struct drm_radeon_gem_info) -#define DRM_IOCTL_RADEON_GEM_CREATE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_CREATE, struct drm_radeon_gem_create) -#define DRM_IOCTL_RADEON_GEM_MMAP DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_MMAP, struct drm_radeon_gem_mmap) -#define DRM_IOCTL_RADEON_GEM_PREAD DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_PREAD, struct drm_radeon_gem_pread) -#define DRM_IOCTL_RADEON_GEM_PWRITE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_PWRITE, struct drm_radeon_gem_pwrite) -#define DRM_IOCTL_RADEON_GEM_SET_DOMAIN DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_SET_DOMAIN, struct drm_radeon_gem_set_domain) -#define DRM_IOCTL_RADEON_GEM_WAIT_IDLE DRM_IOW(DRM_COMMAND_BASE + DRM_RADEON_GEM_WAIT_IDLE, struct drm_radeon_gem_wait_idle) -#define DRM_IOCTL_RADEON_CS DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_CS, struct drm_radeon_cs) -#define DRM_IOCTL_RADEON_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_INFO, struct drm_radeon_info) -#define DRM_IOCTL_RADEON_GEM_SET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_SET_TILING, struct drm_radeon_gem_set_tiling) -#define DRM_IOCTL_RADEON_GEM_GET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_GET_TILING, struct drm_radeon_gem_get_tiling) -#define DRM_IOCTL_RADEON_GEM_BUSY DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_BUSY, struct drm_radeon_gem_busy) -#define DRM_IOCTL_RADEON_GEM_VA DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_VA, struct drm_radeon_gem_va) -#define DRM_IOCTL_RADEON_GEM_OP DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_OP, struct drm_radeon_gem_op) -#define DRM_IOCTL_RADEON_GEM_USERPTR DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_USERPTR, struct drm_radeon_gem_userptr) +#define DRM_IOCTL_RADEON_GEM_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_INFO, struct drm_radeon_gem_info) +#define DRM_IOCTL_RADEON_GEM_CREATE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_CREATE, struct drm_radeon_gem_create) +#define DRM_IOCTL_RADEON_GEM_MMAP DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_MMAP, struct drm_radeon_gem_mmap) +#define DRM_IOCTL_RADEON_GEM_PREAD DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_PREAD, struct drm_radeon_gem_pread) +#define DRM_IOCTL_RADEON_GEM_PWRITE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_PWRITE, struct drm_radeon_gem_pwrite) +#define DRM_IOCTL_RADEON_GEM_SET_DOMAIN DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_SET_DOMAIN, struct drm_radeon_gem_set_domain) +#define DRM_IOCTL_RADEON_GEM_WAIT_IDLE DRM_IOW(DRM_COMMAND_BASE + DRM_RADEON_GEM_WAIT_IDLE, struct drm_radeon_gem_wait_idle) +#define DRM_IOCTL_RADEON_CS DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_CS, struct drm_radeon_cs) +#define DRM_IOCTL_RADEON_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_INFO, struct drm_radeon_info) +#define DRM_IOCTL_RADEON_GEM_SET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_SET_TILING, struct drm_radeon_gem_set_tiling) +#define DRM_IOCTL_RADEON_GEM_GET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_GET_TILING, struct drm_radeon_gem_get_tiling) +#define DRM_IOCTL_RADEON_GEM_BUSY DRM_IOWR(DRM_COMMAND_BASE + 
DRM_RADEON_GEM_BUSY, struct drm_radeon_gem_busy) +#define DRM_IOCTL_RADEON_GEM_VA DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_VA, struct drm_radeon_gem_va) +#define DRM_IOCTL_RADEON_GEM_OP DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_OP, struct drm_radeon_gem_op) +#define DRM_IOCTL_RADEON_GEM_USERPTR DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_USERPTR, struct drm_radeon_gem_userptr) +#define DRM_IOCTL_RADEON_GEM_WAIT_USERFENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_WAIT_USERFENCE, struct drm_radeon_gem_wait_userfence)
typedef struct drm_radeon_init { enum { @@ -831,6 +833,15 @@ struct drm_radeon_gem_userptr { uint32_t handle; };
+struct drm_radeon_gem_wait_userfence { + uint32_t handle; + uint32_t offset; + uint64_t fence; + uint32_t ring; + int32_t priority; + uint32_t flags; +}; + #define RADEON_TILING_MACRO 0x1 #define RADEON_TILING_MICRO 0x2 #define RADEON_TILING_SWAP_16BIT 0x4
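For illustration, a userspace caller would fill in the new struct and issue the ioctl roughly like this (sketch only; it assumes libdrm's drmCommandWriteRead and that a previously submitted IB emits a fence write of at least seq to offset inside the BO referenced by handle):

	#include <string.h>
	#include <stdint.h>
	#include <xf86drm.h>
	#include "radeon_drm.h"	/* the updated uapi header from this patch */

	/* Block until the user fence at (handle, offset) reaches at least seq. */
	static int wait_user_fence(int fd, uint32_t handle, uint32_t offset,
				   uint64_t seq, uint32_t ring)
	{
		struct drm_radeon_gem_wait_userfence args;

		memset(&args, 0, sizeof(args));
		args.handle = handle;
		args.offset = offset;	/* must be 8-byte aligned and inside the BO */
		args.fence = seq;
		args.ring = ring;	/* e.g. RADEON_CS_RING_GFX */
		args.priority = 0;

		return drmCommandWriteRead(fd, DRM_RADEON_GEM_WAIT_USERFENCE,
					   &args, sizeof(args));
	}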
On Mon, Apr 13, 2015 at 04:52:17PM +0200, Christian König wrote:
From: Christian König christian.koenig@amd.com
WIP patch which adds an user fence IOCTL.
Signed-off-by: Christian König christian.koenig@amd.com
I discussed userspace fences a lot with Jerome at last XDC, so here are my comments:
My primary concern with mid-batch fences is that if we create real kernel fences (which might even escape to other places using android syncpts or dma-buf) then we end up relying upon correct userspace to not hang the kernel, which isn't good.
So imo any kind of mid-batch fence must be done completely in userspace and never show up as a fence object on the kernel side. I thought that just busy-spinning in userspace would be all that's needed, but adding an ioctl to wait on such user fences seems like a nice idea too. On i915 we even have 2 interrupt sources per ring, so we could split the irq processing between kernel fences and userspace fences.
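For reference, the pure-userspace variant is just a trivial poll loop over a CPU mapping of the fence location, something like this (sketch; fence_ptr is assumed to come from the usual GEM mmap path):

	#include <stdint.h>
	#include <sched.h>

	/* Busy-wait until the GPU has written a value >= wanted to the mapped
	 * fence location. Purely illustrative: burns CPU and never times out. */
	static void user_fence_spin(volatile uint64_t *fence_ptr, uint64_t wanted)
	{
		while (*fence_ptr < wanted)
			sched_yield();
	}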
One thing to keep in mind (I dunno radeon/ttm internals enough to know) is to make sure that while being blocked for a userspace fence in the ioctl you're not starving anyone else. But it doesn't look like you're holding any reservation objects or something similar which might prevent concurrent cs.
Anyway, if there's nothing more to this then I think this is a very sane idea.
Cheers, Daniel
drivers/gpu/drm/radeon/radeon.h | 2 ++ drivers/gpu/drm/radeon/radeon_gem.c | 71 +++++++++++++++++++++++++++++++++++++ drivers/gpu/drm/radeon/radeon_kms.c | 1 + include/uapi/drm/radeon_drm.h | 41 +++++++++++++-------- 4 files changed, 100 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index 57d63c4..110baae 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -2215,6 +2215,8 @@ int radeon_gem_create_ioctl(struct drm_device *dev, void *data, struct drm_file *filp); int radeon_gem_userptr_ioctl(struct drm_device *dev, void *data, struct drm_file *filp); +int radeon_gem_wait_userfence_ioctl(struct drm_device *dev, void *data,
struct drm_file *filp);
int radeon_gem_pin_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv); int radeon_gem_unpin_ioctl(struct drm_device *dev, void *data, diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c index ac3c131..094b3d5 100644 --- a/drivers/gpu/drm/radeon/radeon_gem.c +++ b/drivers/gpu/drm/radeon/radeon_gem.c @@ -365,6 +365,77 @@ handle_lockup: return r; }
+int radeon_gem_wait_userfence_ioctl(struct drm_device *dev, void *data,
struct drm_file *filp)
+{
- struct radeon_device *rdev = dev->dev_private;
- struct drm_radeon_gem_wait_userfence *args = data;
- struct drm_gem_object *gobj;
- struct radeon_bo *robj;
- uint64_t *fence_addr;
- void *cpu_addr;
- int r, ring;
- down_read(&rdev->exclusive_lock);
- ring = radeon_cs_get_ring(rdev, args->ring, args->priority);
- if (ring < 0)
return ring;
- gobj = drm_gem_object_lookup(dev, filp, args->handle);
- if (gobj == NULL) {
r = -ENOENT;
goto error_unref;
- }
- robj = gem_to_radeon_bo(gobj);
- if (args->offset & 0x7 || args->offset + 8 > radeon_bo_size(robj)) {
r = -EINVAL;
goto error_unref;
- }
- r = radeon_bo_reserve(robj, false);
if (r)
goto error_unref;
r = radeon_bo_pin(rdev->uvd.vcpu_bo, RADEON_GEM_DOMAIN_VRAM |
RADEON_GEM_DOMAIN_GTT, NULL);
- if (r)
goto error_unreserve;
r = radeon_bo_kmap(robj, &cpu_addr);
if (r)
goto error_unpin;
radeon_bo_unreserve(robj);
- fence_addr = (uint64_t *)(((uint8_t*)cpu_addr) + args->offset);
- radeon_irq_kms_sw_irq_get(rdev, ring);
- r = wait_event_interruptible(rdev->fence_queue, (
*fence_addr >= args->fence || rdev->needs_reset
- ));
- radeon_irq_kms_sw_irq_put(rdev, ring);
- if (rdev->needs_reset)
r = -EDEADLK;
- radeon_bo_reserve(robj, false);
- radeon_bo_kunmap(robj);
+error_unpin:
- radeon_bo_unpin(robj);
+error_unreserve:
radeon_bo_unreserve(robj);
+error_unref:
- drm_gem_object_unreference_unlocked(gobj);
- up_read(&rdev->exclusive_lock);
- r = radeon_gem_handle_lockup(robj->rdev, r);
- return r;
+}
int radeon_gem_set_domain_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) { diff --git a/drivers/gpu/drm/radeon/radeon_kms.c b/drivers/gpu/drm/radeon/radeon_kms.c index 686411e..4757f25 100644 --- a/drivers/gpu/drm/radeon/radeon_kms.c +++ b/drivers/gpu/drm/radeon/radeon_kms.c @@ -893,5 +893,6 @@ const struct drm_ioctl_desc radeon_ioctls_kms[] = { DRM_IOCTL_DEF_DRV(RADEON_GEM_VA, radeon_gem_va_ioctl, DRM_AUTH|DRM_UNLOCKED|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(RADEON_GEM_OP, radeon_gem_op_ioctl, DRM_AUTH|DRM_UNLOCKED|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(RADEON_GEM_USERPTR, radeon_gem_userptr_ioctl, DRM_AUTH|DRM_UNLOCKED|DRM_RENDER_ALLOW),
- DRM_IOCTL_DEF_DRV(RADEON_GEM_WAIT_USERFENCE, radeon_gem_wait_userfence_ioctl, DRM_AUTH|DRM_UNLOCKED|DRM_RENDER_ALLOW),
}; int radeon_max_kms_ioctl = ARRAY_SIZE(radeon_ioctls_kms); diff --git a/include/uapi/drm/radeon_drm.h b/include/uapi/drm/radeon_drm.h index 50d0fb4..43fe660 100644 --- a/include/uapi/drm/radeon_drm.h +++ b/include/uapi/drm/radeon_drm.h @@ -512,6 +512,7 @@ typedef struct { #define DRM_RADEON_GEM_VA 0x2b #define DRM_RADEON_GEM_OP 0x2c #define DRM_RADEON_GEM_USERPTR 0x2d +#define DRM_RADEON_GEM_WAIT_USERFENCE 0x2e
#define DRM_IOCTL_RADEON_CP_INIT DRM_IOW( DRM_COMMAND_BASE + DRM_RADEON_CP_INIT, drm_radeon_init_t) #define DRM_IOCTL_RADEON_CP_START DRM_IO( DRM_COMMAND_BASE + DRM_RADEON_CP_START) @@ -541,21 +542,22 @@ typedef struct { #define DRM_IOCTL_RADEON_SURF_ALLOC DRM_IOW( DRM_COMMAND_BASE + DRM_RADEON_SURF_ALLOC, drm_radeon_surface_alloc_t) #define DRM_IOCTL_RADEON_SURF_FREE DRM_IOW( DRM_COMMAND_BASE + DRM_RADEON_SURF_FREE, drm_radeon_surface_free_t) /* KMS */ -#define DRM_IOCTL_RADEON_GEM_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_INFO, struct drm_radeon_gem_info) -#define DRM_IOCTL_RADEON_GEM_CREATE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_CREATE, struct drm_radeon_gem_create) -#define DRM_IOCTL_RADEON_GEM_MMAP DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_MMAP, struct drm_radeon_gem_mmap) -#define DRM_IOCTL_RADEON_GEM_PREAD DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_PREAD, struct drm_radeon_gem_pread) -#define DRM_IOCTL_RADEON_GEM_PWRITE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_PWRITE, struct drm_radeon_gem_pwrite) -#define DRM_IOCTL_RADEON_GEM_SET_DOMAIN DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_SET_DOMAIN, struct drm_radeon_gem_set_domain) -#define DRM_IOCTL_RADEON_GEM_WAIT_IDLE DRM_IOW(DRM_COMMAND_BASE + DRM_RADEON_GEM_WAIT_IDLE, struct drm_radeon_gem_wait_idle) -#define DRM_IOCTL_RADEON_CS DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_CS, struct drm_radeon_cs) -#define DRM_IOCTL_RADEON_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_INFO, struct drm_radeon_info) -#define DRM_IOCTL_RADEON_GEM_SET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_SET_TILING, struct drm_radeon_gem_set_tiling) -#define DRM_IOCTL_RADEON_GEM_GET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_GET_TILING, struct drm_radeon_gem_get_tiling) -#define DRM_IOCTL_RADEON_GEM_BUSY DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_BUSY, struct drm_radeon_gem_busy) -#define DRM_IOCTL_RADEON_GEM_VA DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_VA, struct drm_radeon_gem_va) -#define DRM_IOCTL_RADEON_GEM_OP DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_OP, struct drm_radeon_gem_op) -#define DRM_IOCTL_RADEON_GEM_USERPTR DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_USERPTR, struct drm_radeon_gem_userptr) +#define DRM_IOCTL_RADEON_GEM_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_INFO, struct drm_radeon_gem_info) +#define DRM_IOCTL_RADEON_GEM_CREATE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_CREATE, struct drm_radeon_gem_create) +#define DRM_IOCTL_RADEON_GEM_MMAP DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_MMAP, struct drm_radeon_gem_mmap) +#define DRM_IOCTL_RADEON_GEM_PREAD DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_PREAD, struct drm_radeon_gem_pread) +#define DRM_IOCTL_RADEON_GEM_PWRITE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_PWRITE, struct drm_radeon_gem_pwrite) +#define DRM_IOCTL_RADEON_GEM_SET_DOMAIN DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_SET_DOMAIN, struct drm_radeon_gem_set_domain) +#define DRM_IOCTL_RADEON_GEM_WAIT_IDLE DRM_IOW(DRM_COMMAND_BASE + DRM_RADEON_GEM_WAIT_IDLE, struct drm_radeon_gem_wait_idle) +#define DRM_IOCTL_RADEON_CS DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_CS, struct drm_radeon_cs) +#define DRM_IOCTL_RADEON_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_INFO, struct drm_radeon_info) +#define DRM_IOCTL_RADEON_GEM_SET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_SET_TILING, struct drm_radeon_gem_set_tiling) +#define DRM_IOCTL_RADEON_GEM_GET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_GET_TILING, struct drm_radeon_gem_get_tiling) +#define DRM_IOCTL_RADEON_GEM_BUSY DRM_IOWR(DRM_COMMAND_BASE + 
DRM_RADEON_GEM_BUSY, struct drm_radeon_gem_busy) +#define DRM_IOCTL_RADEON_GEM_VA DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_VA, struct drm_radeon_gem_va) +#define DRM_IOCTL_RADEON_GEM_OP DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_OP, struct drm_radeon_gem_op) +#define DRM_IOCTL_RADEON_GEM_USERPTR DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_USERPTR, struct drm_radeon_gem_userptr) +#define DRM_IOCTL_RADEON_GEM_WAIT_USERFENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_WAIT_USERFENCE, struct drm_radeon_gem_wait_userfence)
typedef struct drm_radeon_init { enum { @@ -831,6 +833,15 @@ struct drm_radeon_gem_userptr { uint32_t handle; };
+struct drm_radeon_gem_wait_userfence {
- uint32_t handle;
- uint32_t offset;
- uint64_t fence;
- uint32_t ring;
- int32_t priority;
- uint32_t flags;
+};
#define RADEON_TILING_MACRO 0x1 #define RADEON_TILING_MICRO 0x2
#define RADEON_TILING_SWAP_16BIT 0x4
On Mon, Apr 13, 2015 at 07:23:34PM +0200, Daniel Vetter wrote:
On Mon, Apr 13, 2015 at 04:52:17PM +0200, Christian König wrote:
From: Christian König christian.koenig@amd.com
WIP patch which adds an user fence IOCTL.
Signed-off-by: Christian König christian.koenig@amd.com
I've discussed userspace fences a lot with Jerome last XDC, so here's my comments:
My primary concern with mid-batch fences is that if we create real kernel fences (which might even escape to other places using android syncpts or dma-buf) then we end up relying upon correct userspace to not hang the kernel, which isn't good.
Yes, I agree on that; the solution I propose makes sure that this cannot happen.
So imo any kind of mid-batch fence must be done completely in userspace and never show up as a fence object on the kernel side. I thought that just busy-spinning in userspace would be all that's needed, but adding an ioctl to wait on such user fences seems like a nice idea too. On i915 we even have 2 interrupt sources per ring, so we could split the irq processing between kernel fences and userspace fences.
Technically the kernel does not allocate any object here; it is just that the kernel can enable the GPU interrupt and thus wait "intelligently" until the GPU fires an interrupt telling us that it might be a good time to look at the fence value.
So technically this ioctl is nothing more than a wait for an IRQ plus a check of a memory value.
One thing to keep in mind (I dunno radeon/ttm internals enough to know) is to make sure that while being blocked for a userspace fence in the ioctl you're not starving anyone else. But it doesn't look like you're holding any reservation objects or something similar which might prevent concurrent cs.
Yes, this is the discussion we are having: how to make sure that such an ioctl would not block any regular processing, so that it could not be abused in any way (well, at least in any way my devious imagination can think of right now :)).
Cheers, Jérôme
On 13.04.2015 19:51, Jerome Glisse wrote:
On Mon, Apr 13, 2015 at 07:23:34PM +0200, Daniel Vetter wrote:
On Mon, Apr 13, 2015 at 04:52:17PM +0200, Christian König wrote:
From: Christian König christian.koenig@amd.com
WIP patch which adds an user fence IOCTL.
Signed-off-by: Christian König christian.koenig@amd.com
I've discussed userspace fences a lot with Jerome last XDC, so here's my comments:
My primary concern with mid-batch fences is that if we create real kernel fences (which might even escape to other places using android syncpts or dma-buf) then we end up relying upon correct userspace to not hang the kernel, which isn't good.
Yes i agree on that, solution i propose make sure that this can not happen.
What if we want to base a GPU scheduler and Android sync points on that functionality? E.g. it might be necessary to create "struct fence" objects which are based on the information from userspace.
Would that be possible or would we run into issues?
Regards, Christian.
So imo any kind of mid-batch fence must be done completely in userspace and never show up as a fence object on the kernel side. I thought that just busy-spinning in userspace would be all that's needed, but adding an ioctl to wait on such user fences seems like a nice idea too. On i915 we even have 2 interrupt sources per ring, so we could split the irq processing between kernel fences and userspace fences.
Technicaly here the kernel does not allocate any object it just that kernel can enable GPU interrupt and thus wait "inteligently" until the GPU fire an interrupt telling us that it might be a good time to look at the fence value.
So technicaly this ioctl is nothing more than a wait for irq and check memory value.
One thing to keep in mind (I dunno radeon/ttm internals enough to know) is to make sure that while being blocked for a userspace fence in the ioctl you're not starving anyone else. But it doesn't look like you're holding any reservation objects or something similar which might prevent concurrent cs.
Yes this is the discussion we are having, how to make sure that such ioctl would not block any regular processing so that it could not be abuse in anyway (well at least in anyway my devious imagination can think of right now :)).
Cheers, Jérôme
On Mon, Apr 13, 2015 at 08:26:05PM +0200, Christian König wrote:
On 13.04.2015 19:51, Jerome Glisse wrote:
On Mon, Apr 13, 2015 at 07:23:34PM +0200, Daniel Vetter wrote:
On Mon, Apr 13, 2015 at 04:52:17PM +0200, Christian König wrote:
From: Christian König christian.koenig@amd.com
WIP patch which adds an user fence IOCTL.
Signed-off-by: Christian König christian.koenig@amd.com
I've discussed userspace fences a lot with Jerome last XDC, so here's my comments:
My primary concern with mid-batch fences is that if we create real kernel fences (which might even escape to other places using android syncpts or dma-buf) then we end up relying upon correct userspace to not hang the kernel, which isn't good.
Yes i agree on that, solution i propose make sure that this can not happen.
What if we want to base a GPU scheduler and Android sync points on that functionality? E.g. it might be necessary to create "struct fence" objects which are based on the information from userspace.
Would that be possible or would we run into issues?
Well, I would not like that (but then I do not like the Android code much). You could use the same code as I show in my other email and just have the Android fence struct take a reference on the BO where the fence is, then use the same code to check its status.
Obviously there should be a timeout, as there is no way for the kernel to know if such a fence will ever signal.
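A very rough sketch of what such a wrapper could look like (all names are hypothetical and only meant to make the idea concrete; the BO reference is what keeps the kernel mapping valid, and proper timeout handling would need more care than shown here):

	/* Hypothetical wrapper: a struct fence whose ->signaled() callback simply
	 * compares the value in the referenced, kmapped BO against the target,
	 * and additionally gives up after a deadline. */
	struct radeon_user_fence {
		struct fence base;
		struct radeon_bo *bo;		/* holds a reference on the BO */
		uint64_t *cpu_addr;		/* kmapped fence location */
		uint64_t seq;
		unsigned long deadline;		/* jiffies after which we give up */
	};

	static bool radeon_user_fence_signaled(struct fence *f)
	{
		struct radeon_user_fence *ufence =
			container_of(f, struct radeon_user_fence, base);

		return READ_ONCE(*ufence->cpu_addr) >= ufence->seq ||
		       time_after(jiffies, ufence->deadline);
	}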
Cheers, Jérôme
On Mon, Apr 13, 2015 at 03:48:51PM -0400, Jerome Glisse wrote:
On Mon, Apr 13, 2015 at 08:26:05PM +0200, Christian König wrote:
On 13.04.2015 19:51, Jerome Glisse wrote:
On Mon, Apr 13, 2015 at 07:23:34PM +0200, Daniel Vetter wrote:
On Mon, Apr 13, 2015 at 04:52:17PM +0200, Christian König wrote:
From: Christian König christian.koenig@amd.com
WIP patch which adds an user fence IOCTL.
Signed-off-by: Christian König christian.koenig@amd.com
I've discussed userspace fences a lot with Jerome last XDC, so here's my comments:
My primary concern with mid-batch fences is that if we create real kernel fences (which might even escape to other places using android syncpts or dma-buf) then we end up relying upon correct userspace to not hang the kernel, which isn't good.
Yes i agree on that, solution i propose make sure that this can not happen.
What if we want to base a GPU scheduler and Android sync points on that functionality? E.g. it might be necessary to create "struct fence" objects which are based on the information from userspace.
Would that be possible or would we run into issues?
Well i would not like that, but i do not like Android code much, but you could use the same code as i show in my other email and just have the android fence struct take a reference on the BO where the fence is. Then using same code to check status.
Obviously there should be timeout as there is no way for the kernel to know if such fence will ever signal.
Yeah, I think if your UMD is using this for all fencing needs, including cross-process and everything, then I don't think it's such a good idea any more ;-) But if it's just fine-grained sync for heterogeneous compute within the same process (OCL, HSA or whatever) then this seems reasonable.
I guess this boils down to what the userspace side will look like, and what kind of standard you're trying to implement here. -Daniel
On Mon, Apr 13, 2015 at 04:52:17PM +0200, Christian König wrote:
From: Christian König christian.koenig@amd.com
WIP patch which adds an user fence IOCTL.
Signed-off-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/radeon/radeon.h | 2 ++ drivers/gpu/drm/radeon/radeon_gem.c | 71 +++++++++++++++++++++++++++++++++++++ drivers/gpu/drm/radeon/radeon_kms.c | 1 + include/uapi/drm/radeon_drm.h | 41 +++++++++++++-------- 4 files changed, 100 insertions(+), 15 deletions(-)
diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h index 57d63c4..110baae 100644 --- a/drivers/gpu/drm/radeon/radeon.h +++ b/drivers/gpu/drm/radeon/radeon.h @@ -2215,6 +2215,8 @@ int radeon_gem_create_ioctl(struct drm_device *dev, void *data, struct drm_file *filp); int radeon_gem_userptr_ioctl(struct drm_device *dev, void *data, struct drm_file *filp); +int radeon_gem_wait_userfence_ioctl(struct drm_device *dev, void *data,
struct drm_file *filp);
int radeon_gem_pin_ioctl(struct drm_device *dev, void *data, struct drm_file *file_priv); int radeon_gem_unpin_ioctl(struct drm_device *dev, void *data, diff --git a/drivers/gpu/drm/radeon/radeon_gem.c b/drivers/gpu/drm/radeon/radeon_gem.c index ac3c131..094b3d5 100644 --- a/drivers/gpu/drm/radeon/radeon_gem.c +++ b/drivers/gpu/drm/radeon/radeon_gem.c @@ -365,6 +365,77 @@ handle_lockup: return r; }
+int radeon_gem_wait_userfence_ioctl(struct drm_device *dev, void *data,
struct drm_file *filp)
+{
- struct radeon_device *rdev = dev->dev_private;
- struct drm_radeon_gem_wait_userfence *args = data;
- struct drm_gem_object *gobj;
- struct radeon_bo *robj;
- uint64_t *fence_addr;
- void *cpu_addr;
- int r, ring;
- down_read(&rdev->exclusive_lock);
- ring = radeon_cs_get_ring(rdev, args->ring, args->priority);
- if (ring < 0)
return ring;
- gobj = drm_gem_object_lookup(dev, filp, args->handle);
- if (gobj == NULL) {
r = -ENOENT;
goto error_unref;
- }
- robj = gem_to_radeon_bo(gobj);
- if (args->offset & 0x7 || args->offset + 8 > radeon_bo_size(robj)) {
r = -EINVAL;
goto error_unref;
- }
- r = radeon_bo_reserve(robj, false);
if (r)
goto error_unref;
r = radeon_bo_pin(rdev->uvd.vcpu_bo, RADEON_GEM_DOMAIN_VRAM |
RADEON_GEM_DOMAIN_GTT, NULL);
One thing on top of making sure that you don't hold any locks while waiting to ensure you can't abuse this: I think you need to add a check here that the bo is reasonably sized, otherwise someone will abuse this to lock in massive amounts of RAM. As long as the limit is reasonably small we can subsume the overhead under the per-thread limits imo. -Daniel
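Something along these lines, presumably, placed right after the existing offset check (the limit and error code are just placeholders):

	/* hypothetical: refuse to pin unreasonably large BOs */
	if (radeon_bo_size(robj) > 2 * PAGE_SIZE) {
		r = -E2BIG;
		goto error_unref;
	}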
- if (r)
goto error_unreserve;
r = radeon_bo_kmap(robj, &cpu_addr);
if (r)
goto error_unpin;
radeon_bo_unreserve(robj);
- fence_addr = (uint64_t *)(((uint8_t*)cpu_addr) + args->offset);
- radeon_irq_kms_sw_irq_get(rdev, ring);
- r = wait_event_interruptible(rdev->fence_queue, (
*fence_addr >= args->fence || rdev->needs_reset
- ));
- radeon_irq_kms_sw_irq_put(rdev, ring);
- if (rdev->needs_reset)
r = -EDEADLK;
- radeon_bo_reserve(robj, false);
- radeon_bo_kunmap(robj);
+error_unpin:
- radeon_bo_unpin(robj);
+error_unreserve:
radeon_bo_unreserve(robj);
+error_unref:
- drm_gem_object_unreference_unlocked(gobj);
- up_read(&rdev->exclusive_lock);
- r = radeon_gem_handle_lockup(robj->rdev, r);
- return r;
+}
int radeon_gem_set_domain_ioctl(struct drm_device *dev, void *data, struct drm_file *filp) { diff --git a/drivers/gpu/drm/radeon/radeon_kms.c b/drivers/gpu/drm/radeon/radeon_kms.c index 686411e..4757f25 100644 --- a/drivers/gpu/drm/radeon/radeon_kms.c +++ b/drivers/gpu/drm/radeon/radeon_kms.c @@ -893,5 +893,6 @@ const struct drm_ioctl_desc radeon_ioctls_kms[] = { DRM_IOCTL_DEF_DRV(RADEON_GEM_VA, radeon_gem_va_ioctl, DRM_AUTH|DRM_UNLOCKED|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(RADEON_GEM_OP, radeon_gem_op_ioctl, DRM_AUTH|DRM_UNLOCKED|DRM_RENDER_ALLOW), DRM_IOCTL_DEF_DRV(RADEON_GEM_USERPTR, radeon_gem_userptr_ioctl, DRM_AUTH|DRM_UNLOCKED|DRM_RENDER_ALLOW),
- DRM_IOCTL_DEF_DRV(RADEON_GEM_WAIT_USERFENCE, radeon_gem_wait_userfence_ioctl, DRM_AUTH|DRM_UNLOCKED|DRM_RENDER_ALLOW),
}; int radeon_max_kms_ioctl = ARRAY_SIZE(radeon_ioctls_kms); diff --git a/include/uapi/drm/radeon_drm.h b/include/uapi/drm/radeon_drm.h index 50d0fb4..43fe660 100644 --- a/include/uapi/drm/radeon_drm.h +++ b/include/uapi/drm/radeon_drm.h @@ -512,6 +512,7 @@ typedef struct { #define DRM_RADEON_GEM_VA 0x2b #define DRM_RADEON_GEM_OP 0x2c #define DRM_RADEON_GEM_USERPTR 0x2d +#define DRM_RADEON_GEM_WAIT_USERFENCE 0x2e
#define DRM_IOCTL_RADEON_CP_INIT DRM_IOW( DRM_COMMAND_BASE + DRM_RADEON_CP_INIT, drm_radeon_init_t) #define DRM_IOCTL_RADEON_CP_START DRM_IO( DRM_COMMAND_BASE + DRM_RADEON_CP_START) @@ -541,21 +542,22 @@ typedef struct { #define DRM_IOCTL_RADEON_SURF_ALLOC DRM_IOW( DRM_COMMAND_BASE + DRM_RADEON_SURF_ALLOC, drm_radeon_surface_alloc_t) #define DRM_IOCTL_RADEON_SURF_FREE DRM_IOW( DRM_COMMAND_BASE + DRM_RADEON_SURF_FREE, drm_radeon_surface_free_t) /* KMS */ -#define DRM_IOCTL_RADEON_GEM_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_INFO, struct drm_radeon_gem_info) -#define DRM_IOCTL_RADEON_GEM_CREATE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_CREATE, struct drm_radeon_gem_create) -#define DRM_IOCTL_RADEON_GEM_MMAP DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_MMAP, struct drm_radeon_gem_mmap) -#define DRM_IOCTL_RADEON_GEM_PREAD DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_PREAD, struct drm_radeon_gem_pread) -#define DRM_IOCTL_RADEON_GEM_PWRITE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_PWRITE, struct drm_radeon_gem_pwrite) -#define DRM_IOCTL_RADEON_GEM_SET_DOMAIN DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_SET_DOMAIN, struct drm_radeon_gem_set_domain) -#define DRM_IOCTL_RADEON_GEM_WAIT_IDLE DRM_IOW(DRM_COMMAND_BASE + DRM_RADEON_GEM_WAIT_IDLE, struct drm_radeon_gem_wait_idle) -#define DRM_IOCTL_RADEON_CS DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_CS, struct drm_radeon_cs) -#define DRM_IOCTL_RADEON_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_INFO, struct drm_radeon_info) -#define DRM_IOCTL_RADEON_GEM_SET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_SET_TILING, struct drm_radeon_gem_set_tiling) -#define DRM_IOCTL_RADEON_GEM_GET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_GET_TILING, struct drm_radeon_gem_get_tiling) -#define DRM_IOCTL_RADEON_GEM_BUSY DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_BUSY, struct drm_radeon_gem_busy) -#define DRM_IOCTL_RADEON_GEM_VA DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_VA, struct drm_radeon_gem_va) -#define DRM_IOCTL_RADEON_GEM_OP DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_OP, struct drm_radeon_gem_op) -#define DRM_IOCTL_RADEON_GEM_USERPTR DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_USERPTR, struct drm_radeon_gem_userptr) +#define DRM_IOCTL_RADEON_GEM_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_INFO, struct drm_radeon_gem_info) +#define DRM_IOCTL_RADEON_GEM_CREATE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_CREATE, struct drm_radeon_gem_create) +#define DRM_IOCTL_RADEON_GEM_MMAP DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_MMAP, struct drm_radeon_gem_mmap) +#define DRM_IOCTL_RADEON_GEM_PREAD DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_PREAD, struct drm_radeon_gem_pread) +#define DRM_IOCTL_RADEON_GEM_PWRITE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_PWRITE, struct drm_radeon_gem_pwrite) +#define DRM_IOCTL_RADEON_GEM_SET_DOMAIN DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_SET_DOMAIN, struct drm_radeon_gem_set_domain) +#define DRM_IOCTL_RADEON_GEM_WAIT_IDLE DRM_IOW(DRM_COMMAND_BASE + DRM_RADEON_GEM_WAIT_IDLE, struct drm_radeon_gem_wait_idle) +#define DRM_IOCTL_RADEON_CS DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_CS, struct drm_radeon_cs) +#define DRM_IOCTL_RADEON_INFO DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_INFO, struct drm_radeon_info) +#define DRM_IOCTL_RADEON_GEM_SET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_SET_TILING, struct drm_radeon_gem_set_tiling) +#define DRM_IOCTL_RADEON_GEM_GET_TILING DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_GET_TILING, struct drm_radeon_gem_get_tiling) +#define DRM_IOCTL_RADEON_GEM_BUSY DRM_IOWR(DRM_COMMAND_BASE + 
DRM_RADEON_GEM_BUSY, struct drm_radeon_gem_busy) +#define DRM_IOCTL_RADEON_GEM_VA DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_VA, struct drm_radeon_gem_va) +#define DRM_IOCTL_RADEON_GEM_OP DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_OP, struct drm_radeon_gem_op) +#define DRM_IOCTL_RADEON_GEM_USERPTR DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_USERPTR, struct drm_radeon_gem_userptr) +#define DRM_IOCTL_RADEON_GEM_WAIT_USERFENCE DRM_IOWR(DRM_COMMAND_BASE + DRM_RADEON_GEM_WAIT_USERFENCE, struct drm_radeon_gem_wait_userfence)
typedef struct drm_radeon_init { enum { @@ -831,6 +833,15 @@ struct drm_radeon_gem_userptr { uint32_t handle; };
+struct drm_radeon_gem_wait_userfence {
- uint32_t handle;
- uint32_t offset;
- uint64_t fence;
- uint32_t ring;
- int32_t priority;
- uint32_t flags;
+};
#define RADEON_TILING_MACRO 0x1 #define RADEON_TILING_MICRO 0x2
#define RADEON_TILING_SWAP_16BIT 0x4
the BO to be kept in the same place while it is mapped inside the kernel page table ...
So this requires that we pin down the BO for the duration of the wait IOCTL.
But my understanding is that it should not be the duration of the "wait" IOCTL but the "duration" of command buffer execution.
BTW: I would assume that this is not a new scenario.
This is the scenario:
- User allocates a BO
- User gets a CPU address for the BO
- User submits a command buffer that writes to the BO
- User can "poll" / "read" or "write" the BO data with the CPU
So when TTM needs to move the BO to another location it should also update the CPU "mapping" correctly so the user will always read / write the correct data.
Did I miss anything?
Sincerely yours, Serguei Sagalovitch
On 15-04-13 10:52 AM, Christian König wrote:
Hello everyone,
we have a requirement for a bit different kind of fence handling. Currently we handle fences completely inside the kernel, but in the future we would like to emit multiple fences inside the same IB as well.
This works by adding multiple fence commands into an IB which just write their value to a specific location inside a BO and trigger the appropriate hardware interrupt.
The user part of the driver stack should then be able to call an IOCTL to wait for the interrupt and block for the value (or something larger) to be written to the specific location.
This has the advantage that you can have multiple synchronization points in the same IB and don't need to split up your draw commands over several IBs so that the kernel can insert kernel fences in between.
The following set of patches tries to implement exactly this IOCTL. The big problem with that IOCTL is that TTM needs the BO to be kept in the same place while it is mapped inside the kernel page table. So this requires that we pin down the BO for the duration of the wait IOCTL.
This practically gives userspace a way of pinning down BOs for as long as it wants, without the ability for the kernel for intervention.
Any ideas how to avoid those problems? Or better ideas how to handle the new requirements?
Please note that the patches are only hacked together quick&dirty to demonstrate the problem and not very well tested.
Best regards, Christian.
On 13.04.2015 17:25, Serguei Sagalovitch wrote:
the BO to be kept in the same place while it is mapped inside the kernel page table ...
So this requires that we pin down the BO for the duration of the wait IOCTL.
But my understanding is that it should be not duration of "wait" IOCTL but "duration" of command buffer execution.
BTW: I would assume that this is not the new scenario.
This is scenario: - User allocate BO - User get CPU address for BO - User submit command buffer to write to BO - User could "poll" / "read" or "write" BO data by CPU
So when TTM needs to move BO to another location it should also update CPU "mapping" correctly so user will always read / write the correct data.
Did I miss anything?
The problem is that kernel mappings are not updated when TTM moves the buffer around. In the case of a swapped out buffer that wouldn't even be possible cause kernel mappings aren't pageable.
You just can't map the BO into kernel space unless you have it pinned down, so you can't check the current value written in the BO in your IOCTL.
One alternative is to send all interrupts in question unfiltered to user space and let userspace check whether the right value was written or not. But I assume that this would be rather bad for performance.
Another alternative would be to use the userspace mapping to check the BO value, but this approach isn't compatible with a GPU scheduler. E.g. you can't really do cross-process memory access in device drivers.
Regards, Christian.
Sincerely yours, Serguei Sagalovitch
On 15-04-13 10:52 AM, Christian König wrote:
Hello everyone,
we have a requirement for a bit different kind of fence handling. Currently we handle fences completely inside the kernel, but in the future we would like to emit multiple fences inside the same IB as well.
This works by adding multiple fence commands into an IB which just write their value to a specific location inside a BO and trigger the appropriate hardware interrupt.
The user part of the driver stack should then be able to call an IOCTL to wait for the interrupt and block for the value (or something larger) to be written to the specific location.
This has the advantage that you can have multiple synchronization points in the same IB and don't need to split up your draw commands over several IBs so that the kernel can insert kernel fences in between.
The following set of patches tries to implement exactly this IOCTL. The big problem with that IOCTL is that TTM needs the BO to be kept in the same place while it is mapped inside the kernel page table. So this requires that we pin down the BO for the duration of the wait IOCTL.
This practically gives userspace a way of pinning down BOs for as long as it wants, without the ability for the kernel for intervention.
Any ideas how to avoid those problems? Or better ideas how to handle the new requirements?
Please note that the patches are only hacked together quick&dirty to demonstrate the problem and not very well tested.
Best regards, Christian.
Another alternative would be to use the userspace mapping to check the BO value ...
This is what I was thinking.
Sincerely yours, Serguei Sagalovitch
On 15-04-13 11:35 AM, Christian König wrote:
On 13.04.2015 17:25, Serguei Sagalovitch wrote:
the BO to be kept in the same place while it is mapped inside the kernel page table ...
So this requires that we pin down the BO for the duration of the wait IOCTL.
But my understanding is that it should be not duration of "wait" IOCTL but "duration" of command buffer execution.
BTW: I would assume that this is not the new scenario.
This is scenario: - User allocate BO - User get CPU address for BO - User submit command buffer to write to BO - User could "poll" / "read" or "write" BO data by CPU
So when TTM needs to move BO to another location it should also update CPU "mapping" correctly so user will always read / write the correct data.
Did I miss anything?
The problem is that kernel mappings are not updated when TTM moves the buffer around. In the case of a swapped out buffer that wouldn't even be possible cause kernel mappings aren't pageable.
You just can't map the BO into kernel space unless you have it pinned down, so you can't check the current value written in the BO in your IOCTL.
One alternative is to send all interrupts in question unfiltered to user space and let userspace do the check if the right value was written or not. But I assume that this would be rather bad for performance.
Another alternative would be to use the userspace mapping to check the BO value, but this approach isn't compatible with a GPU scheduler. E.g. you can't really do cross process space memory access in device drivers.
Regards, Christian.
Sincerely yours, Serguei Sagalovitch
On 15-04-13 10:52 AM, Christian König wrote:
Hello everyone,
we have a requirement for a bit different kind of fence handling. Currently we handle fences completely inside the kernel, but in the future we would like to emit multiple fences inside the same IB as well.
This works by adding multiple fence commands into an IB which just write their value to a specific location inside a BO and trigger the appropriate hardware interrupt.
The user part of the driver stack should then be able to call an IOCTL to wait for the interrupt and block for the value (or something larger) to be written to the specific location.
This has the advantage that you can have multiple synchronization points in the same IB and don't need to split up your draw commands over several IBs so that the kernel can insert kernel fences in between.
The following set of patches tries to implement exactly this IOCTL. The big problem with that IOCTL is that TTM needs the BO to be kept in the same place while it is mapped inside the kernel page table. So this requires that we pin down the BO for the duration of the wait IOCTL.
This practically gives userspace a way of pinning down BOs for as long as it wants, without the ability for the kernel for intervention.
Any ideas how to avoid those problems? Or better ideas how to handle the new requirements?
Please note that the patches are only hacked together quick&dirty to demonstrate the problem and not very well tested.
Best regards, Christian.
On Mon, Apr 13, 2015 at 05:35:04PM +0200, Christian König wrote:
On 13.04.2015 17:25, Serguei Sagalovitch wrote:
the BO to be kept in the same place while it is mapped inside the kernel page table ...
So this requires that we pin down the BO for the duration of the wait IOCTL.
But my understanding is that it should be not duration of "wait" IOCTL but "duration" of command buffer execution.
BTW: I would assume that this is not the new scenario.
This is scenario:
- User allocate BO
- User get CPU address for BO
- User submit command buffer to write to BO
- User could "poll" / "read" or "write" BO data by CPU
So when TTM needs to move BO to another location it should also update CPU "mapping" correctly so user will always read / write the correct data.
Did I miss anything?
The problem is that kernel mappings are not updated when TTM moves the buffer around. In the case of a swapped out buffer that wouldn't even be possible cause kernel mappings aren't pageable.
You just can't map the BO into kernel space unless you have it pinned down, so you can't check the current value written in the BO in your IOCTL.
One alternative is to send all interrupts in question unfiltered to user space and let userspace do the check if the right value was written or not. But I assume that this would be rather bad for performance.
Yeah, this most likely would be seriously bad. It might even allow malicious userspace to force an IRQ storm.
Another alternative would be to use the userspace mapping to check the BO value, but this approach isn't compatible with a GPU scheduler. E.g. you can't really do cross process space memory access in device drivers.
Not to mention that you would need an mmu_notifier to protect you from munmap. I think the solution I proposed in the other mail is the simplest and safest.
Cheers, Jérôme
Regards, Christian.
Sincerely yours, Serguei Sagalovitch
On 15-04-13 10:52 AM, Christian König wrote:
Hello everyone,
we have a requirement for a bit different kind of fence handling. Currently we handle fences completely inside the kernel, but in the future we would like to emit multiple fences inside the same IB as well.
This works by adding multiple fence commands into an IB which just write their value to a specific location inside a BO and trigger the appropriate hardware interrupt.
The user part of the driver stack should then be able to call an IOCTL to wait for the interrupt and block for the value (or something larger) to be written to the specific location.
This has the advantage that you can have multiple synchronization points in the same IB and don't need to split up your draw commands over several IBs so that the kernel can insert kernel fences in between.
The following set of patches tries to implement exactly this IOCTL. The big problem with that IOCTL is that TTM needs the BO to be kept in the same place while it is mapped inside the kernel page table. So this requires that we pin down the BO for the duration of the wait IOCTL.
This practically gives userspace a way of pinning down BOs for as long as it wants, without the ability for the kernel for intervention.
Any ideas how to avoid those problems? Or better ideas how to handle the new requirements?
Please note that the patches are only hacked together quick&dirty to demonstrate the problem and not very well tested.
Best regards, Christian.
On Mon, Apr 13, 2015 at 11:25:30AM -0400, Serguei Sagalovitch wrote:
the BO to be kept in the same place while it is mapped inside the kernel page table ...
So this requires that we pin down the BO for the duration of the wait IOCTL.
But my understanding is that it should be not duration of "wait" IOCTL but "duration" of command buffer execution.
BTW: I would assume that this is not the new scenario.
This is scenario: - User allocate BO - User get CPU address for BO - User submit command buffer to write to BO - User could "poll" / "read" or "write" BO data by CPU
So when TTM needs to move BO to another location it should also update CPU "mapping" correctly so user will always read / write the correct data.
Did I miss anything?
No, this is how things work. But we want to avoid pinning the buffer. One use case for this userspace fence is, I assume, the same BO and the same user fence used across several command buffers. Given that the userspace wait-fence ioctl has no way to know which command buffer it is really waiting for, the kernel has no knowledge of whether this user fence will ever signal. So a malicious userspace (we always have to assume such a thing exists) could create a big VRAM BO and effectively pin it in VRAM, leading to a GPU DOS (denial of service).
By the way, Christian, I would add a timeout to this ioctl and return EAGAIN to userspace on timeout so that userspace can resubmit. That way a malicious userspace will just keep exhausting its own CPU timeslot.
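For illustration, the wait could then look roughly like this (sketch only; the timeout value is an arbitrary placeholder):

	/* bounded wait, so userspace has to come back and resubmit */
	long t = wait_event_interruptible_timeout(rdev->fence_queue,
			*fence_addr >= args->fence || rdev->needs_reset,
			msecs_to_jiffies(100));
	if (t > 0)
		r = 0;		/* condition became true */
	else if (t == 0)
		r = -EAGAIN;	/* timed out, userspace resubmits the wait */
	else
		r = t;		/* interrupted: -ERESTARTSYS */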
Cheers, Jérôme
On 13.04.2015 17:39, Jerome Glisse wrote:
On Mon, Apr 13, 2015 at 11:25:30AM -0400, Serguei Sagalovitch wrote:
the BO to be kept in the same place while it is mapped inside the kernel page table ...
So this requires that we pin down the BO for the duration of the wait IOCTL.
But my understanding is that it should be not duration of "wait" IOCTL but "duration" of command buffer execution.
BTW: I would assume that this is not the new scenario.
This is scenario: - User allocate BO - User get CPU address for BO - User submit command buffer to write to BO - User could "poll" / "read" or "write" BO data by CPU
So when TTM needs to move BO to another location it should also update CPU "mapping" correctly so user will always read / write the correct data.
Did I miss anything?
No this is how things works. But we want to avoid pinning buffer. One use case for this userspace fence is i assume same BO same user fence use accross several command buffer. Given that the userspace wait fence ioctl has not way to know which command buffer it is really waiting after then kernel has no knowledge of if this user fence will signal at all. So a malicious user space (we always have to assume this thing exist) could create a big VRAM BO and effectively pin it in VRAM leading to a GPU DOS (denial of service).
By the way Christian, i would add a timeout to this ioctl and return eagain to userspace on timeout so that userspace can resumit. That way a malicious userspace will just keep exhausting its cpu timeslot.
Yeah, I knew. I honestly haven't even tested the implementation, I just first wanted to check how far off the whole idea is.
On the one hand it is rather appealing, but on the other it's a completely different approach from what we have done so far. E.g. we can pretty much forget the whole kernel fence framework with that.
Regards, Christian.
Cheers, Jérôme
Given that the userspace wait fence ioctl has no way to know which command buffer it is really waiting for, the kernel has no knowledge of whether this user fence will signal at all.
We could pass the BO handle as a parameter in the "fence ioctl" so the kernel will know which BO it is waiting on.
So a malicious userspace (we always have to assume such a thing exists) could create a big VRAM BO and effectively pin it in VRAM, leading to a GPU DOS (denial of service).
This problem always exists. A malicious userspace could allocate a big BO and then submit a shader running in a loop which reads/writes this BO. It could also spawn processes which do the same thing. IMHO the only way to improve the situation is to have a GPU scheduler that allows "unloading" such an application.
Sincerely yours, Serguei Sagalovitch
On 15-04-13 11:39 AM, Jerome Glisse wrote:
On Mon, Apr 13, 2015 at 11:25:30AM -0400, Serguei Sagalovitch wrote:
the BO to be kept in the same place while it is mapped inside the kernel
page table ...
So this requires that we pin down the BO for the duration of the wait
IOCTL.
But my understanding is that it should be not duration of "wait" IOCTL but "duration" of command buffer execution.
BTW: I would assume that this is not the new scenario.
This is scenario: - User allocate BO - User get CPU address for BO - User submit command buffer to write to BO - User could "poll" / "read" or "write" BO data by CPU
So when TTM needs to move BO to another location it should also update CPU "mapping" correctly so user will always read / write the correct data.
Did I miss anything?
No this is how things works. But we want to avoid pinning buffer. One use case for this userspace fence is i assume same BO same user fence use accross several command buffer. Given that the userspace wait fence ioctl has not way to know which command buffer it is really waiting after then kernel has no knowledge of if this user fence will signal at all. So a malicious user space (we always have to assume this thing exist) could create a big VRAM BO and effectively pin it in VRAM leading to a GPU DOS (denial of service).
By the way Christian, i would add a timeout to this ioctl and return eagain to userspace on timeout so that userspace can resumit. That way a malicious userspace will just keep exhausting its cpu timeslot.
Cheers, Jérôme
On Mon, Apr 13, 2015 at 11:52:19AM -0400, Serguei Sagalovitch wrote:
Given that the userspace wait fence ioctl has not way to know which command buffer it is really waiting after then kernel has no knowledge if this user fence will signal at all.
We could pass BO handle as parameter in the "fence ioctl" to rely on it so kernel will know which BO it is waiting.
Christian's code already does that, but this is not enough to know which CS the kernel is waiting for. See my other emails with Christian.
So a malicious user space (we always have to assume this thing exist) could create a big VRAM BO and effectively pin it in VRAM leading to a GPU DOS (denial of service).
This problem always exists. Malicious user space could allocate big BO and then submit shader running in loop which read/write from this BO. It could also spawn processes which will do the same thing. IMHO the only way to improve situation is to have GPU Scheduler to allow "unloading" such application.
Yes, but the GPU lockup watchdog would kick in (thinking it is a GPU lockup) and reset the GPU, which is harsh, but that is what we have now (well, I think the compute guys messed with that, so it might no longer be the case).
So it is just about avoiding giving userspace more means to mess with things.
Cheers, Jérôme
On Mon, Apr 13, 2015 at 04:52:14PM +0200, Christian König wrote:
Hello everyone,
we have a requirement for a bit different kind of fence handling. Currently we handle fences completely inside the kernel, but in the future we would like to emit multiple fences inside the same IB as well.
This works by adding multiple fence commands into an IB which just write their value to a specific location inside a BO and trigger the appropriate hardware interrupt.
The user part of the driver stack should then be able to call an IOCTL to wait for the interrupt and block for the value (or something larger) to be written to the specific location.
This has the advantage that you can have multiple synchronization points in the same IB and don't need to split up your draw commands over several IBs so that the kernel can insert kernel fences in between.
The following set of patches tries to implement exactly this IOCTL. The big problem with that IOCTL is that TTM needs the BO to be kept in the same place while it is mapped inside the kernel page table. So this requires that we pin down the BO for the duration of the wait IOCTL.
This practically gives userspace a way of pinning down BOs for as long as it wants, without the ability for the kernel for intervention.
Any ideas how to avoid those problems? Or better ideas how to handle the new requirements?
So I think the simplest solution is to only allow such "fence" BOs to be inside system memory (no VRAM for them). My assumption here is that such a BO will barely see more than a couple of dword writes, so it is not a bandwidth-intensive BO. Or do you have a requirement for such a BO to be in VRAM?
Now to do that I would just add a property to the buffer object that effectively forbids such a BO from being placed anywhere else than GTT. Doing that would make the ioctl code simpler: just check that the BO has the GTT-only property set and if not return -EINVAL. Then it's a simple matter of kmapping the proper page.
Note that the only thing that would be left to forbid is the swapping of the buffer due to memory pressure (from the various TTM/core shrinkers).
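A minimal sketch of that -EINVAL check, assuming a hypothetical RADEON_GEM_GTT_ONLY creation flag is recorded in the BO's flags when userspace allocates it (no such flag exists today; the bit value is arbitrary):

#include "radeon.h"

/* Hypothetical creation flag marking a BO that may only ever live in GTT;
 * bit chosen arbitrarily for this sketch. */
#define RADEON_GEM_GTT_ONLY	(1 << 5)

static int radeon_user_fence_check_bo(struct radeon_bo *rbo)
{
	/* A fence BO that could end up in VRAM (or be CPU-inaccessible)
	 * would have to be pinned while the kernel reads the fence value,
	 * which is exactly what we are trying to avoid. */
	if (!(rbo->flags & RADEON_GEM_GTT_ONLY))
		return -EINVAL;

	return 0;
}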
Cheers, Jérôme
On 13.04.2015 17:31, Jerome Glisse wrote:
On Mon, Apr 13, 2015 at 04:52:14PM +0200, Christian König wrote:
Hello everyone,
we have a requirement for a bit different kind of fence handling. Currently we handle fences completely inside the kernel, but in the future we would like to emit multiple fences inside the same IB as well.
This works by adding multiple fence commands into an IB which just write their value to a specific location inside a BO and trigger the appropriate hardware interrupt.
The user part of the driver stack should then be able to call an IOCTL to wait for the interrupt and block for the value (or something larger) to be written to the specific location.
This has the advantage that you can have multiple synchronization points in the same IB and don't need to split up your draw commands over several IBs so that the kernel can insert kernel fences in between.
The following set of patches tries to implement exactly this IOCTL. The big problem with that IOCTL is that TTM needs the BO to be kept in the same place while it is mapped inside the kernel page table. So this requires that we pin down the BO for the duration of the wait IOCTL.
This practically gives userspace a way of pinning down BOs for as long as it wants, without the ability for the kernel for intervention.
Any ideas how to avoid those problems? Or better ideas how to handle the new requirements?
So i think the simplest solution is to only allow such "fence" bo to be inside system memory (no vram for them). My assumption here is that such BO will barely see more than couple dword write so it is not a bandwidth intensive BO. Or do you have a requirement for such BO to be in VRAM ?
Not that I know of.
Now to do that i would just add a property to buffer object that effectively forbid such BO to be place anywhere else than GTT. Doing that would make the ioctl code simpler, just check the BO as the GTT only property set and if not return -EINVAL. Then its a simple matter of kmapping the proper page.
I've also considered adding an internal flag so that when a buffer is kmapped we avoid moving it to VRAM / swapping it out, but see below.
Note that the only thing that would be left to forbid is the swaping of the buffer due to memory pressure (from various ttm/core shrinker).
Yeah, how the heck would I do that? That's internals of TTM that I never got into.
Thanks for the ideas, Christian.
Cheers, Jérôme
On Mon, Apr 13, 2015 at 05:45:21PM +0200, Christian König wrote:
On 13.04.2015 17:31, Jerome Glisse wrote:
On Mon, Apr 13, 2015 at 04:52:14PM +0200, Christian König wrote:
Hello everyone,
we have a requirement for a bit different kind of fence handling. Currently we handle fences completely inside the kernel, but in the future we would like to emit multiple fences inside the same IB as well.
This works by adding multiple fence commands into an IB which just write their value to a specific location inside a BO and trigger the appropriate hardware interrupt.
The user part of the driver stack should then be able to call an IOCTL to wait for the interrupt and block for the value (or something larger) to be written to the specific location.
This has the advantage that you can have multiple synchronization points in the same IB and don't need to split up your draw commands over several IBs so that the kernel can insert kernel fences in between.
The following set of patches tries to implement exactly this IOCTL. The big problem with that IOCTL is that TTM needs the BO to be kept in the same place while it is mapped inside the kernel page table. So this requires that we pin down the BO for the duration of the wait IOCTL.
This practically gives userspace a way of pinning down BOs for as long as it wants, without the ability for the kernel for intervention.
Any ideas how to avoid those problems? Or better ideas how to handle the new requirements?
So i think the simplest solution is to only allow such "fence" bo to be inside system memory (no vram for them). My assumption here is that such BO will barely see more than couple dword write so it is not a bandwidth intensive BO. Or do you have a requirement for such BO to be in VRAM ?
Not that I know off.
Now to do that i would just add a property to buffer object that effectively forbid such BO to be place anywhere else than GTT. Doing that would make the ioctl code simpler, just check the BO as the GTT only property set and if not return -EINVAL. Then its a simple matter of kmapping the proper page.
I've also considered adding an internal flag that when a buffer is kmapped we avoid moving it to VRAM / swapping it out, but see below.
Note that the only thing that would be left to forbid is the swaping of the buffer due to memory pressure (from various ttm/core shrinker).
Yeah, how the heck would I do that? That's internals of TTM that I never got into.
Actually I think it is easier than I first thought: in the wait ioctl, check if the buffer has a pending fence, i.e. the GPU is still using it; if not, return -EAGAIN because it means it is pointless to wait for the next GPU interrupt.
For as long as the BO has an active fence it will not be swapped out (see ttm_bo_swapout()). So in the wait event, test both the value and the pending fence. If the pending fence signals but not the value, then return -EAGAIN. In all cases just keep a kmap of the page (do not kmap using the existing kmap helper; we would need something new so as not to interfere with the shrinker). Note that after testing the value you would need to check that the BO was not moved, and thus that the page you were looking into is still the one the BO is using.
That way userspace cannot abuse this ioctl to block the shrinker from making progress.
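Condensed into code, the wait condition described here could look roughly like this (a sketch only; fenceptr is assumed to be the kmapped location of the fence value and "wanted" the value userspace is waiting for):

#include "radeon.h"

/* Wake up when either the user fence value has been written or the BO
 * has no pending kernel fences left, in which case waiting any longer
 * is pointless and userspace gets -EAGAIN. */
static int radeon_user_fence_wait_value(struct radeon_device *rdev,
					struct radeon_bo *rbo,
					volatile uint64_t *fenceptr,
					uint64_t wanted)
{
	int r;

	r = wait_event_interruptible(rdev->fence_queue,
		*fenceptr >= wanted ||
		reservation_object_test_signaled_rcu(rbo->tbo.resv, true));
	if (r)
		return r;	/* interrupted by a signal */

	/* Pending GPU work finished but the value never showed up; let
	 * userspace decide whether to resubmit the wait. */
	return *fenceptr >= wanted ? 0 : -EAGAIN;
}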
I need to look at the TTM kmap code to see if it is actually usable without disrupting the shrinker. Will do that after lunch.
Cheers, Jérôme
Thanks for the ideas, Christian.
Cheers, Jérôme
On 13.04.2015 18:08, Jerome Glisse wrote:
On Mon, Apr 13, 2015 at 05:45:21PM +0200, Christian König wrote:
On 13.04.2015 17:31, Jerome Glisse wrote:
On Mon, Apr 13, 2015 at 04:52:14PM +0200, Christian König wrote:
Hello everyone,
we have a requirement for a bit different kind of fence handling. Currently we handle fences completely inside the kernel, but in the future we would like to emit multiple fences inside the same IB as well.
This works by adding multiple fence commands into an IB which just write their value to a specific location inside a BO and trigger the appropriate hardware interrupt.
The user part of the driver stack should then be able to call an IOCTL to wait for the interrupt and block for the value (or something larger) to be written to the specific location.
This has the advantage that you can have multiple synchronization points in the same IB and don't need to split up your draw commands over several IBs so that the kernel can insert kernel fences in between.
The following set of patches tries to implement exactly this IOCTL. The big problem with that IOCTL is that TTM needs the BO to be kept in the same place while it is mapped inside the kernel page table. So this requires that we pin down the BO for the duration of the wait IOCTL.
This practically gives userspace a way of pinning down BOs for as long as it wants, without the ability for the kernel for intervention.
Any ideas how to avoid those problems? Or better ideas how to handle the new requirements?
So i think the simplest solution is to only allow such "fence" bo to be inside system memory (no vram for them). My assumption here is that such BO will barely see more than couple dword write so it is not a bandwidth intensive BO. Or do you have a requirement for such BO to be in VRAM ?
Not that I know off.
Now to do that i would just add a property to buffer object that effectively forbid such BO to be place anywhere else than GTT. Doing that would make the ioctl code simpler, just check the BO as the GTT only property set and if not return -EINVAL. Then its a simple matter of kmapping the proper page.
I've also considered adding an internal flag that when a buffer is kmapped we avoid moving it to VRAM / swapping it out, but see below.
Note that the only thing that would be left to forbid is the swaping of the buffer due to memory pressure (from various ttm/core shrinker).
Yeah, how the heck would I do that? That's internals of TTM that I never got into.
Actualy i think it is easier then i first thought, in the wait ioctl check if the buffer has a pending fence ie gpu is still using it, if not return -EAGAIN because it means that it is pointless to wait for next GPU interrupt.
For as long as the BO has an active fence it will not be swapped out (see ttm_bo_swapout()). So in the wait event test both the value and the pending fence. If the pending fence signal but not the value then return -EAGAIN. In all case just keep a kmap of the page (do not kmap the using existing kmap helper we would need something new to not interfer with the shrinker). Not that after testing the value you would need to check that the BO was not move and thus the page you were looking into is still the one the BO is using.
That way userspace can not abuse this ioctl to block the shrinker from making progress.
So what we do at the start of the IOCTL is check the BO's fences to see if it is actually still used, and note its current placement.
Then map it so the kernel can access it, and in the waiting loop we check if it still has a fence and is still in the same place.
If there aren't any fences left or the placement has changed, we simply assume that the fence is signaled.
Yeah, that actually should work.
Thanks for the tip, Christian.
I need to look at ttm kmap code to see if it is actually useable without disrupting the shrinker. Will do that after lunch.
Cheers, Jérôme
Thanks for the ideas, Christian.
Cheers, Jérôme
On Mon, Apr 13, 2015 at 06:55:19PM +0200, Christian König wrote:
On 13.04.2015 18:08, Jerome Glisse wrote:
On Mon, Apr 13, 2015 at 05:45:21PM +0200, Christian König wrote:
On 13.04.2015 17:31, Jerome Glisse wrote:
On Mon, Apr 13, 2015 at 04:52:14PM +0200, Christian König wrote:
Hello everyone,
we have a requirement for a bit different kind of fence handling. Currently we handle fences completely inside the kernel, but in the future we would like to emit multiple fences inside the same IB as well.
This works by adding multiple fence commands into an IB which just write their value to a specific location inside a BO and trigger the appropriate hardware interrupt.
The user part of the driver stack should then be able to call an IOCTL to wait for the interrupt and block for the value (or something larger) to be written to the specific location.
This has the advantage that you can have multiple synchronization points in the same IB and don't need to split up your draw commands over several IBs so that the kernel can insert kernel fences in between.
The following set of patches tries to implement exactly this IOCTL. The big problem with that IOCTL is that TTM needs the BO to be kept in the same place while it is mapped inside the kernel page table. So this requires that we pin down the BO for the duration of the wait IOCTL.
This practically gives userspace a way of pinning down BOs for as long as it wants, without the ability for the kernel for intervention.
Any ideas how to avoid those problems? Or better ideas how to handle the new requirements?
So i think the simplest solution is to only allow such "fence" bo to be inside system memory (no vram for them). My assumption here is that such BO will barely see more than couple dword write so it is not a bandwidth intensive BO. Or do you have a requirement for such BO to be in VRAM ?
Not that I know off.
Now to do that i would just add a property to buffer object that effectively forbid such BO to be place anywhere else than GTT. Doing that would make the ioctl code simpler, just check the BO as the GTT only property set and if not return -EINVAL. Then its a simple matter of kmapping the proper page.
I've also considered adding an internal flag that when a buffer is kmapped we avoid moving it to VRAM / swapping it out, but see below.
Note that the only thing that would be left to forbid is the swaping of the buffer due to memory pressure (from various ttm/core shrinker).
Yeah, how the heck would I do that? That's internals of TTM that I never got into.
Actualy i think it is easier then i first thought, in the wait ioctl check if the buffer has a pending fence ie gpu is still using it, if not return -EAGAIN because it means that it is pointless to wait for next GPU interrupt.
For as long as the BO has an active fence it will not be swapped out (see ttm_bo_swapout()). So in the wait event test both the value and the pending fence. If the pending fence signal but not the value then return -EAGAIN. In all case just keep a kmap of the page (do not kmap the using existing kmap helper we would need something new to not interfer with the shrinker). Not that after testing the value you would need to check that the BO was not move and thus the page you were looking into is still the one the BO is using.
That way userspace can not abuse this ioctl to block the shrinker from making progress.
So what we do on the start of the IOCTL is to check the BOs fences and see if it actually is still used and note it's current placement.
Then map it so the kernel can access it and in the waiting loop we check if it still has a fence and is still in the same place.
If there isn't any fences left or the placement has changed we simply assume that the fence is signaled.
Well no, code is easier, so pseudocode below:
radeon_user_fence_wait_ioctl()
{
	/* rdev, ring and args come from the usual ioctl context. */
	struct radeon_bo *rbo;
	struct page *page;
	uint64_t *fenceptr;
	int r;

	radeon_bo_reserve(rbo, false);
	if (!(radeon_bo_property(rbo) & GTT_ONLY)) {
		radeon_bo_unreserve(rbo);
		return -EINVAL;
	}

	if (reservation_object_test_signaled_rcu(rbo->tbo.resv, true)) {
		radeon_bo_unreserve(rbo);
		/* There is no pending work on this BO so just have
		 * userspace check the fence itself.  Proper userspace
		 * will check that the fence has not signaled when
		 * getting -EAGAIN. */
		return -EAGAIN;
	}

	page = ttm_get_page_at(&rbo->tbo, args->offset);
	/* Maybe we just want to return -EINVAL, but if !page then
	 * something is seriously wrong here. */
	BUG_ON(!page);

	fenceptr = kmap(page) + (args->offset & ~PAGE_MASK);
	radeon_bo_unreserve(rbo);

	/* Do not hold the object while waiting. */

	radeon_irq_kms_sw_irq_get(rdev, ring);
	/* FIXME: timeout */
	r = wait_event_interruptible(rdev->fence_queue,
		(reservation_object_test_signaled_rcu(rbo->tbo.resv, true) ||
		 *fenceptr >= args->fence ||
		 rdev->needs_reset));
	radeon_irq_kms_sw_irq_put(rdev, ring);

	radeon_bo_reserve(rbo, false);
	if (*fenceptr >= args->fence &&
	    page == ttm_get_page_at(&rbo->tbo, args->offset)) {
		radeon_bo_unreserve(rbo);
		kunmap(page);
		/* Success, the user fence signaled. */
		return 0;
	}
	radeon_bo_unreserve(rbo);
	kunmap(page);

	/* Fence is not signaled and either the BO no longer has pending
	 * work, or we timed out, or the GPU needs a reset. */
	return -EAGAIN;
}
The point here is that it's ok to map the page in the first place, as the object is still in use. But then even if the object is swapped out it is ok to keep that page mapped; worst case we are just accessing a page used by someone else, but we are only reading.
That is why we have to check again after the wait that we are still reading from the right page, to know for sure whether this is actually the fence value we are looking at or just some random page.
That way, because we do not hold anything on the object while waiting, we are not blocking anything from happening. The object can be swapped out and thus there is no way for userspace to abuse this ioctl.
The new GTT-only property only ensures that ttm_get_page_at() will actually be successful if the object is active. If the object is swapped out then ttm_get_page_at() returns NULL. But in the first part of the ioctl, as we checked that the object is still active, we know it cannot be swapped out, so ttm_get_page_at() must return a proper page.
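ttm_get_page_at() does not exist in TTM; as a rough idea of what such a helper could look like for a GTT-only BO (assumptions: the backing ttm_tt page array is populated while the BO is active, and NULL is the "BO moved or swapped out" signal):

#include <drm/ttm/ttm_bo_api.h>
#include <drm/ttm/ttm_bo_driver.h>

static struct page *ttm_get_page_at(struct ttm_buffer_object *bo,
				    uint64_t offset)
{
	unsigned long idx = offset >> PAGE_SHIFT;

	/* No system pages backing the BO (unpopulated, swapped out or not
	 * a GTT placement): the caller must treat this as "BO moved". */
	if (!bo->ttm || bo->ttm->state == tt_unpopulated ||
	    idx >= bo->ttm->num_pages)
		return NULL;

	return bo->ttm->pages[idx];
}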
Cheers, Jérôme
Yeah, that actually should work.
Thanks for the tip, Christian.
I need to look at ttm kmap code to see if it is actually useable without disrupting the shrinker. Will do that after lunch.
Cheers, Jérôme
Thanks for the ideas, Christian.
Cheers, Jérôme