Hi,
I spent some time playing with DRI3/Present + PRIME to test how well it works for Optimus/Enduro style setups wrt. page flipping on the current kernel/mesa/xorg. I want page flipping, because neuroscience/medical applications need the reliable timing/timestamping and tear-free presentation we currently can only get via page flipping, not via the copy-swap path.
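(For context: such applications schedule and timestamp swaps via GLX_OML_sync_control. A minimal sketch of that usage, assuming the extension is present and eliding all error checking:)

#include <GL/glx.h>
#include <GL/glxext.h>
#include <stdint.h>

/* OML entry points must be resolved at runtime. */
static PFNGLXGETSYNCVALUESOMLPROC pGetSyncValuesOML;
static PFNGLXSWAPBUFFERSMSCOMLPROC pSwapBuffersMscOML;

static void present_at_next_vblank(Display *dpy, GLXDrawable win)
{
	int64_t ust, msc, sbc;

	pGetSyncValuesOML = (PFNGLXGETSYNCVALUESOMLPROC)
		glXGetProcAddress((const GLubyte *)"glXGetSyncValuesOML");
	pSwapBuffersMscOML = (PFNGLXSWAPBUFFERSMSCOMLPROC)
		glXGetProcAddress((const GLubyte *)"glXSwapBuffersMscOML");

	/* Current timestamp (ust), vblank count (msc), swap count (sbc). */
	pGetSyncValuesOML(dpy, win, &ust, &msc, &sbc);

	/* Schedule the swap for exactly the next vblank. The completion
	 * then carries the ust/msc of actual display onset, which is only
	 * trustworthy when the swap goes through a page flip. */
	pSwapBuffersMscOML(dpy, win, msc + 1, 0, 0);
}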
Intel as display gpu + nouveau for render offload worked nicely on intel-ddx with page flipping, proper timing, dmabuf fence sync and all.
AMD uses copy swaps because radeon/amdgpu kms can't switch the scanout mode from tiled to linear on the fly during flips. That's a todo in itself. For the moment I used the ati-ddx with Option "ColorTiling"/"ColorTiling2D" "off" to force my pair of old Radeon HD 5770s into linear mode, so page flipping can be used for prime. The current modesetting-ddx will use page flipping in any case, as it doesn't detect the tiling format mismatch.
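I.e., roughly this in xorg.conf (sketch; the Identifier is arbitrary, the option names are the radeon ddx ones mentioned above):

Section "Device"
	Identifier "Radeon"
	Driver "radeon"
	Option "ColorTiling" "off"
	Option "ColorTiling2D" "off"
EndSection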
nouveau uses page flips.
Turns out that prime + page flipping currently doesn't work on nouveau and amd. The first offload-rendered images from the imported dmabufs show up properly, but then the display is stuck alternating between the first two or three rendered frames.
The problem is that during the pageflip ioctl we pin the dmabuf into VRAM in preparation for scanout, then unpin it when we are done with it at the next flip, but the buffer stays in the VRAM memory domain. The next time we flip to the buffer, the driver skips the DMA copy from GTT to VRAM during pinning, because the buffer's content apparently already resides in VRAM. Therefore it doesn't update the VRAM copy with the updated dmabuf content in system RAM, so freshly rendered frames from the prime export/render offload gpu never reach the display gpu and one only sees stale images.
The attached patches for nouveau and radeon kms seem to work pretty well: page flipping works, the display updates tear-free, dmabuf fence sync works, and onset timing/timestamping is correct. They simply pin the buffer back into GTT, then unpin, to force a move of the buffer into the GTT domain, and thereby force the following pin to do a new copy from GTT -> VRAM. The code tries to avoid a useless copy from VRAM -> GTT during the pin op.
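To illustrate the failure and the workaround, here is a toy model of the domain bookkeeping (ordinary standalone C, not driver code; all names are made up for illustration):

#include <stdio.h>
#include <string.h>

enum domain { GTT, VRAM };

struct bo {
	enum domain domain;
	char vram_copy[16];    /* what scanout would display        */
	char dmabuf_store[16]; /* system RAM, written by export gpu */
};

/* Pin for scanout: only copies GTT -> VRAM if not already in VRAM. */
static void pin_vram(struct bo *b)
{
	if (b->domain != VRAM) /* the skip that causes stale frames */
		strcpy(b->vram_copy, b->dmabuf_store);
	b->domain = VRAM;
}

static void unpin(struct bo *b)
{
	(void)b; /* unpinning does not change the memory domain */
}

/* The workaround: a pseudo-move back to GTT resets the domain. */
static void pin_gtt(struct bo *b)
{
	b->domain = GTT;
}

int main(void)
{
	struct bo b = { .domain = GTT };

	strcpy(b.dmabuf_store, "frame 1");
	pin_vram(&b);
	unpin(&b);
	printf("scanout: %s\n", b.vram_copy); /* frame 1 */

	strcpy(b.dmabuf_store, "frame 3");    /* exporter rendered on */
	pin_vram(&b);
	unpin(&b);
	printf("scanout: %s\n", b.vram_copy); /* still frame 1: stale */

	pin_gtt(&b);                          /* the patches' trick   */
	unpin(&b);
	pin_vram(&b);
	printf("scanout: %s\n", b.vram_copy); /* frame 3: up to date  */
	return 0;
}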
However, the approach feels very much like a hack, so I assume this is not the proper way of doing it? I looked at what ttm has to offer, but couldn't find anything elegant and obvious. Maybe there is a way to evict a bo without actually copying data back to RAM? Or to invalidate the VRAM copy as stale? Maybe I just missed something, as I'm not very familiar with ttm.
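E.g., the most direct route I could see would be a plain ttm_bo_validate() into a GTT-only placement on the reserved bo, along the lines of this untested sketch (and even that would still do the data copy unless it is filtered out as in my patches):

static int force_bo_to_gtt(struct ttm_buffer_object *bo)
{
	struct ttm_place place = {
		.fpfn = 0,
		.lpfn = 0,
		.flags = TTM_PL_FLAG_TT | TTM_PL_FLAG_CACHED,
	};
	struct ttm_placement placement = {
		.num_placement = 1,
		.placement = &place,
		.num_busy_placement = 1,
		.busy_placement = &place,
	};

	/* Caller must already hold the bo reservation. */
	return ttm_bo_validate(bo, &placement, false, false);
}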
Thoughts or suggestions?
Another insight from my hacks so far is that nouveau seems to be fast as prime exporter/render offload, but rather slow as display gpu/prime importer, as tested on a 2008 or 2009 MacBookPro dual-Nvidia laptop.
AMD, as tested with dual Radeon HD 5770s, seems to be fast as prime importer/display gpu, but very slow as prime exporter/render offload, e.g., taking 16 msecs to get a 1920x1080 framebuffer into RAM. That is only about 8 MB of pixel data, i.e., roughly 0.5 GB/s, far below what the hardware's DMA engines should manage. Mesa's blitImage function seems to be the slow bit here. On r600 it seems to draw a textured triangle strip to detile the gpu renderbuffer and copy it into GTT. As drawing a textured fullscreen quad is normally much faster, something special seems to be going on there wrt. DMA? However, I don't have a real Enduro test setup with AMD iGPU + dGPU, only this cobbled-together pair of HD 5770s in a MacPro, so this could be wrong.
thanks, -mario
Scanout bos which are dmabuf-backed in RAM and imported via prime will not update their content with new rendering from the render offload gpu once they've been flipped onto the scanout. The reason is that in preparation of the first flip they get pinned into VRAM, then unpinned at some later point, but they stay in the VRAM memory domain, so updates to the system RAM dmabuf object by the exporting render offload gpu don't lead to updates of the content in VRAM - it becomes stale.

For prime imported dmabufs we solve this by first pinning the bo into GTT, which resets the bo's domain back to GTT, then unpinning again, so the follow-up pinning into VRAM will actually upload an up-to-date display buffer from the dmabuf's GTT backing store.

During the pinning into GTT, we skip the actual data move from VRAM to GTT to avoid a needless bo copy of stale image data.
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
---
 drivers/gpu/drm/nouveau/nouveau_bo.c      | 35 +++++++++++++++++++++++++++++--
 drivers/gpu/drm/nouveau/nouveau_bo.h      |  1 +
 drivers/gpu/drm/nouveau/nouveau_display.c | 17 +++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_prime.c   |  1 +
 4 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 6190035..87052e4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -38,6 +38,18 @@
 #include "nouveau_ttm.h"
 #include "nouveau_gem.h"
 
+static inline bool nouveau_dmabuf_skip_op(struct ttm_buffer_object *bo,
+					  struct ttm_mem_reg *new_mem)
+{
+	struct nouveau_bo *nvbo = nouveau_bo(bo);
+
+	/*
+	 * Return true if an expensive operation as part of a dmabuf
+	 * bo copy from VRAM to GTT can be skipped on this bo.
+	 */
+	return nvbo->prime_imported && new_mem && new_mem->mem_type == TTM_PL_TT;
+}
+
 /*
  * NV10-NV40 tiling helpers
  */
@@ -1026,13 +1038,15 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo, int evict, bool intr,
 	struct nouveau_channel *chan = drm->ttm.chan;
 	struct nouveau_cli *cli = (void *)chan->user.client;
 	struct nouveau_fence *fence;
+	bool skip_prime = !evict && nouveau_dmabuf_skip_op(bo, new_mem);
 	int ret;
 
 	/* create temporary vmas for the transfer and attach them to the
 	 * old nvkm_mem node, these will get cleaned up after ttm has
 	 * destroyed the ttm_mem_reg
 	 */
-	if (drm->device.info.family >= NV_DEVICE_INFO_V0_TESLA) {
+	if (drm->device.info.family >= NV_DEVICE_INFO_V0_TESLA &&
+	    !skip_prime) {
 		ret = nouveau_bo_move_prep(drm, bo, new_mem);
 		if (ret)
 			return ret;
@@ -1041,7 +1055,21 @@
 	mutex_lock_nested(&cli->mutex, SINGLE_DEPTH_NESTING);
 	ret = nouveau_fence_sync(nouveau_bo(bo), chan, true, intr);
 	if (ret == 0) {
-		ret = drm->ttm.move(chan, bo, &bo->mem, new_mem);
+		/*
+		 * For prime-imported dmabufs which are page-flipped to the
+		 * display as scanout bo's and thereby pinned into VRAM, we
+		 * need to do a pseudo-move back into GTT memory domain once
+		 * they are replaced by a new scanout bo. This is to enforce an
+		 * update to the new content from dmabuf storage at next flip,
+		 * otherwise we'd display a stale image. The move back into
+		 * GTT goes through most "administrative moves" of a real
+		 * bo move, but we skip the actual copy of the now stale old
+		 * image data from VRAM back to GTT dmabuf backing to save a
+		 * useless copy.
+		 */
+		if (!skip_prime)
+			ret = drm->ttm.move(chan, bo, &bo->mem, new_mem);
+
 		if (ret == 0) {
 			ret = nouveau_fence_new(chan, false, &fence);
 			if (ret == 0) {
@@ -1202,6 +1230,9 @@ nouveau_bo_move_ntfy(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem)
 	if (bo->destroy != nouveau_bo_del_ttm)
 		return;
 
+	if (nouveau_dmabuf_skip_op(bo, new_mem))
+		return;
+
 	list_for_each_entry(vma, &nvbo->vma_list, head) {
 		if (new_mem && new_mem->mem_type != TTM_PL_SYSTEM &&
 		    (new_mem->mem_type == TTM_PL_VRAM ||
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
index e423609..4e415e0 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.h
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.h
@@ -39,6 +39,7 @@ struct nouveau_bo {
 	int pin_refcnt;
 
 	struct ttm_bo_kmap_obj dma_buf_vmap;
+	bool prime_imported;
 };
 
 static inline struct nouveau_bo *
diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c
index afbf557..bb49159 100644
--- a/drivers/gpu/drm/nouveau/nouveau_display.c
+++ b/drivers/gpu/drm/nouveau/nouveau_display.c
@@ -736,6 +736,22 @@ nouveau_crtc_page_flip(struct drm_crtc *crtc, struct drm_framebuffer *fb,
 		return -ENOMEM;
 
 	if (new_bo != old_bo) {
+		/* Is this a scanout buffer from an imported prime dmabuf? */
+		if (new_bo->prime_imported && !new_bo->pin_refcnt) {
+			/*
+			 * Pretend it "moved out" of VRAM, so a fresh copy of
+			 * new dmabuf content from export gpu gets reuploaded
+			 * from GTT backing store when pinning into VRAM.
+			 */
+			DRM_DEBUG_PRIME("Flip to prime imported dmabuf %p\n",
+					new_bo);
+			if (nouveau_bo_pin(new_bo, TTM_PL_FLAG_TT, false))
+				DRM_ERROR("Fail gtt pin imported buf %p\n",
+					  new_bo);
+			else
+				nouveau_bo_unpin(new_bo);
+		}
+
 		ret = nouveau_bo_pin(new_bo, TTM_PL_FLAG_VRAM, true);
 		if (ret)
 			goto fail_free;
@@ -808,6 +824,7 @@ nouveau_crtc_page_flip(struct drm_crtc *crtc, struct drm_framebuffer *fb,
 	ttm_bo_unreserve(&old_bo->bo);
 	if (old_bo != new_bo)
 		nouveau_bo_unpin(old_bo);
+	nouveau_fence_unref(&fence);
 	return 0;
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_prime.c b/drivers/gpu/drm/nouveau/nouveau_prime.c
index a0a9704..2bd76f6 100644
--- a/drivers/gpu/drm/nouveau/nouveau_prime.c
+++ b/drivers/gpu/drm/nouveau/nouveau_prime.c
@@ -75,6 +75,7 @@ struct drm_gem_object *nouveau_gem_prime_import_sg_table(struct drm_device *dev,
 		return ERR_PTR(ret);
 
 	nvbo->valid_domains = NOUVEAU_GEM_DOMAIN_GART;
+	nvbo->prime_imported = true;
 
 	/* Initialize the embedded gem-object. We return a single gem-reference
 	 * to the caller, instead of a normal nouveau_bo ttm reference. */
Scanout bos which are dmabuf-backed in RAM and imported via prime will not update their content with new rendering from the render offload gpu once they've been flipped onto the scanout. The reason is that in preparation of the first flip they get pinned into VRAM, then unpinned at some later point, but they stay in the VRAM memory domain, so updates to the system RAM dmabuf object by the exporting render offload gpu don't lead to updates of the content in VRAM - it becomes stale.

For prime imported dmabufs we solve this by first pinning the bo into GTT, which resets the bo's domain back to GTT, then unpinning again, so the follow-up pinning into VRAM will actually upload an up-to-date display buffer from the dmabuf's GTT backing store.

During the pinning into GTT, we skip the actual data move from VRAM to GTT to avoid a needless bo copy of stale image data.
Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
---
 drivers/gpu/drm/radeon/radeon.h         |  1 +
 drivers/gpu/drm/radeon/radeon_display.c | 28 ++++++++++++++++++++++++++++
 drivers/gpu/drm/radeon/radeon_prime.c   |  1 +
 drivers/gpu/drm/radeon/radeon_ttm.c     | 14 ++++++++++++++
 4 files changed, 44 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 5633ee3..c200e8a 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -508,6 +508,7 @@ struct radeon_bo {
 	struct drm_gem_object		gem_base;
 
 	struct ttm_bo_kmap_obj		dma_buf_vmap;
+	bool				prime_imported;
 	pid_t				pid;
 
 	struct radeon_mn		*mn;
diff --git a/drivers/gpu/drm/radeon/radeon_display.c b/drivers/gpu/drm/radeon/radeon_display.c
index c3206fb..1082267 100644
--- a/drivers/gpu/drm/radeon/radeon_display.c
+++ b/drivers/gpu/drm/radeon/radeon_display.c
@@ -550,6 +550,34 @@ static int radeon_crtc_page_flip(struct drm_crtc *crtc,
 		DRM_ERROR("failed to reserve new rbo buffer before flip\n");
 		goto cleanup;
 	}
+
+	/*
+	 * Repin into GTT in case of imported prime dmabuf,
+	 * then unpin again. Restores source dmabuf location
+	 * to GTT, where the actual dmabuf backing store gets
+	 * updated by the exporting render offload gpu at swap.
+	 */
+	if (new_rbo->prime_imported) {
+		DRM_DEBUG_PRIME("Flip to prime imported dmabuf %p\n", new_rbo);
+
+		r = radeon_bo_pin(new_rbo, RADEON_GEM_DOMAIN_GTT, NULL);
+		if (unlikely(r != 0)) {
+			DRM_ERROR("failed to gtt pin buffer %p before flip\n",
+				  new_rbo);
+		}
+		else {
+			r = radeon_bo_unpin(new_rbo);
+		}
+
+		if (unlikely(r != 0)) {
+			radeon_bo_unreserve(new_rbo);
+			r = -EINVAL;
+			DRM_ERROR("failed to gtt unpin buffer %p before flip\n",
+				  new_rbo);
+			goto cleanup;
+		}
+	}
+
 	/* Only 27 bit offset for legacy CRTC */
 	r = radeon_bo_pin_restricted(new_rbo, RADEON_GEM_DOMAIN_VRAM,
 				     ASIC_IS_AVIVO(rdev) ? 0 : 1 << 27, &base);
diff --git a/drivers/gpu/drm/radeon/radeon_prime.c b/drivers/gpu/drm/radeon/radeon_prime.c
index f3609c9..693c362 100644
--- a/drivers/gpu/drm/radeon/radeon_prime.c
+++ b/drivers/gpu/drm/radeon/radeon_prime.c
@@ -69,6 +69,7 @@ struct drm_gem_object *radeon_gem_prime_import_sg_table(struct drm_device *dev,
 	ww_mutex_lock(&resv->lock, NULL);
 	ret = radeon_bo_create(rdev, attach->dmabuf->size, PAGE_SIZE, false,
 			       RADEON_GEM_DOMAIN_GTT, 0, sg, resv, &bo);
+	bo->prime_imported = true;
 	ww_mutex_unlock(&resv->lock);
 	if (ret)
 		return ERR_PTR(ret);
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 0c00e19..87b3f59 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -256,6 +256,7 @@ static int radeon_move_blit(struct ttm_buffer_object *bo,
 			    struct ttm_mem_reg *old_mem)
 {
 	struct radeon_device *rdev;
+	struct radeon_bo *rbo;
 	uint64_t old_start, new_start;
 	struct radeon_fence *fence;
 	unsigned num_pages;
@@ -296,6 +297,19 @@ static int radeon_move_blit(struct ttm_buffer_object *bo,
 	BUILD_BUG_ON((PAGE_SIZE % RADEON_GPU_PAGE_SIZE) != 0);
 
 	num_pages = new_mem->num_pages * (PAGE_SIZE / RADEON_GPU_PAGE_SIZE);
+
+	/*
+	 * Prime imported dmabuf, previously used as scanout buffer in a page
+	 * flip? If so, skip actual data move back from VRAM into GTT, as this
+	 * would only copy back stale image data.
+	 */
+	rbo = container_of(bo, struct radeon_bo, tbo);
+	if (rbo->prime_imported && old_mem->mem_type == TTM_PL_VRAM &&
+	    new_mem->mem_type == TTM_PL_TT) {
+		DRM_DEBUG_PRIME("Skip for dmabuf back-move %p.\n", rbo);
+		num_pages = 0;
+	}
+
 	fence = radeon_copy(rdev, old_start, new_start, num_pages, bo->resv);
 	if (IS_ERR(fence))
 		return PTR_ERR(fence);
On 17.08.2016 at 18:12, Mario Kleiner wrote:
AMD uses copy swaps because radeon/amdgpu kms can't switch the scanout mode from tiled to linear on the fly during flips.
Well I'm not an expert on this, but as far as I know the bigger problem is that the dedicated AMD hardware generations you are targeting usually can't reliably scan out from system memory without a rather complicated setup.
So that is a complete NAK to the radeon changes.
Regards, Christian.
On 08/17/2016 06:27 PM, Christian König wrote:
AMD uses copy swaps because radeon/amdgpu kms can't switch the scanout mode from tiled to linear on the fly during flips.
Well I'm not an expert on this, but as far as I know the bigger problem is that the dedicated AMD hardware generations you are targeting usually can't reliably scan out from system memory without a rather complicated setup.
So that is a complete NAK to the radeon changes.
Hi Christian,
thanks for the feedback, but I think that's a misunderstanding. The patches don't make them scan out from system memory, they just enforce a fresh copy from RAM/GTT -> VRAM before scanning out a buffer again. I just assume there is a more elegant/clean way than this "fake" pin/unpin to GTT to essentially tell the driver that its current VRAM content is stale and needs a refresh from the up-to-date dmabuf in system RAM.
Btw. I'll be offline for the next few hours, just wanted to get this out now.
thanks, -mario
On 17.08.2016 at 18:35, Mario Kleiner wrote:
On 08/17/2016 06:27 PM, Christian König wrote:
AMD uses copy swaps because radeon/amdgpu kms can't switch the scanout mode from tiled to linear on the fly during flips.
Well I'm not an expert on this, but as far as I know the bigger problem is that the dedicated AMD hardware generations you are targeting usually can't reliably scan out from system memory without a rather complicated setup.
So that is a complete NAK to the radeon changes.
Hi Christian,
thanks for the feedback, but I think that's a misunderstanding. The patches don't make them scan out from system memory, they just enforce a fresh copy from RAM/GTT -> VRAM before scanning out a buffer again. I just assume there is a more elegant/clean way than this "fake" pin/unpin to GTT to essentially tell the driver that its current VRAM content is stale and needs a refresh from the up-to-date dmabuf in system RAM.
I was already wondering how the heck you got that working.
What do you mean by a fresh copy from GTT to VRAM? A buffer exported by DMA-buf should never move as long as it is exported, same for a buffer pinned to VRAM.
So using a DMA-buf for scanout is impossible and actually not valuable, cause it shouldn't matter if we copy from GTT to VRAM because of a buffer migration or because of a copy triggered by the DDX.
What are you actually trying to do here?
Regards, Christian.
On 08/17/2016 07:02 PM, Christian König wrote:
On 17.08.2016 at 18:35, Mario Kleiner wrote:
On 08/17/2016 06:27 PM, Christian König wrote:
AMD uses copy swaps because radeon/amdgpu kms can't switch the scanout mode from tiled to linear on the fly during flips.
Well I'm not an expert on this, but as far as I know the bigger problem is that the dedicated AMD hardware generations you are targeting usually can't reliably scan out from system memory without a rather complicated setup.
So that is a complete NAK to the radeon changes.
Hi Christian,
thanks for the feedback, but I think that's a misunderstanding. The patches don't make them scan out from system memory, they just enforce a fresh copy from RAM/GTT -> VRAM before scanning out a buffer again. I just assume there is a more elegant/clean way than this "fake" pin/unpin to GTT to essentially tell the driver that its current VRAM content is stale and needs a refresh from the up-to-date dmabuf in system RAM.
I was already wondering how the heck you got that working.
What do you mean by a fresh copy from GTT to VRAM? A buffer exported by DMA-buf should never move as long as it is exported, same for a buffer pinned to VRAM.
Under DRI3/Present, the way it is currently implemented in the X-Server and Mesa, the display gpu (= normally integrated one) is importing the dma-buf that was exported by the render offload gpu. So the actual dmabuf doesn't move, but just stays where it is in system RAM.
Afaiu the prime importing display gpu generates its own gem buffer handle (prime_fd_to_handle) from that dmabuf, importing scatter-gather tables to access the dmabuf in system ram. As far as page flipping is concerned, so far those gem buffers / radeon_bo's aren't treated any differently than native ones. During pageflip setup they get pinned into VRAM, which moves (=copies) their content from the RAM dmabuf backing store into VRAM. Then they get flipped and scanned out as usual. The disconnect happens when such a buffer gets flipped off the scanout (and unpinned) and later on page-flipped to the scanout again. Now the driver just reuses the bo that still likely resides in VRAM (although not pinned anymore) and forgets that it was associated with some dmabuf backing in RAM which may have updated visual content. So the exporting render offload gpu happily renders new frames into the dmabuf in ram, while radeon kms happily displays stale frames from its own copy in VRAM.
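In libdrm terms the importing/flipping side does roughly the following (sketch only; error handling omitted, crtc id and buffer geometry assumed to be known):

#include <stdint.h>
#include <stddef.h>
#include <xf86drm.h>
#include <xf86drmMode.h>
#include <drm_fourcc.h>

int flip_to_dmabuf(int drm_fd, int dmabuf_fd, uint32_t crtc_id,
                   uint32_t width, uint32_t height, uint32_t pitch)
{
	uint32_t handle, fb_id;
	uint32_t handles[4] = { 0 }, pitches[4] = { 0 }, offsets[4] = { 0 };

	/* dmabuf fd -> gem handle on the importing display device. */
	drmPrimeFDToHandle(drm_fd, dmabuf_fd, &handle);

	/* Wrap the imported bo in a kms framebuffer. */
	handles[0] = handle;
	pitches[0] = pitch;
	drmModeAddFB2(drm_fd, width, height, DRM_FORMAT_XRGB8888,
	              handles, pitches, offsets, &fb_id, 0);

	/* The pageflip ioctl is where the driver pins the bo into VRAM,
	 * and where the refresh from the GTT backing store gets skipped
	 * on repeated flips to the same bo. */
	return drmModePageFlip(drm_fd, crtc_id, fb_id,
	                       DRM_MODE_PAGE_FLIP_EVENT, NULL);
}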
So using a DMA-buf for scanout is impossible and actually not valuable, cause it shouldn't matter if we copy from GTT to VRAM because of a buffer migration or because of a copy triggered by the DDX.
What are you actually trying to do here?
Make a typical Enduro laptop with an AMD iGPU + AMD dGPU work under DRI3/Present, without tearing and other ugliness, e.g.,
DRI_PRIME=1 glxgears -fullscreen
-> discrete gpu renders, integrated gpu displays the rendered frames.
Currently the drivers use copies for handling the PresentPixmap requests, which sort of works in showing the right pictures, but gives bad tearing and undefined timing. With copies we are too slow to keep ahead of the scanout and Present doesn't even guarantee that the copy starts vsync'ed. So at all levels, from delays in the x-server, mesa's way of doing things, command submission and the hw itself, we end up blitting in the middle of scanout. And the presentation timing isn't ever trustworthy for timing-sensitive applications unless we present via page flipping.
The hack in my patch tricks the driver into migrating the bo back to GTT (skipping the actual pointless data copy though) and then back into VRAM to force a copy of fresh content from the imported dmabuf into VRAM, so page flipping flips up-to-date content onto the scanout.
-mario
On 18.08.2016 at 01:29, Mario Kleiner wrote:
Afaiu the prime importing display gpu generates its own gem buffer handle (prime_fd_to_handle) from that dmabuf, importing scatter-gather tables to access the dmabuf in system ram. As far as page flipping is concerned, so far those gem buffers / radeon_bo's aren't treated any differently than native ones. During pageflip setup they get pinned into VRAM, which moves (=copies) their content from the RAM dmabuf backing store into VRAM.
Your understanding isn't correct. Buffers imported using prime always stay in GTT, they can't be moved to VRAM.
It's the DDX which copies the buffer content from the imported prime handle into a native one which is enabled to scan out.
Regards, Christian.
On 18/08/16 04:41 PM, Christian König wrote:
Afaiu the prime importing display gpu generates its own gem buffer handle (prime_fd_to_handle) from that dmabuf, importing scatter-gather tables to access the dmabuf in system ram. As far as page flipping is concerned, so far those gem buffers / radeon_bo's aren't treated any differently than native ones. During pageflip setup they get pinned into VRAM, which moves (=copies) their content from the RAM dmabuf backing store into VRAM.
Your understanding isn't correct. Buffers imported using prime always stay in GTT, they can't be moved to VRAM.
That's the theory, but based on Mario's description it's clear that there is at least one bug which either actually allows a shared buffer to be moved to VRAM, or at least doesn't propagate the error correctly, so the page flip operation "succeeds".
It's the DDX which copies the buffer content from the imported prime handle into a native one which is enabled to scan out.
There is no such code which could explain what Mario is seeing.
On 18.08.2016 at 09:52, Michel Dänzer wrote:
On 18/08/16 04:41 PM, Christian König wrote:
Afaiu the prime importing display gpu generates its own gem buffer handle (prime_fd_to_handle) from that dmabuf, importing scatter-gather tables to access the dmabuf in system ram. As far as page flipping is concerned, so far those gem buffers / radeon_bo's aren't treated any differently than native ones. During pageflip setup they get pinned into VRAM, which moves (=copies) their content from the RAM dmabuf backing store into VRAM.
Your understanding isn't correct. Buffers imported using prime always stay in GTT, they can't be moved to VRAM.
That's the theory, but based on Mario's description it's clear that there is at least one bug which either actually allows a shared buffer to be moved to VRAM, or at least doesn't propagate the error correctly, so the page flip operation "succeeds".
It's the DDX which copies the buffer content from the imported prime handle into a native one which is enabled to scan out.
There is no such code which could explain what Mario is seeing.
How should this work then otherwise?
I agree that I don't understand fully either what is happening here, but I find it quite unlikely that we actually scan out from system memory without the proper hardware setup.
On the other hand, that we accidentally move a prime imported buffer to VRAM could be possible, but this would clearly be a rather severe bug which we hopefully would have noticed already.
Any other idea what actually happens here?
Regards, Christian.
On 18/08/16 05:20 PM, Christian König wrote:
On 18.08.2016 at 09:52, Michel Dänzer wrote:
On 18/08/16 04:41 PM, Christian König wrote:
Afaiu the prime importing display gpu generates its own gem buffer handle (prime_fd_to_handle) from that dmabuf, importing scatter-gather tables to access the dmabuf in system ram. As far as page flipping is concerned, so far those gem buffers / radeon_bo's aren't treated any differently than native ones. During pageflip setup they get pinned into VRAM, which moves (=copies) their content from the RAM dmabuf backing store into VRAM.
Your understanding isn't correct. Buffers imported using prime always stay in GTT, they can't be moved to VRAM.
That's the theory, but based on Mario's description it's clear that there is at least one bug which either actually allows a shared buffer to be moved to VRAM, or at least doesn't propagate the error correctly, so the page flip operation "succeeds".
It's the DDX which copies the buffer content from the imported prime handle into a native one which is enabled to scan out.
There is no such code which could explain what Mario is seeing.
How should this work then otherwise?
[...]
On the other hand, that we accidentally move a prime imported buffer to VRAM could be possible, but this would clearly be a rather severe bug which we hopefully would have noticed already.
That's what seems to be happening, based on Mario's description and patches.
On Wed, Aug 17, 2016 at 12:35 PM, Mario Kleiner <mario.kleiner.de@gmail.com> wrote:
On 08/17/2016 06:27 PM, Christian König wrote:
AMD uses copy swaps because radeon/amdgpu kms can't switch the scanout mode from tiled to linear on the fly during flips.
Well I'm not an expert on this, but as far as I know the bigger problem is that the dedicated AMD hardware generations you are targeting usually can't reliably scan out from system memory without a rather complicated setup.
So that is a complete NAK to the radeon changes.
Hi Christian,
thanks for the feedback, but I think that's a misunderstanding. The patches don't make them scan out from system memory, they just enforce a fresh copy from RAM/GTT -> VRAM before scanning out a buffer again. I just assume there is a more elegant/clean way than this "fake" pin/unpin to GTT to essentially tell the driver that its current VRAM content is stale and needs a refresh from the up-to-date dmabuf in system RAM.
I think the ddx should handle the copy rather than the kernel. That also takes care of the tiling. I.e., copy from the linear shared buffer in system memory to the tiled scanout buffer in vram. The ddx should also be able to take damage into account and only copy the delta. From a bandwidth perspective, I'm not sure how much sense pageflipping makes since there are so many copies already.
Alex
On 08/17/2016 07:43 PM, Alex Deucher wrote:
On Wed, Aug 17, 2016 at 12:35 PM, Mario Kleiner <mario.kleiner.de@gmail.com> wrote:
On 08/17/2016 06:27 PM, Christian König wrote:
AMD uses copy swaps because radeon/amdgpu kms can't switch the scanout mode from tiled to linear on the fly during flips.
Well I'm not an expert on this, but as far as I know the bigger problem is that the dedicated AMD hardware generations you are targeting usually can't reliably scan out from system memory without a rather complicated setup.
So that is a complete NAK to the radeon changes.
Hi Christian,
thanks for the feedback, but I think that's a misunderstanding. The patches don't make them scan out from system memory, they just enforce a fresh copy from RAM/GTT -> VRAM before scanning out a buffer again. I just assume there is a more elegant/clean way than this "fake" pin/unpin to GTT to essentially tell the driver that its current VRAM content is stale and needs a refresh from the up-to-date dmabuf in system RAM.
I think the ddx should handle the copy rather than the kernel. That also takes care of the tiling. I.e., copy from the linear shared buffer in system memory to the tiled scanout buffer in vram. The ddx should also be able to take damage into account and only copy the delta. From a bandwidth perspective, I'm not sure how much sense pageflipping makes since there are so many copies already.
Alex
That's what the ati-ddx/amdgpu-ddx does at the moment, as it detects the mismatch in tiling flags and uses the DRI3/Present copy path instead of the pageflip path. The problem is that the server's Present implementation doesn't request a vsync'ed start of the copy operation and the whole procedure is too slow to keep ahead of the scanout, so it tears pretty badly for many animations. Also no page flipping = no reliable timestamps. And the modesetting ddx doesn't handle it at all, as it doesn't know about the tiling mismatch.
You are right, going through page flipping doesn't save any bandwidth, may even use more without damage handling, but it prevents tearing and undefined presentation timing.
So it sounds as if the bug is not that page flipping doesn't quite work without my hack, but that I even managed to get this far?
There is this other approach from NVidia's Alex Goins for their proprietary driver, whose patches landed in the X-Server 1.19 master branch a couple of weeks ago. I haven't read his patches in detail yet, and I so far couldn't successfully test them with the reference implementation in modesetting ddx 1.19. Afaik there the display gpu exports a pair of scanout friendly, page flipping compatible dmabufs (I assume linear, contiguous, accessible by the display engines), and the offload gpu imports those and renders into them. That saves one extra copy, so should be somewhat more efficient.
Setting it up seems to be more involved and less flexible though. So far I couldn't make it work here for testing. Maybe bugs, maybe mistakes on my side, maybe I just have the wrong hardware for it. Need to read the patches first in detail to understand how it is supposed to work.
-mario
On 18/08/16 08:51 AM, Mario Kleiner wrote:
That's what the ati-ddx/amdgpu-ddx does at the moment, as it detects the mismatch in tiling flags and uses the DRI3/Present copy path instead of the pageflip path. The problem is that the server's Present implementation doesn't request a vsync'ed start of the copy operation [...]
It waits for vblank before starting the copy.
There is this other approach from NVidia's Alex Goins for their proprietary driver, whose patches landed in the X-Server 1.19 master branch a couple of weeks ago. I haven't read his patches in detail yet, and i so far couldn't successfully test them with the reference implementation in modesetting ddx 1.19. Afaik there the display gpu exports a pair of scanout friendly, page flipping compatible dmabufs (i assume linear, contiguous, accessible by the display engines),
FWIW, that wouldn't be possible with our "older" GPUs which can't scan out from GTT: A BO can be either shared with another GPU or scanout friendly, not both at the same time.
and the offload gpu imports those and renders into them. That saves one extra copy, so should be somewhat more efficient.
Using two shared buffers actually isn't as efficient as possible wrt inter-GPU bandwidth.
Setting it up seems to be more involved and less flexible though. So far i couldn't make it work here for testing. Maybe bugs, maybe mistakes on my side, maybe i just have the wrong hardware for it.
Yeah, my impression has been it's a rather complicated solution geared towards the Intel iGPU + proprietary nVidia use case.
Am 18.08.2016 um 04:32 schrieb Michel Dänzer:
On 18/08/16 08:51 AM, Mario Kleiner wrote:
There is this other approach from NVidia's Alex Goins for their proprietary driver, whose patches landed in the X-Server 1.19 master branch a couple of weeks ago. I haven't read his patches in detail yet, and i so far couldn't successfully test them with the reference implementation in modesetting ddx 1.19. Afaik there the display gpu exports a pair of scanout friendly, page flipping compatible dmabufs (i assume linear, contiguous, accessible by the display engines),
FWIW, that wouldn't be possible with our "older" GPUs which can't scan out from GTT: A BO can be either shared with another GPU or scanout friendly, not both at the same time.
And even for newer GPUs it is quite complicated to set up.
As far as I understood it you need to make sure of at least the following:
1. A whole line buffer is contiguous. E.g., if you want to scan out 1920x1080 32bpp without tiling, each line needs 1920*4=7680 bytes of linearly contiguous memory, so the GTT buffer needs to be specially allocated.
2. You can't use multi-level page tables for the system domain (we already do this).
3. The MC needs to guarantee enough PCIe bandwidth for the CRTC. That means reprogramming some priorities in the MC differently, which can only be done while the whole GPU is idle, and for which we haven't released documentation at all.
But keep in mind that this is only *AFAIK*, based on a document about how the DCE works that I read quite a while ago.
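To put rough numbers on points 1 and 3 with the 1920x1080 example (plain arithmetic, nothing driver specific):

    #include <stdio.h>

    int main(void)
    {
        const unsigned width = 1920, height = 1080, bpp = 32, hz = 60;
        unsigned pitch = width * (bpp / 8);                  /* 7680 bytes per line */
        unsigned long frame = (unsigned long)pitch * height; /* ~7.9 MiB per frame */
        double stream = (double)frame * hz / (1024 * 1024);  /* MiB/s */

        printf("pitch: %u bytes, each line one contiguous run\n", pitch);
        printf("frame: %lu bytes\n", frame);
        printf("sustained CRTC read stream over PCIe: %.1f MiB/s\n", stream);
        return 0;
    }

That ~475 MiB/s has to be delivered continuously, since any underflow is visible on screen, which is why the MC priorities matter.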
Regards, Christian.
On 08/18/2016 04:32 AM, Michel Dänzer wrote:
On 18/08/16 08:51 AM, Mario Kleiner wrote:
That's what the ati-ddx/amdgpu-ddx does at the moment, as it detects the mismatch in tiling flags and uses the DRI3/Present copy path instead of the pageflip path. The problem is that the server's Present implementation doesn't request a vsync'ed start of the copy operation [...]
It waits for vblank before starting the copy.
Yes, a vblank event triggers present_execute in the server. But all the latency from vblank event dispatch to the copy command packet hitting the gpu is still way too high to avoid tearing. I tried again and couldn't find a single intel/amd/nvidia gpu here that doesn't tear more or less badly, depending on load, with DRI3/Present copyswaps. Even TearFree wouldn't be good enough for my kind of applications, as crucial timing/timestamps could still frequently be off by at least 1 frame.
There is this other approach from NVidia's Alex Goins for their proprietary driver, whose patches landed in the X-Server 1.19 master branch a couple of weeks ago. I haven't read his patches in detail yet, and i so far couldn't successfully test them with the reference implementation in modesetting ddx 1.19. Afaik there the display gpu exports a pair of scanout friendly, page flipping compatible dmabufs (i assume linear, contiguous, accessible by the display engines),
FWIW, that wouldn't be possible with our "older" GPUs which can't scan out from GTT: A BO can be either shared with another GPU or scanout friendly, not both at the same time.
Ok, good to know.
and the offload gpu imports those and renders into them. That saves one extra copy, so should be somewhat more efficient.
Using two shared buffers actually isn't as efficient as possible wrt inter-GPU bandwidth.
Out of interest, why? You'd have only one detiling copy VRAM -> RAM? Or is it about switching some kind of GTT mappings with two buffers that is inefficient?
Setting it up seems to be more involved and less flexible though. So far i couldn't make it work here for testing. Maybe bugs, maybe mistakes on my side, maybe i just have the wrong hardware for it.
Yeah, my impression has been it's a rather complicated solution geared towards the Intel iGPU + proprietary nVidia use case.
Setting up output source/output sink is not fun, as i've now learned; it's rather clumsy and complex compared to render offload. I hope the real thing will come with some fool-proof one-click setup GUI, otherwise i don't have great hopes, given the technical skill level of my users. I still didn't manage to get it working, not even with the new NVidia proprietary beta drivers on a real Optimus laptop.
-mario
On 27/08/16 05:07 AM, Mario Kleiner wrote:
On 08/18/2016 04:32 AM, Michel Dänzer wrote:
On 18/08/16 08:51 AM, Mario Kleiner wrote:
and the offload gpu imports those and renders into them. That saves one extra copy, so should be somewhat more efficient.
Using two shared buffers actually isn't as efficient as possible wrt inter-GPU bandwidth.
Out of interest, why? You'd have only one detiling copy VRAM -> RAM?
Yeah, that's basically it. With a single shared buffer, only the parts which have changed since last time need to be copied between the GPUs; the slave GPU can copy the other changed parts from its other local scanout pixmap (with TearFree enabled; note that this isn't quite implemented yet in our drivers for slave output, but I'm planning to do it soon). With two shared pixmaps, some changed parts have to be copied between GPUs several times.
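Schematically, with hypothetical helper names standing in for the DDX's real blit paths:

    /* Single-shared-buffer update: only damaged rectangles cross the
     * PCIe bus; everything else is recycled from the slave GPU's
     * previous local scanout pixmap (TearFree case). */
    struct rect { int x1, y1, x2, y2; };

    /* Hypothetical copy helpers. */
    void copy_rect_local(void *src, void *dst, const struct rect *r);
    void copy_rect_from_shared(void *src, void *dst, const struct rect *r);

    void update_scanout_pixmap(const struct rect *damage, int ndamage,
                               const struct rect *screen,
                               void *shared_buf,   /* dmabuf in system RAM */
                               void *prev_scanout, /* slave GPU VRAM */
                               void *next_scanout) /* slave GPU VRAM */
    {
        int i;

        /* Start from the previous frame: cheap local VRAM -> VRAM copy. */
        copy_rect_local(prev_scanout, next_scanout, screen);

        /* Pull only the damaged parts across the bus. */
        for (i = 0; i < ndamage; i++)
            copy_rect_from_shared(shared_buf, next_scanout, &damage[i]);
    }

With two shared buffers there is no such local source, so regions damaged in consecutive frames have to cross the bus once per shared buffer.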
On 18/08/16 01:12 AM, Mario Kleiner wrote:
Intel as display gpu + nouveau for render offload worked nicely on intel-ddx with page flipping, proper timing, dmabuf fence sync and all.
How about with AMD instead of nouveau in this case?
Turns out that prime + page flipping currently doesn't work on nouveau and amd. The first offload rendered images from the imported dmabufs show up properly, but then the display is stuck alternating between the first two or three rendered frames.
The problem is that during the pageflip ioctl we pin the dmabuf into VRAM in preparation for scanout, then unpin it when we are done with it at next flip, but the buffer stays in the VRAM memory domain.
Sounds like you found a bug here: BOs which are being shared between different GPUs should always be pinned to GTT, moving them to VRAM (and consequently the page flip) should fail.
The latest versions of DCE support scanning out from GTT, so that might be a good solution at least for Carrizo and newer APUs, not sure it makes sense for dGPUs though.
AMD, as tested with dual Radeon HD-5770 seems to be fast as prime importer/display gpu, but very slow as prime exporter/render offload, e.g., taking 16 msecs to get a 1920x1080 framebuffer into RAM. Seems that Mesa's blitImage function is the slow bit here. On r600 it seems to draw a textured triangle strip to detile the gpu renderbuffer and copy it into GTT. As drawing a textured fullscreen quad is normally much faster, something special seems to be going on there wrt. DMA?
Maybe the rasterization as two triangles results in bad PCIe bandwidth utilization. Using the asynchronous DMA engine for these transfers would probably be ideal, but having the 3D engine rasterize a single rectangle (either using the rectangle primitive or a large triangle with scissor) might already help.
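For reference, the "large triangle with scissor" variant is the usual fullscreen-triangle trick, roughly like this (GL-flavoured sketch with shader/attribute setup omitted; in reality the change would go into Mesa's blit code, not an application):

    #include <GL/gl.h>

    /* One oversized triangle whose scissored footprint covers the whole
     * target; the rasterizer sweeps the rectangle in one go, with no
     * diagonal seam between two triangles. Assumed bound as the
     * position attribute. */
    static const GLfloat fs_tri[] = {
        -1.0f, -1.0f,
         3.0f, -1.0f,
        -1.0f,  3.0f,
    };

    void blit_one_rect(int w, int h)
    {
        glEnable(GL_SCISSOR_TEST);
        glScissor(0, 0, w, h);
        glDrawArrays(GL_TRIANGLES, 0, 3);
        glDisable(GL_SCISSOR_TEST);
    }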
On Thu, Aug 18, 2016 at 4:23 AM, Michel Dänzer michel@daenzer.net wrote:
Maybe the rasterization as two triangles results in bad PCIe bandwidth utilization. Using the asynchronous DMA engine for these transfers would probably be ideal, but having the 3D engine rasterize a single rectangle (either using the rectangle primitive or a large triangle with scissor) might already help.
There is only one thing that's bad for PCIe when the surface is linear: the 3D engine. Disabling all but the first shader engine and all but the first 2 RBs should improve performance for blits from VRAM to GTT. The closed driver does that, but I don't remember if the destination must be linear, must be in GTT, or both. In any case, SDMA should still be the best for VRAM->GTT blits.
Marek
On 08/18/2016 09:21 PM, Marek Olšák wrote:
[...]
There is only one thing that's bad for PCIe when the surface is linear: the 3D engine. Disabling all but the first shader engine and all but the first 2 RBs should improve performance for blits from VRAM to GTT. The closed driver does that, but I don't remember if the destination must be linear, must be in GTT, or both. In any case, SDMA should still be the best for VRAM->GTT blits.
Marek
Friday evening education question:
So if you have multiple render backends active they compete for PCIe bus access and some kind of "thrashing" happens in the arbitration, drastically reducing the bandwidth?
thanks, -mario
On Fri, Aug 26, 2016 at 4:10 PM, Mario Kleiner mario.kleiner.de@gmail.com wrote:
On 08/18/2016 09:21 PM, Marek Olšák wrote:
[...]
Friday evening education question:
So if you have multiple render backends active they compete for PCIe bus access and some kind of "thrashing" happens in the arbitration, drastically reducing the bandwidth?
I think it has more to do with the access patterns. The requests can't be scheduled as efficiently compared to contiguous linear accesses.
Alex
To pick this up again after a week of manic testing :)
On 08/18/2016 04:23 AM, Michel Dänzer wrote:
On 18/08/16 01:12 AM, Mario Kleiner wrote:
Intel as display gpu + nouveau for render offload worked nicely on intel-ddx with page flipping, proper timing, dmabuf fence sync and all.
How about with AMD instead of nouveau in this case?
I don't have any real AMD Enduro laptop with either Intel + AMD or AMD + AMD atm., so i tested with my hacked up setups, but there things look very good:
a) A standard PC with Intel Haswell + AMD Tonga Pro R9 380. Seems to work correctly, page-flipping used, no visual artifacts or other problems, my measurement equipment also shows perfect timing and no glitches. Performance is very good, even without Marek's recent SDMA + PRIME patch series. It seems though that with his patches one of the many criteria for using that path isn't satisfied, so it takes a fallback path on my machine.
One thing that confuses me so far is that visual results and measurement suggest it works nicely, properly serializing the rendering/detiling blit and the pageflip. But when i ftrace the Intel driver's reservation_object_wait_timeout_rcu() call, where it normally waits for the dmabuf fence to complete, i never see it blocking for more than a few dozen microseconds, and i couldn't find any other place where it blocks on detiling blit completion yet. IOW, it seems to work correctly in practice, but i don't know where it actually blocks. Could also be that the flip work func in Intel's driver just executes after the detiling blit has already completed.
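For reference, this is the call i'm tracing on the display side; it's the standard 4.x kernel API, only the timeout value here is made up:

    #include <linux/reservation.h>
    #include <linux/jiffies.h>
    #include <linux/errno.h>

    /* Wait for all fences (shared + exclusive) attached to the dmabuf's
     * reservation object, i.e. for the exporter's detiling blit to
     * finish. Returns 0 on success, -ETIMEDOUT or another negative
     * error code on failure. */
    static int wait_for_dmabuf_fences(struct reservation_object *resv)
    {
        long lret;

        lret = reservation_object_wait_timeout_rcu(resv,
                                                   true,  /* wait_all */
                                                   false, /* !intr */
                                                   msecs_to_jiffies(5000));
        if (lret == 0)
            return -ETIMEDOUT;
        return lret < 0 ? (int)lret : 0;
    }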
b) A MacPro with dual Radeon HD-5770 and NVidia GeForce, and my pageflip hacks applied. I ported Marek's Mesa SDMA patch to r600, and with that i get very good performance for AMD Evergreen as renderoffload gpu both for the NVidia + AMD and AMD + AMD combo. So this solved the performance problems on the older gpus. I assume Intel + old radeon-kms would just behave equally well. So thanks Marek, that was perfect!
I guess that means we are really good now wrt. renderoffload whenever an Intel iGPU is used for display, regardless of whether nouveau or AMD is used as dGPU :)
Turns out that prime + page flipping currently doesn't work on nouveau and amd. The first offload rendered images from the imported dmabufs show up properly, but then the display is stuck alternating between the first two or three rendered frames.
The problem is that during the pageflip ioctl we pin the dmabuf into VRAM in preparation for scanout, then unpin it when we are done with it at next flip, but the buffer stays in the VRAM memory domain.
Sounds like you found a bug here: BOs which are being shared between different GPUs should always be pinned to GTT, moving them to VRAM (and consequently the page flip) should fail.
Seems so, although i hoped i was fixing a bug, not exploiting a loophole. In practice i haven't observed trouble with the hack so far. I haven't looked deeply enough into how the dma api below dmabuf operates, so this is just guesswork, but i suspect the reason this doesn't blow up in an obvious way is that when the render offload gpu exports the dmabuf, the pages get pinned/locked into system RAM, so they can't move around or get paged out to swap as long as the dmabuf stays exported. When the dmabuf-importing AMD or nouveau display gpu then moves the bo from GTT to VRAM (or pseudo-moves it back with my hack), all that changes is some pin refcount for the RAM pages, but the refcount always stays non-zero and the system RAM isn't freed or moved around during the session. I just wonder if this bug couldn't somehow be turned into a proper feature?
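My mental model of the importer side, in terms of the standard dma-buf API (a sketch; the surrounding driver context is abbreviated):

    #include <linux/dma-buf.h>
    #include <linux/dma-direction.h>
    #include <linux/err.h>

    /* dma_buf_map_attachment() makes the exporter pin the backing pages
     * in system RAM and hand back an sg_table; the pages stay pinned
     * until dma_buf_unmap_attachment()/dma_buf_detach(), no matter what
     * the importer's TTM bo does with its own placement. */
    static struct sg_table *import_pages(struct dma_buf *buf,
                                         struct device *importer_dev,
                                         struct dma_buf_attachment **attach)
    {
        *attach = dma_buf_attach(buf, importer_dev);
        if (IS_ERR(*attach))
            return ERR_CAST(*attach);

        return dma_buf_map_attachment(*attach, DMA_BIDIRECTIONAL);
    }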
I'm tempted to keep my patches as a temporary stopgap measure in some kernel on GitHub, so my users could use them to get NVidia+NVidia or at least old AMD+AMD setups with radeon-kms + ati-ddx working well enough for their research work until some proper solution comes around. But if you think there is some major way this could blow up, corrupt data, or hang/crash during normal use, then better not. I don't know how many of my users have such systems, as my advice to them so far was to "stay the hell away from anything with hybrid graphics/Optimus/Enduro in its name if they value their work". Now i could change my purchase advice to "anything hybrid with an Intel iGPU is probably ok in terms of correctness/timing/performance for not too demanding performance needs".
The latest versions of DCE support scanning out from GTT, so that might be a good solution at least for Carrizo and newer APUs, not sure it makes sense for dGPUs though.
That would be good to have. But that means DCE-11 or later only? What is the constraint on older parts? Does it need contiguous memory? I personally don't care about the dGPU case, i only use these dGPUs for testing because i don't have access to any real Enduro laptops with APUs.
-mario
On 27/08/16 04:57 AM, Mario Kleiner wrote:
On 08/18/2016 04:23 AM, Michel Dänzer wrote:
On 18/08/16 01:12 AM, Mario Kleiner wrote:
One thing that confuses me so far is that visual results and measurement suggest it works nicely, properly serializing the rendering/detiling blit and the pageflip. But when i ftrace the Intel driver's reservation_object_wait_timeout_rcu() call, where it normally waits for the dmabuf fence to complete, i never see it blocking for more than a few dozen microseconds, and i couldn't find any other place where it blocks on detiling blit completion yet. IOW, it seems to work correctly in practice, but i don't know where it actually blocks.
It actually doesn't work correctly in all cases yet: https://bugs.freedesktop.org/show_bug.cgi?id=95472
Turns out that prime + page flipping currently doesn't work on nouveau and amd. The first offload rendered images from the imported dmabufs show up properly, but then the display is stuck alternating between the first two or three rendered frames.
The problem is that during the pageflip ioctl we pin the dmabuf into VRAM in preparation for scanout, then unpin it when we are done with it at next flip, but the buffer stays in the VRAM memory domain.
Sounds like you found a bug here: BOs which are being shared between different GPUs should always be pinned to GTT, moving them to VRAM (and consequently the page flip) should fail.
Seems so, although i hoped i was fixing a bug, not exploiting a loophole. In practice i haven't observed trouble with the hack so far. I haven't looked deeply enough into how the dma api below dmabuf operates, so this is just guesswork, but i suspect the reason this doesn't blow up in an obvious way is that when the render offload gpu exports the dmabuf, the pages get pinned/locked into system RAM, so they can't move around or get paged out to swap as long as the dmabuf stays exported. When the dmabuf-importing AMD or nouveau display gpu then moves the bo from GTT to VRAM (or pseudo-moves it back with my hack), all that changes is some pin refcount for the RAM pages, but the refcount always stays non-zero and the system RAM isn't freed or moved around during the session. I just wonder if this bug couldn't somehow be turned into a proper feature?
I'm afraid not; BOs which are being shared between devices are supposed to be pinned to GTT, and pinned BOs aren't supposed to move.
However, something similar to your patches could be done in the DDX drivers, using the dedicated scanout pixmap mechanism.
The latest versions of DCE support scanning out from GTT, so that might be a good solution at least for Carrizo and newer APUs, not sure it makes sense for dGPUs though.
That would be good to have. But that means DCE-11 or later only? What is the constraint on older parts? Does it need contiguous memory?
Presumably. Anyway, from Christian's description it sounds like it'll be tricky to get this working even with current APUs. :(
-----Original Message-----
From: Michel Dänzer [mailto:michel@daenzer.net]
Sent: Sunday, August 28, 2016 11:17 PM
To: Mario Kleiner
Cc: dri-devel@lists.freedesktop.org; jglisse@redhat.com; bskeggs@redhat.com; Deucher, Alexander; airlied@redhat.com
Subject: Re: "Fixes" for page flipping under PRIME on AMD & nouveau
[...]
The latest versions of DCE support scanning out from GTT, so that might be a good solution at least for Carrizo and newer APUs, not sure it makes sense for dGPUs though.
That would be good to have. But that means DCE-11 or later only? What is the constraint on older parts? Does it need contiguous memory?
Presumably. Anyway, from Christian's description it sounds like it'll be tricky to get this working even with current APUs. :(
It only works for DCE11 APUs (not dGPUs) using single-level page tables for GART, and it has fairly strict alignment requirements. The watermark setup and bandwidth management also have much stricter requirements. I think DAL has most of what is needed in place on the display side, assuming the rest of the stack provides a buffer with the right alignment.
Alex