[RFC PATCH 0/5] Core TTM changes for i915 TTM enabling

List overview All Threads
Download

newer

older

[PULL] drm-intel-fixes

[PATCH 00/34] Rid W=1 warnings...

Thomas Hellström

20 May 2021 20 May '21

3:09 p.m.

This is mainly a pre-check that the core TTM changes for the initial i915 TTM patch series look reasonably ok.

Main thing is we add the new page-based iomem memcpy util to TTM, and for some speed the copy-from-wc-x86-only prefetching memcpy to core drm. Note that the legacy memcpy path is largely untested. Perhaps can give it some testing on vmwgfx.

A bugfix and some minor optimization for the ttm_bo_pipeline_gutting() idle case

Finally allow the frequently-pinning i915 driver to block swapping of pinned memory that is still on the LRU.

If OK, I'd like to include these as a part of the i915 series.

Cc: Christian König christian.koenig@amd.com Cc: Dave Airlie airlied@gmail.com Cc: Daniel Vetter daniel.vetter@ffwll.ch

Thomas Hellström (5): drm/ttm: Add a generic TTM memcpy move for page-based iomem drm, drm/i915: Move the memcpy_from_wc functionality to core drm drm/ttm: Use drm_memcpy_from_wc for TTM bo moves drm/ttm: Document and optimize ttm_bo_pipeline_gutting() drm/ttm, drm/amdgpu: Allow the driver some control over swapping

drivers/gpu/drm/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 4 + drivers/gpu/drm/drm_drv.c | 2 + .../drm/{i915/i915_memcpy.c => drm_memcpy.c} | 31 +- drivers/gpu/drm/i915/Makefile | 1 - .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 4 +- drivers/gpu/drm/i915/gem/i915_gem_object.c | 5 +- drivers/gpu/drm/i915/gt/selftest_reset.c | 7 +- drivers/gpu/drm/i915/gt/uc/intel_guc_log.c | 11 +- drivers/gpu/drm/i915/i915_cmd_parser.c | 4 +- drivers/gpu/drm/i915/i915_drv.c | 2 - drivers/gpu/drm/i915/i915_gpu_error.c | 8 +- drivers/gpu/drm/i915/i915_memcpy.h | 34 -- .../drm/i915/selftests/intel_memory_region.c | 7 +- drivers/gpu/drm/ttm/ttm_bo.c | 61 +- drivers/gpu/drm/ttm/ttm_bo_util.c | 547 ++++++++++++------ drivers/gpu/drm/ttm/ttm_tt.c | 9 + include/drm/drm_memcpy.h | 41 ++ include/drm/ttm/ttm_bo_driver.h | 94 +++ include/drm/ttm/ttm_tt.h | 10 + 20 files changed, 614 insertions(+), 270 deletions(-) rename drivers/gpu/drm/{i915/i915_memcpy.c => drm_memcpy.c} (84%) delete mode 100644 drivers/gpu/drm/i915/i915_memcpy.h create mode 100644 include/drm/drm_memcpy.h

-- 2.31.1

Show replies by date

Thomas Hellström

20 May 20 May

3:09 p.m.

New subject: [RFC PATCH 1/5] drm/ttm: Add a generic TTM memcpy move for page-based iomem

The internal ttm_bo_util memcpy uses ioremap functionality, and while it probably might be possible to use it for copying in- and out of sglist represented io memory, using io_mem_reserve() / io_mem_free() callbacks, that would cause problems with fault(). Instead, implement a method mapping page-by-page using kmap_local() semantics. As an additional benefit we then avoid the occasional global TLB flushes of ioremap() and consuming ioremap space, elimination of a critical point of failure and with a slight change of semantics we could also push the memcpy out async for testing and async driver development purposes.

A special linear iomem iterator is introduced internally to mimic the old ioremap behaviour for code-paths that can't immediately be ported over. This adds to the code size and should be considered a temporary solution.

Looking at the code we have a lot of checks for iomap tagged pointers. Ideally we should extend the core memremap functions to also accept uncached memory and kmap_local functionality. Then we could strip a lot of code.

Cc: Christian König christian.koenig@amd.com Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com --- drivers/gpu/drm/ttm/ttm_bo_util.c | 468 ++++++++++++++++++++---------- include/drm/ttm/ttm_bo_driver.h | 94 ++++++ 2 files changed, 407 insertions(+), 155 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index ae8b61460724..bad9b16e96ba 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -35,11 +35,13 @@ #include <linux/dma-buf-map.h> #include <linux/io.h> #include <linux/highmem.h> +#include <linux/io-mapping.h> #include <linux/wait.h> #include <linux/slab.h> #include <linux/vmalloc.h> #include <linux/module.h> #include <linux/dma-resv.h> +#include <linux/scatterlist.h>

struct ttm_transfer_obj { struct ttm_buffer_object base; @@ -72,190 +74,366 @@ void ttm_mem_io_free(struct ttm_device *bdev, mem->bus.addr = NULL; }

-static int ttm_resource_ioremap(struct ttm_device *bdev, - struct ttm_resource *mem, - void **virtual) +static pgprot_t ttm_prot_from_caching(enum ttm_caching caching, pgprot_t tmp) { - int ret; - void *addr; + /* Cached mappings need no adjustment */ + if (caching == ttm_cached) + return tmp;

- *virtual = NULL; - ret = ttm_mem_io_reserve(bdev, mem); - if (ret || !mem->bus.is_iomem) - return ret; +#if defined(__i386__) || defined(__x86_64__) + if (caching == ttm_write_combined) + tmp = pgprot_writecombine(tmp); + else if (boot_cpu_data.x86 > 3) + tmp = pgprot_noncached(tmp); +#endif +#if defined(__ia64__) || defined(__arm__) || defined(__aarch64__) || \ + defined(__powerpc__) || defined(__mips__) + if (caching == ttm_write_combined) + tmp = pgprot_writecombine(tmp); + else + tmp = pgprot_noncached(tmp); +#endif +#if defined(__sparc__) + tmp = pgprot_noncached(tmp); +#endif + return tmp; +}

- if (mem->bus.addr) { - addr = mem->bus.addr; - } else { - size_t bus_size = (size_t)mem->num_pages << PAGE_SHIFT; +static void ttm_kmap_iter_tt_kmap_local(struct ttm_kmap_iter *iter, + struct dma_buf_map *dmap, + pgoff_t i) +{ + struct ttm_kmap_iter_tt *iter_tt = + container_of(iter, typeof(*iter_tt), base);

- if (mem->bus.caching == ttm_write_combined) - addr = ioremap_wc(mem->bus.offset, bus_size); -#ifdef CONFIG_X86 - else if (mem->bus.caching == ttm_cached) - addr = ioremap_cache(mem->bus.offset, bus_size); -#endif - else - addr = ioremap(mem->bus.offset, bus_size); - if (!addr) { - ttm_mem_io_free(bdev, mem); - return -ENOMEM; - } + dma_buf_map_set_vaddr(dmap, kmap_local_page_prot(iter_tt->tt->pages[i], + iter_tt->prot)); +} + +static void ttm_kmap_iter_iomap_kmap_local(struct ttm_kmap_iter *iter, + struct dma_buf_map *dmap, + pgoff_t i) +{ + struct ttm_kmap_iter_iomap *iter_io = + container_of(iter, typeof(*iter_io), base); + void __iomem *addr; + +retry: + while (i >= iter_io->cache.end) { + iter_io->cache.sg = iter_io->cache.sg ? + sg_next(iter_io->cache.sg) : iter_io->st->sgl; + iter_io->cache.i = iter_io->cache.end; + iter_io->cache.end += sg_dma_len(iter_io->cache.sg) >> + PAGE_SHIFT; + iter_io->cache.offs = sg_dma_address(iter_io->cache.sg) - + iter_io->start; } - *virtual = addr; - return 0; + + if (i < iter_io->cache.i) { + iter_io->cache.end = 0; + iter_io->cache.sg = NULL; + goto retry; + } + + addr = io_mapping_map_local_wc(iter_io->iomap, iter_io->cache.offs + + (((resource_size_t)i - iter_io->cache.i) + << PAGE_SHIFT)); + dma_buf_map_set_vaddr_iomem(dmap, addr); }

-static void ttm_resource_iounmap(struct ttm_device *bdev, - struct ttm_resource *mem, - void *virtual) +static const struct ttm_kmap_iter_ops ttm_kmap_iter_tt_ops = { + .kmap_local = ttm_kmap_iter_tt_kmap_local, + .needs_unmap = true +}; + +static const struct ttm_kmap_iter_ops ttm_kmap_iter_io_ops = { + .kmap_local = ttm_kmap_iter_iomap_kmap_local, + .needs_unmap = true +}; + +/* If needed, make unmap functionality part of ttm_kmap_iter_ops */ +static void kunmap_local_iter(struct ttm_kmap_iter *iter, + struct dma_buf_map *map) { - if (virtual && mem->bus.addr == NULL) - iounmap(virtual); - ttm_mem_io_free(bdev, mem); + if (!iter->ops->needs_unmap) + return; + + if (map->is_iomem) + io_mapping_unmap_local(map->vaddr_iomem); + else + kunmap_local(map->vaddr); }

-static int ttm_copy_io_page(void *dst, void *src, unsigned long page) +/** + * ttm_move_memcpy - Helper to perform a memcpy ttm move operation. + * @bo: The struct ttm_buffer_object. + * @new_mem: The struct ttm_resource we're moving to (copy destination). + * @new_iter: A struct ttm_kmap_iter representing the destination resource. + * @old_iter: A struct ttm_kmap_iter representing the source resource. + * + * This function is intended to be able to move out async under a + * dma-fence if desired. + */ +void ttm_move_memcpy(struct ttm_buffer_object *bo, + struct ttm_resource *new_mem, + struct ttm_kmap_iter *new_iter, + struct ttm_kmap_iter *old_iter) { - uint32_t *dstP = - (uint32_t *) ((unsigned long)dst + (page << PAGE_SHIFT)); - uint32_t *srcP = - (uint32_t *) ((unsigned long)src + (page << PAGE_SHIFT)); - - int i; - for (i = 0; i < PAGE_SIZE / sizeof(uint32_t); ++i) - iowrite32(ioread32(srcP++), dstP++); - return 0; + struct ttm_device *bdev = bo->bdev; + struct ttm_resource_manager *man = ttm_manager_type(bdev, new_mem->mem_type); + struct ttm_tt *ttm = bo->ttm; + struct ttm_resource *old_mem = &bo->mem; + struct ttm_resource_manager *old_man = ttm_manager_type(bdev, old_mem->mem_type); + struct dma_buf_map old_map, new_map; + pgoff_t i; + + /* Single TTM move. NOP */ + if (old_man->use_tt && man->use_tt) + return; + + /* Don't move nonexistent data. Clear destination instead. */ + if (old_man->use_tt && !man->use_tt && + (!ttm || !ttm_tt_is_populated(ttm))) { + if (ttm && !(ttm->page_flags & TTM_PAGE_FLAG_ZERO_ALLOC)) + return; + + for (i = 0; i < new_mem->num_pages; ++i) { + new_iter->ops->kmap_local(new_iter, &new_map, i); + if (new_map.is_iomem) + memset_io(new_map.vaddr_iomem, 0, PAGE_SIZE); + else + memset(new_map.vaddr, 0, PAGE_SIZE); + kunmap_local_iter(new_iter, &new_map); + } + return; + } + + for (i = 0; i < new_mem->num_pages; ++i) { + new_iter->ops->kmap_local(new_iter, &new_map, i); + old_iter->ops->kmap_local(old_iter, &old_map, i); + + if (!old_map.is_iomem && !new_map.is_iomem) { + memcpy(new_map.vaddr, old_map.vaddr, PAGE_SIZE); + } else if (!old_map.is_iomem) { + dma_buf_map_memcpy_to(&new_map, old_map.vaddr, + PAGE_SIZE); + } else if (!new_map.is_iomem) { + memcpy_fromio(new_map.vaddr, old_map.vaddr_iomem, + PAGE_SIZE); + } else { + int j; + u32 __iomem *src = old_map.vaddr_iomem; + u32 __iomem *dst = new_map.vaddr_iomem; + + for (j = 0; j < (PAGE_SIZE >> 2); ++j) + iowrite32(ioread32(src++), dst++); + } + kunmap_local_iter(old_iter, &old_map); + kunmap_local_iter(new_iter, &new_map); + } } +EXPORT_SYMBOL(ttm_move_memcpy);

-static int ttm_copy_io_ttm_page(struct ttm_tt *ttm, void *src, - unsigned long page, - pgprot_t prot) +/** + * ttm_kmap_iter_iomap_init - Initialize a struct ttm_kmap_iter_iomap + * @iter_io: The struct ttm_kmap_iter_iomap to initialize. + * @iomap: The struct io_mapping representing the underlying linear io_memory. + * @st: sg_table into @iomap, representing the memory of the struct + * ttm_resource. + * @start: Offset that needs to be subtracted from @st to make + * sg_dma_address(st->sgl) - @start == 0 for @iomap start. + * + * Return: Pointer to the embedded struct ttm_kmap_iter. + */ +struct ttm_kmap_iter * +ttm_kmap_iter_iomap_init(struct ttm_kmap_iter_iomap *iter_io, + struct io_mapping *iomap, + struct sg_table *st, + resource_size_t start) { - struct page *d = ttm->pages[page]; - void *dst; + iter_io->base.ops = &ttm_kmap_iter_io_ops; + iter_io->iomap = iomap; + iter_io->st = st; + iter_io->start = start; + memset(&iter_io->cache, 0, sizeof(iter_io->cache));

- if (!d) - return -ENOMEM; + return &iter_io->base; +} +EXPORT_SYMBOL(ttm_kmap_iter_iomap_init);

- src = (void *)((unsigned long)src + (page << PAGE_SHIFT)); - dst = kmap_atomic_prot(d, prot); - if (!dst) - return -ENOMEM; +/** + * ttm_kmap_iter_tt_init - Initialize a struct ttm_kmap_iter_tt + * @iter_tt: The struct ttm_kmap_iter_tt to initialize. + * @tt: Struct ttm_tt holding page pointers of the struct ttm_resource. + * + * Return: Pointer to the embedded struct ttm_kmap_iter. + */ +struct ttm_kmap_iter * +ttm_kmap_iter_tt_init(struct ttm_kmap_iter_tt *iter_tt, + struct ttm_tt *tt) +{ + iter_tt->base.ops = &ttm_kmap_iter_tt_ops; + iter_tt->tt = tt; + iter_tt->prot = ttm_prot_from_caching(tt->caching, PAGE_KERNEL);

- memcpy_fromio(dst, src, PAGE_SIZE); + return &iter_tt->base; +} +EXPORT_SYMBOL(ttm_kmap_iter_tt_init);

- kunmap_atomic(dst); +/** + * DOC: Linear io iterator + * + * This code should die in the not too near future. Best would be if we could + * make io-mapping use memremap for all io memory, and have memremap + * implement a kmap_local functionality. We could then strip a huge amount of + * code. These linear io iterators are implemented to mimic old functionality, + * and they don't use kmap_local semantics at all internally. Rather ioremap or + * friends, and at least on 32-bit they add global TLB flushes and points + * of failure. + */

- return 0; +/** + * struct ttm_kmap_iter_linear_io - Iterator specialization for linear io + * @base: The base iterator + * @dmap: Points to the starting address of the region + * @needs_unmap: Whether we need to unmap on fini + */ +struct ttm_kmap_iter_linear_io { + struct ttm_kmap_iter base; + struct dma_buf_map dmap; + bool needs_unmap; +}; + +static void ttm_kmap_iter_linear_io_kmap_local(struct ttm_kmap_iter *iter, + struct dma_buf_map *dmap, + pgoff_t i) +{ + struct ttm_kmap_iter_linear_io *iter_io = + container_of(iter, typeof(*iter_io), base); + + *dmap = iter_io->dmap; + dma_buf_map_incr(dmap, i * PAGE_SIZE); }

-static int ttm_copy_ttm_io_page(struct ttm_tt *ttm, void *dst, - unsigned long page, - pgprot_t prot) +static const struct ttm_kmap_iter_ops ttm_kmap_iter_linear_io_ops = { + .kmap_local = ttm_kmap_iter_linear_io_kmap_local +}; + +static struct ttm_kmap_iter * +ttm_kmap_iter_linear_io_init(struct ttm_kmap_iter_linear_io *iter_io, + struct ttm_device *bdev, + struct ttm_resource *mem) { - struct page *s = ttm->pages[page]; - void *src; + int ret;

- if (!s) - return -ENOMEM; + ret = ttm_mem_io_reserve(bdev, mem); + if (ret) + goto out_err; + if (!mem->bus.is_iomem) { + ret = -EINVAL; + goto out_io_free; + }

- dst = (void *)((unsigned long)dst + (page << PAGE_SHIFT)); - src = kmap_atomic_prot(s, prot); - if (!src) - return -ENOMEM; + if (mem->bus.addr) { + dma_buf_map_set_vaddr(&iter_io->dmap, mem->bus.addr); + iter_io->needs_unmap = false; + } else { + size_t bus_size = (size_t)mem->num_pages << PAGE_SHIFT;

- memcpy_toio(dst, src, PAGE_SIZE); + iter_io->needs_unmap = true; + if (mem->bus.caching == ttm_write_combined) + dma_buf_map_set_vaddr_iomem(&iter_io->dmap, + ioremap_wc(mem->bus.offset, + bus_size)); + else if (mem->bus.caching == ttm_cached) + dma_buf_map_set_vaddr(&iter_io->dmap, + memremap(mem->bus.offset, bus_size, + MEMREMAP_WB)); + else + dma_buf_map_set_vaddr_iomem(&iter_io->dmap, + ioremap(mem->bus.offset, + bus_size)); + if (dma_buf_map_is_null(&iter_io->dmap)) { + ret = -ENOMEM; + goto out_io_free; + } + }

- kunmap_atomic(src); + iter_io->base.ops = &ttm_kmap_iter_linear_io_ops; + return &iter_io->base;

- return 0; +out_io_free: + ttm_mem_io_free(bdev, mem); +out_err: + return ERR_PTR(ret); +} + +static void +ttm_kmap_iter_linear_io_fini(struct ttm_kmap_iter_linear_io *iter_io, + struct ttm_device *bdev, + struct ttm_resource *mem) +{ + if (iter_io->needs_unmap && dma_buf_map_is_set(&iter_io->dmap)) { + if (iter_io->dmap.is_iomem) + iounmap(iter_io->dmap.vaddr_iomem); + else + memunmap(iter_io->dmap.vaddr); + } + + ttm_mem_io_free(bdev, mem); }

+static int ttm_bo_wait_free_node(struct ttm_buffer_object *bo, + bool dst_use_tt); + int ttm_bo_move_memcpy(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx, struct ttm_resource *new_mem) { struct ttm_device *bdev = bo->bdev; struct ttm_resource_manager *man = ttm_manager_type(bdev, new_mem->mem_type); + struct ttm_resource_manager *new_man = + ttm_manager_type(bo->bdev, new_mem->mem_type); struct ttm_tt *ttm = bo->ttm; struct ttm_resource *old_mem = &bo->mem; struct ttm_resource old_copy = *old_mem; - void *old_iomap; - void *new_iomap; + union { + struct ttm_kmap_iter_tt tt; + struct ttm_kmap_iter_linear_io io; + } _new_iter, _old_iter; + struct ttm_kmap_iter *new_iter, *old_iter; int ret; - unsigned long i; - - ret = ttm_bo_wait_ctx(bo, ctx); - if (ret) - return ret; - - ret = ttm_resource_ioremap(bdev, old_mem, &old_iomap); - if (ret) - return ret; - ret = ttm_resource_ioremap(bdev, new_mem, &new_iomap); - if (ret) - goto out; - - /* - * Single TTM move. NOP. - */ - if (old_iomap == NULL && new_iomap == NULL) - goto out2;

- /* - * Don't move nonexistent data. Clear destination instead. - */ - if (old_iomap == NULL && - (ttm == NULL || (!ttm_tt_is_populated(ttm) && - !(ttm->page_flags & TTM_PAGE_FLAG_SWAPPED)))) { - memset_io(new_iomap, 0, new_mem->num_pages*PAGE_SIZE); - goto out2; - } - - /* - * TTM might be null for moves within the same region. - */ if (ttm) { ret = ttm_tt_populate(bdev, ttm, ctx); if (ret) - goto out1; + return ret; }

- for (i = 0; i < new_mem->num_pages; ++i) { - if (old_iomap == NULL) { - pgprot_t prot = ttm_io_prot(bo, old_mem, PAGE_KERNEL); - ret = ttm_copy_ttm_io_page(ttm, new_iomap, i, - prot); - } else if (new_iomap == NULL) { - pgprot_t prot = ttm_io_prot(bo, new_mem, PAGE_KERNEL); - ret = ttm_copy_io_ttm_page(ttm, old_iomap, i, - prot); - } else { - ret = ttm_copy_io_page(new_iomap, old_iomap, i); - } - if (ret) - goto out1; + new_iter = new_man->use_tt ? + ttm_kmap_iter_tt_init(&_new_iter.tt, bo->ttm) : + ttm_kmap_iter_linear_io_init(&_new_iter.io, bdev, new_mem); + if (IS_ERR(new_iter)) + return PTR_ERR(new_iter); + + old_iter = man->use_tt ? + ttm_kmap_iter_tt_init(&_old_iter.tt, bo->ttm) : + ttm_kmap_iter_linear_io_init(&_old_iter.io, bdev, old_mem); + if (IS_ERR(old_iter)) { + ret = PTR_ERR(old_iter); + goto out_old_iter; } - mb(); -out2: - old_copy = *old_mem;

- ttm_bo_assign_mem(bo, new_mem); + ttm_move_memcpy(bo, new_mem, new_iter, old_iter); + old_copy = *old_mem; + ret = ttm_bo_wait_free_node(bo, new_man->use_tt);

if (!man->use_tt) - ttm_bo_tt_destroy(bo); + ttm_kmap_iter_linear_io_fini(&_old_iter.io, bdev, &old_copy); +out_old_iter: + if (!new_man->use_tt) + ttm_kmap_iter_linear_io_fini(&_new_iter.io, bdev, new_mem);

-out1: - ttm_resource_iounmap(bdev, old_mem, new_iomap); -out: - ttm_resource_iounmap(bdev, &old_copy, old_iomap); - - /* - * On error, keep the mm node! - */ - if (!ret) - ttm_resource_free(bo, &old_copy); return ret; } EXPORT_SYMBOL(ttm_bo_move_memcpy); @@ -336,27 +514,7 @@ pgprot_t ttm_io_prot(struct ttm_buffer_object *bo, struct ttm_resource *res, man = ttm_manager_type(bo->bdev, res->mem_type); caching = man->use_tt ? bo->ttm->caching : res->bus.caching;

- /* Cached mappings need no adjustment */ - if (caching == ttm_cached) - return tmp; - -#if defined(__i386__) || defined(__x86_64__) - if (caching == ttm_write_combined) - tmp = pgprot_writecombine(tmp); - else if (boot_cpu_data.x86 > 3) - tmp = pgprot_noncached(tmp); -#endif -#if defined(__ia64__) || defined(__arm__) || defined(__aarch64__) || \ - defined(__powerpc__) || defined(__mips__) - if (caching == ttm_write_combined) - tmp = pgprot_writecombine(tmp); - else - tmp = pgprot_noncached(tmp); -#endif -#if defined(__sparc__) - tmp = pgprot_noncached(tmp); -#endif - return tmp; + return ttm_prot_from_caching(caching, tmp); } EXPORT_SYMBOL(ttm_io_prot);

diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h index dbccac957f8f..30f1dce987b2 100644 --- a/include/drm/ttm/ttm_bo_driver.h +++ b/include/drm/ttm/ttm_bo_driver.h @@ -332,4 +332,98 @@ int ttm_range_man_init(struct ttm_device *bdev, int ttm_range_man_fini(struct ttm_device *bdev, unsigned type);

+struct dma_buf_map; +struct io_mapping; +struct sg_table; +struct scatterlist; +struct ttm_kmap_iter; + +/** + * struct ttm_kmap_iter_ops - Ops structure for a struct + * ttm_kmap_iter. + * @needs_unmap - Whether a kunmap_local is needed to balance @kmap_local. + */ +struct ttm_kmap_iter_ops { + /** + * kmap_local - Map a PAGE_SIZE part of the resource using + * kmap_local semantics. + * @res_kmap: Pointer to the struct ttm_kmap_iter representing + * the resource. + * @dmap: The struct dma_buf_map holding the virtual address after + * the operation. + * @i: The location within the resource to map. PAGE_SIZE granularity. + */ + void (*kmap_local)(struct ttm_kmap_iter *res_kmap, + struct dma_buf_map *dmap, pgoff_t i); + bool needs_unmap; +}; + +/** + * struct ttm_kmap_iter - Iterator for kmap_local type operations on a + * resource. + * @ops: Pointer to the operations struct. + * + * This struct is intended to be embedded in a resource-specific specialization + * implementing operations for the resource. + * + * Nothing stops us from extending the operations to vmap, vmap_pfn etc, + * replacing some or parts of the ttm_bo_util. cpu-map functionality. + */ +struct ttm_kmap_iter { + const struct ttm_kmap_iter_ops *ops; +}; + +/** + * struct ttm_kmap_iter_tt - Specialization for a tt (page) backed struct + * ttm_resource. + * @base: Embedded struct ttm_kmap_iter providing the usage interface + * @tt: Cached struct ttm_tt. + */ +struct ttm_kmap_iter_tt { + struct ttm_kmap_iter base; + struct ttm_tt *tt; + pgprot_t prot; +}; + +/** + * struct ttm_kmap_iter_iomap - Specialization for a struct io_mapping + + * struct sg_table backed struct ttm_resource. + * @base: Embedded struct ttm_kmap_iter providing the usage interface. + * @iomap: struct io_mapping representing the underlying linear io_memory. + * @st: sg_table into @iomap, representing the memory of the struct ttm_resource. + * @start: Offset that needs to be subtracted from @st to make + * sg_dma_address(st->sgl) - @start == 0 for @iomap start. + * @cache: Scatterlist traversal cache for fast lookups. + * @cache.sg: Pointer to the currently cached scatterlist segment. + * @cache.i: First index of @sg. PAGE_SIZE granularity. + * @cache.end: Last index + 1 of @sg. PAGE_SIZE granularity. + * @cache.offs: First offset into @iomap of @sg. PAGE_SIZE granularity. + */ +struct ttm_kmap_iter_iomap { + struct ttm_kmap_iter base; + struct io_mapping *iomap; + struct sg_table *st; + resource_size_t start; + struct { + struct scatterlist *sg; + pgoff_t i; + pgoff_t end; + pgoff_t offs; + } cache; +}; + +void ttm_move_memcpy(struct ttm_buffer_object *bo, + struct ttm_resource *new_mem, + struct ttm_kmap_iter *new_iter, + struct ttm_kmap_iter *old_iter); + +struct ttm_kmap_iter * +ttm_kmap_iter_tt_init(struct ttm_kmap_iter_tt *iter_tt, + struct ttm_tt *tt); + +struct ttm_kmap_iter * +ttm_kmap_iter_iomap_init(struct ttm_kmap_iter_iomap *iter_io, + struct io_mapping *iomap, + struct sg_table *st, + resource_size_t start); #endif

-- 2.31.1

Christian König

21 May 21 May

8:08 a.m.

New subject: [RFC PATCH 1/5] drm/ttm: Add a generic TTM memcpy move for page-based iomem

Am 20.05.21 um 17:09 schrieb Thomas Hellström:

...

The internal ttm_bo_util memcpy uses ioremap functionality, and while it probably might be possible to use it for copying in- and out of sglist represented io memory, using io_mem_reserve() / io_mem_free() callbacks, that would cause problems with fault(). Instead, implement a method mapping page-by-page using kmap_local() semantics. As an additional benefit we then avoid the occasional global TLB flushes of ioremap() and consuming ioremap space, elimination of a critical point of failure and with a slight change of semantics we could also push the memcpy out async for testing and async driver development purposes.

A special linear iomem iterator is introduced internally to mimic the old ioremap behaviour for code-paths that can't immediately be ported over. This adds to the code size and should be considered a temporary solution.

Looking at the code we have a lot of checks for iomap tagged pointers. Ideally we should extend the core memremap functions to also accept uncached memory and kmap_local functionality. Then we could strip a lot of code.

Cc: Christian König christian.koenig@amd.com Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com

drivers/gpu/drm/ttm/ttm_bo_util.c | 468 ++++++++++++++++++++---------- include/drm/ttm/ttm_bo_driver.h | 94 ++++++ 2 files changed, 407 insertions(+), 155 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index ae8b61460724..bad9b16e96ba 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -35,11 +35,13 @@ #include <linux/dma-buf-map.h> #include <linux/io.h> #include <linux/highmem.h> +#include <linux/io-mapping.h> #include <linux/wait.h> #include <linux/slab.h> #include <linux/vmalloc.h> #include <linux/module.h> #include <linux/dma-resv.h> +#include <linux/scatterlist.h>

struct ttm_transfer_obj { struct ttm_buffer_object base; @@ -72,190 +74,366 @@ void ttm_mem_io_free(struct ttm_device *bdev, mem->bus.addr = NULL; }

-static int ttm_resource_ioremap(struct ttm_device *bdev,
	       struct ttm_resource *mem,
	       void **virtual)
+static pgprot_t ttm_prot_from_caching(enum ttm_caching caching, pgprot_t tmp)

Could you move this to ttm_module.c, creating a separate ttm_caching.c sounds overkill but it is rather independent from the BO object.

...

{

int ret;

void *addr;
/* Cached mappings need no adjustment */

if (caching == ttm_cached)
return tmp;
*virtual = NULL;

ret = ttm_mem_io_reserve(bdev, mem);

if (ret || !mem->bus.is_iomem)
return ret;
+#if defined(__i386__) || defined(__x86_64__)
if (caching == ttm_write_combined)
tmp = pgprot_writecombine(tmp);
else if (boot_cpu_data.x86 > 3)
tmp = pgprot_noncached(tmp);
+#endif +#if defined(__ia64__) || defined(__arm__) || defined(__aarch64__) || \
defined(__powerpc__) || defined(__mips__)

if (caching == ttm_write_combined)
tmp = pgprot_writecombine(tmp);
else
tmp = pgprot_noncached(tmp);
+#endif +#if defined(__sparc__)

tmp = pgprot_noncached(tmp);

+#endif

return tmp;

+}
if (mem->bus.addr) {
addr = mem->bus.addr;
} else {
size_t bus_size = (size_t)mem->num_pages << PAGE_SHIFT;
+static void ttm_kmap_iter_tt_kmap_local(struct ttm_kmap_iter *iter,
			struct dma_buf_map *dmap,
			pgoff_t i)

I really like the idea of the iterator, but I think we should move the code for iterating over TT and iterating over resource object to ttm_tt.c and ttm_resource.c respectively.

...

+{
struct ttm_kmap_iter_tt *iter_tt =
container_of(iter, typeof(*iter_tt), base);
if (mem->bus.caching == ttm_write_combined)
	addr = ioremap_wc(mem->bus.offset, bus_size);
-#ifdef CONFIG_X86
else if (mem->bus.caching == ttm_cached)
	addr = ioremap_cache(mem->bus.offset, bus_size);
-#endif
else
	addr = ioremap(mem->bus.offset, bus_size);
if (!addr) {
	ttm_mem_io_free(bdev, mem);
	return -ENOMEM;
}
dma_buf_map_set_vaddr(dmap, kmap_local_page_prot(iter_tt->tt->pages[i],
					 iter_tt->prot));
+}

+static void ttm_kmap_iter_iomap_kmap_local(struct ttm_kmap_iter *iter,
			   struct dma_buf_map *dmap,
			   pgoff_t i)
+{
struct ttm_kmap_iter_iomap *iter_io =
container_of(iter, typeof(*iter_io), base);
void __iomem *addr;
+retry:
while (i >= iter_io->cache.end) {
iter_io->cache.sg = iter_io->cache.sg ?
	sg_next(iter_io->cache.sg) : iter_io->st->sgl;
iter_io->cache.i = iter_io->cache.end;
iter_io->cache.end += sg_dma_len(iter_io->cache.sg) >>
	PAGE_SHIFT;
iter_io->cache.offs = sg_dma_address(iter_io->cache.sg) -
	iter_io->start;
}
*virtual = addr;

return 0;
if (i < iter_io->cache.i) {
iter_io->cache.end = 0;
iter_io->cache.sg = NULL;
goto retry;
}

addr = io_mapping_map_local_wc(iter_io->iomap, iter_io->cache.offs +
		       (((resource_size_t)i - iter_io->cache.i)
			<< PAGE_SHIFT));
dma_buf_map_set_vaddr_iomem(dmap, addr); }
-static void ttm_resource_iounmap(struct ttm_device *bdev,
		struct ttm_resource *mem,
		void *virtual)
+static const struct ttm_kmap_iter_ops ttm_kmap_iter_tt_ops = {

.kmap_local = ttm_kmap_iter_tt_kmap_local,

.needs_unmap = true

+};

+static const struct ttm_kmap_iter_ops ttm_kmap_iter_io_ops = {

.kmap_local = ttm_kmap_iter_iomap_kmap_local,

.needs_unmap = true

I think we should rather have an map and an unmap callback instead of the needs_unmap flag.

Christian.

...

+};

+/* If needed, make unmap functionality part of ttm_kmap_iter_ops */ +static void kunmap_local_iter(struct ttm_kmap_iter *iter,
	      struct dma_buf_map *map)
{
if (virtual && mem->bus.addr == NULL)
iounmap(virtual);
ttm_mem_io_free(bdev, mem);
if (!iter->ops->needs_unmap)
return;
if (map->is_iomem)
io_mapping_unmap_local(map->vaddr_iomem);
else
kunmap_local(map->vaddr);
}
-static int ttm_copy_io_page(void *dst, void *src, unsigned long page) +/**

ttm_move_memcpy - Helper to perform a memcpy ttm move operation.

@bo: The struct ttm_buffer_object.

@new_mem: The struct ttm_resource we're moving to (copy destination).

@new_iter: A struct ttm_kmap_iter representing the destination resource.

@old_iter: A struct ttm_kmap_iter representing the source resource.

This function is intended to be able to move out async under a

dma-fence if desired.

*/

+void ttm_move_memcpy(struct ttm_buffer_object *bo,
     struct ttm_resource *new_mem,
     struct ttm_kmap_iter *new_iter,
     struct ttm_kmap_iter *old_iter)
{
uint32_t *dstP =
   (uint32_t *) ((unsigned long)dst + (page << PAGE_SHIFT));
uint32_t *srcP =
   (uint32_t *) ((unsigned long)src + (page << PAGE_SHIFT));
int i;

for (i = 0; i < PAGE_SIZE / sizeof(uint32_t); ++i)
iowrite32(ioread32(srcP++), dstP++);
return 0;
struct ttm_device *bdev = bo->bdev;

struct ttm_resource_manager *man = ttm_manager_type(bdev, new_mem->mem_type);

struct ttm_tt *ttm = bo->ttm;

struct ttm_resource *old_mem = &bo->mem;

struct ttm_resource_manager *old_man = ttm_manager_type(bdev, old_mem->mem_type);

struct dma_buf_map old_map, new_map;

pgoff_t i;

/* Single TTM move. NOP */

if (old_man->use_tt && man->use_tt)
return;
/* Don't move nonexistent data. Clear destination instead. */

if (old_man->use_tt && !man->use_tt &&
   (!ttm || !ttm_tt_is_populated(ttm))) {
if (ttm && !(ttm->page_flags & TTM_PAGE_FLAG_ZERO_ALLOC))
	return;
for (i = 0; i < new_mem->num_pages; ++i) {
	new_iter->ops->kmap_local(new_iter, &new_map, i);
	if (new_map.is_iomem)
		memset_io(new_map.vaddr_iomem, 0, PAGE_SIZE);
	else
		memset(new_map.vaddr, 0, PAGE_SIZE);
	kunmap_local_iter(new_iter, &new_map);
}
return;
}

for (i = 0; i < new_mem->num_pages; ++i) {
new_iter->ops->kmap_local(new_iter, &new_map, i);
old_iter->ops->kmap_local(old_iter, &old_map, i);
if (!old_map.is_iomem && !new_map.is_iomem) {
	memcpy(new_map.vaddr, old_map.vaddr, PAGE_SIZE);
} else if (!old_map.is_iomem) {
	dma_buf_map_memcpy_to(&new_map, old_map.vaddr,
			      PAGE_SIZE);
} else if (!new_map.is_iomem) {
	memcpy_fromio(new_map.vaddr, old_map.vaddr_iomem,
		      PAGE_SIZE);
} else {
	int j;
	u32 __iomem *src = old_map.vaddr_iomem;
	u32 __iomem *dst = new_map.vaddr_iomem;
	for (j = 0; j < (PAGE_SIZE >> 2); ++j)
		iowrite32(ioread32(src++), dst++);
}
kunmap_local_iter(old_iter, &old_map);
kunmap_local_iter(new_iter, &new_map);
} }
+EXPORT_SYMBOL(ttm_move_memcpy);

-static int ttm_copy_io_ttm_page(struct ttm_tt *ttm, void *src,
		unsigned long page,
		pgprot_t prot)
+/**

ttm_kmap_iter_iomap_init - Initialize a struct ttm_kmap_iter_iomap

@iter_io: The struct ttm_kmap_iter_iomap to initialize.

@iomap: The struct io_mapping representing the underlying linear io_memory.

@st: sg_table into @iomap, representing the memory of the struct

ttm_resource.

@start: Offset that needs to be subtracted from @st to make

sg_dma_address(st->sgl) - @start == 0 for @iomap start.

Return: Pointer to the embedded struct ttm_kmap_iter.

*/

+struct ttm_kmap_iter * +ttm_kmap_iter_iomap_init(struct ttm_kmap_iter_iomap *iter_io,
	 struct io_mapping *iomap,
	 struct sg_table *st,
	 resource_size_t start)
{
struct page *d = ttm->pages[page];

void *dst;

iter_io->base.ops = &ttm_kmap_iter_io_ops;

iter_io->iomap = iomap;

iter_io->st = st;

iter_io->start = start;

memset(&iter_io->cache, 0, sizeof(iter_io->cache));
if (!d)
return -ENOMEM;
return &iter_io->base;

+} +EXPORT_SYMBOL(ttm_kmap_iter_iomap_init);
src = (void *)((unsigned long)src + (page << PAGE_SHIFT));

dst = kmap_atomic_prot(d, prot);

if (!dst)
return -ENOMEM;
+/**

ttm_kmap_iter_tt_init - Initialize a struct ttm_kmap_iter_tt

@iter_tt: The struct ttm_kmap_iter_tt to initialize.

@tt: Struct ttm_tt holding page pointers of the struct ttm_resource.

Return: Pointer to the embedded struct ttm_kmap_iter.

*/

+struct ttm_kmap_iter * +ttm_kmap_iter_tt_init(struct ttm_kmap_iter_tt *iter_tt,
      struct ttm_tt *tt)
+{

iter_tt->base.ops = &ttm_kmap_iter_tt_ops;

iter_tt->tt = tt;

iter_tt->prot = ttm_prot_from_caching(tt->caching, PAGE_KERNEL);

memcpy_fromio(dst, src, PAGE_SIZE);

return &iter_tt->base;

+} +EXPORT_SYMBOL(ttm_kmap_iter_tt_init);

kunmap_atomic(dst);

+/**

DOC: Linear io iterator

This code should die in the not too near future. Best would be if we could

make io-mapping use memremap for all io memory, and have memremap

implement a kmap_local functionality. We could then strip a huge amount of

code. These linear io iterators are implemented to mimic old functionality,

and they don't use kmap_local semantics at all internally. Rather ioremap or

friends, and at least on 32-bit they add global TLB flushes and points

of failure.

*/

return 0;

+/**

struct ttm_kmap_iter_linear_io - Iterator specialization for linear io

@base: The base iterator

@dmap: Points to the starting address of the region

@needs_unmap: Whether we need to unmap on fini

*/

+struct ttm_kmap_iter_linear_io {

struct ttm_kmap_iter base;

struct dma_buf_map dmap;

bool needs_unmap;

+};

+static void ttm_kmap_iter_linear_io_kmap_local(struct ttm_kmap_iter *iter,
			       struct dma_buf_map *dmap,
			       pgoff_t i)
+{
struct ttm_kmap_iter_linear_io *iter_io =
container_of(iter, typeof(*iter_io), base);
*dmap = iter_io->dmap;

dma_buf_map_incr(dmap, i * PAGE_SIZE); }
-static int ttm_copy_ttm_io_page(struct ttm_tt *ttm, void *dst,
		unsigned long page,
		pgprot_t prot)
+static const struct ttm_kmap_iter_ops ttm_kmap_iter_linear_io_ops = {

.kmap_local = ttm_kmap_iter_linear_io_kmap_local

+};

+static struct ttm_kmap_iter * +ttm_kmap_iter_linear_io_init(struct ttm_kmap_iter_linear_io *iter_io,
	     struct ttm_device *bdev,
	     struct ttm_resource *mem)
{
struct page *s = ttm->pages[page];

void *src;

int ret;
if (!s)
return -ENOMEM;
ret = ttm_mem_io_reserve(bdev, mem);

if (ret)
goto out_err;
if (!mem->bus.is_iomem) {
ret = -EINVAL;
goto out_io_free;
}
dst = (void *)((unsigned long)dst + (page << PAGE_SHIFT));

src = kmap_atomic_prot(s, prot);

if (!src)
return -ENOMEM;
if (mem->bus.addr) {
dma_buf_map_set_vaddr(&iter_io->dmap, mem->bus.addr);
iter_io->needs_unmap = false;
} else {
size_t bus_size = (size_t)mem->num_pages << PAGE_SHIFT;
memcpy_toio(dst, src, PAGE_SIZE);
iter_io->needs_unmap = true;
if (mem->bus.caching == ttm_write_combined)
	dma_buf_map_set_vaddr_iomem(&iter_io->dmap,
				    ioremap_wc(mem->bus.offset,
					       bus_size));
else if (mem->bus.caching == ttm_cached)
	dma_buf_map_set_vaddr(&iter_io->dmap,
			      memremap(mem->bus.offset, bus_size,
				       MEMREMAP_WB));
else
	dma_buf_map_set_vaddr_iomem(&iter_io->dmap,
				    ioremap(mem->bus.offset,
					    bus_size));
if (dma_buf_map_is_null(&iter_io->dmap)) {
	ret = -ENOMEM;
	goto out_io_free;
}
}
kunmap_atomic(src);

iter_io->base.ops = &ttm_kmap_iter_linear_io_ops;

return &iter_io->base;

return 0;

+out_io_free:

ttm_mem_io_free(bdev, mem);

+out_err:

return ERR_PTR(ret);

+}

+static void +ttm_kmap_iter_linear_io_fini(struct ttm_kmap_iter_linear_io *iter_io,
	     struct ttm_device *bdev,
	     struct ttm_resource *mem)
+{
if (iter_io->needs_unmap && dma_buf_map_is_set(&iter_io->dmap)) {
if (iter_io->dmap.is_iomem)
	iounmap(iter_io->dmap.vaddr_iomem);
else
	memunmap(iter_io->dmap.vaddr);
}

ttm_mem_io_free(bdev, mem); }
+static int ttm_bo_wait_free_node(struct ttm_buffer_object *bo,
		 bool dst_use_tt);
int ttm_bo_move_memcpy(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx, struct ttm_resource *new_mem) { struct ttm_device *bdev = bo->bdev; struct ttm_resource_manager *man = ttm_manager_type(bdev, new_mem->mem_type);

struct ttm_resource_manager *new_man =
ttm_manager_type(bo->bdev, new_mem->mem_type);
struct ttm_tt *ttm = bo->ttm; struct ttm_resource *old_mem = &bo->mem; struct ttm_resource old_copy = *old_mem;
void *old_iomap;

void *new_iomap;
union {
struct ttm_kmap_iter_tt tt;
struct ttm_kmap_iter_linear_io io;
} _new_iter, _old_iter;

struct ttm_kmap_iter *new_iter, *old_iter; int ret;
unsigned long i;

ret = ttm_bo_wait_ctx(bo, ctx);

if (ret)
return ret;
ret = ttm_resource_ioremap(bdev, old_mem, &old_iomap);

if (ret)
return ret;
ret = ttm_resource_ioremap(bdev, new_mem, &new_iomap);

if (ret)
goto out;
/*
* Single TTM move. NOP.
*/
if (old_iomap == NULL && new_iomap == NULL)
goto out2;
/*
* Don't move nonexistent data. Clear destination instead.
*/
if (old_iomap == NULL &&
   (ttm == NULL || (!ttm_tt_is_populated(ttm) &&
	     !(ttm->page_flags & TTM_PAGE_FLAG_SWAPPED)))) {
memset_io(new_iomap, 0, new_mem->num_pages*PAGE_SIZE);
goto out2;
}

/*
* TTM might be null for moves within the same region.
*/
if (ttm) { ret = ttm_tt_populate(bdev, ttm, ctx); if (ret)
	goto out1;
	return ret;
}
for (i = 0; i < new_mem->num_pages; ++i) {
if (old_iomap == NULL) {
	pgprot_t prot = ttm_io_prot(bo, old_mem, PAGE_KERNEL);
	ret = ttm_copy_ttm_io_page(ttm, new_iomap, i,
				   prot);
} else if (new_iomap == NULL) {
	pgprot_t prot = ttm_io_prot(bo, new_mem, PAGE_KERNEL);
	ret = ttm_copy_io_ttm_page(ttm, old_iomap, i,
				   prot);
} else {
	ret = ttm_copy_io_page(new_iomap, old_iomap, i);
}
if (ret)
	goto out1;
new_iter = new_man->use_tt ?
ttm_kmap_iter_tt_init(&_new_iter.tt, bo->ttm) :
ttm_kmap_iter_linear_io_init(&_new_iter.io, bdev, new_mem);
if (IS_ERR(new_iter))
return PTR_ERR(new_iter);
old_iter = man->use_tt ?
ttm_kmap_iter_tt_init(&_old_iter.tt, bo->ttm) :
ttm_kmap_iter_linear_io_init(&_old_iter.io, bdev, old_mem);
if (IS_ERR(old_iter)) {
ret = PTR_ERR(old_iter);
goto out_old_iter;
}
mb();

-out2:

old_copy = *old_mem;

ttm_bo_assign_mem(bo, new_mem);

ttm_move_memcpy(bo, new_mem, new_iter, old_iter);

old_copy = *old_mem;

ret = ttm_bo_wait_free_node(bo, new_man->use_tt);

if (!man->use_tt)
ttm_bo_tt_destroy(bo);
ttm_kmap_iter_linear_io_fini(&_old_iter.io, bdev, &old_copy);
+out_old_iter:
if (!new_man->use_tt)
ttm_kmap_iter_linear_io_fini(&_new_iter.io, bdev, new_mem);
-out1:

ttm_resource_iounmap(bdev, old_mem, new_iomap);

-out:
ttm_resource_iounmap(bdev, &old_copy, old_iomap);

/*
* On error, keep the mm node!
*/
if (!ret)
ttm_resource_free(bo, &old_copy);
return ret; } EXPORT_SYMBOL(ttm_bo_move_memcpy);
@@ -336,27 +514,7 @@ pgprot_t ttm_io_prot(struct ttm_buffer_object *bo, struct ttm_resource *res, man = ttm_manager_type(bo->bdev, res->mem_type); caching = man->use_tt ? bo->ttm->caching : res->bus.caching;
/* Cached mappings need no adjustment */

if (caching == ttm_cached)
return tmp;
-#if defined(__i386__) || defined(__x86_64__)
if (caching == ttm_write_combined)
tmp = pgprot_writecombine(tmp);
else if (boot_cpu_data.x86 > 3)
tmp = pgprot_noncached(tmp);
-#endif -#if defined(__ia64__) || defined(__arm__) || defined(__aarch64__) || \
defined(__powerpc__) || defined(__mips__)

if (caching == ttm_write_combined)
tmp = pgprot_writecombine(tmp);
else
tmp = pgprot_noncached(tmp);
-#endif -#if defined(__sparc__)

tmp = pgprot_noncached(tmp);

-#endif

return tmp;

return ttm_prot_from_caching(caching, tmp); } EXPORT_SYMBOL(ttm_io_prot);

diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h index dbccac957f8f..30f1dce987b2 100644 --- a/include/drm/ttm/ttm_bo_driver.h +++ b/include/drm/ttm/ttm_bo_driver.h @@ -332,4 +332,98 @@ int ttm_range_man_init(struct ttm_device *bdev, int ttm_range_man_fini(struct ttm_device *bdev, unsigned type);

+struct dma_buf_map; +struct io_mapping; +struct sg_table; +struct scatterlist; +struct ttm_kmap_iter;

+/**

struct ttm_kmap_iter_ops - Ops structure for a struct

ttm_kmap_iter.

@needs_unmap - Whether a kunmap_local is needed to balance @kmap_local.

*/

+struct ttm_kmap_iter_ops {
/**
* kmap_local - Map a PAGE_SIZE part of the resource using
* kmap_local semantics.
* @res_kmap: Pointer to the struct ttm_kmap_iter representing
* the resource.
* @dmap: The struct dma_buf_map holding the virtual address after
* the operation.
* @i: The location within the resource to map. PAGE_SIZE granularity.
*/
void (*kmap_local)(struct ttm_kmap_iter *res_kmap,
	   struct dma_buf_map *dmap, pgoff_t i);
bool needs_unmap;
+};

+/**

struct ttm_kmap_iter - Iterator for kmap_local type operations on a

resource.

@ops: Pointer to the operations struct.

This struct is intended to be embedded in a resource-specific specialization

implementing operations for the resource.

Nothing stops us from extending the operations to vmap, vmap_pfn etc,

replacing some or parts of the ttm_bo_util. cpu-map functionality.

*/

+struct ttm_kmap_iter {

const struct ttm_kmap_iter_ops *ops;

+};

+/**

struct ttm_kmap_iter_tt - Specialization for a tt (page) backed struct

ttm_resource.

@base: Embedded struct ttm_kmap_iter providing the usage interface

@tt: Cached struct ttm_tt.

*/

+struct ttm_kmap_iter_tt {

struct ttm_kmap_iter base;

struct ttm_tt *tt;

pgprot_t prot;

+};

+/**

struct ttm_kmap_iter_iomap - Specialization for a struct io_mapping +

struct sg_table backed struct ttm_resource.

@base: Embedded struct ttm_kmap_iter providing the usage interface.

@iomap: struct io_mapping representing the underlying linear io_memory.

@st: sg_table into @iomap, representing the memory of the struct ttm_resource.

@start: Offset that needs to be subtracted from @st to make

sg_dma_address(st->sgl) - @start == 0 for @iomap start.

@cache: Scatterlist traversal cache for fast lookups.

@cache.sg: Pointer to the currently cached scatterlist segment.

@cache.i: First index of @sg. PAGE_SIZE granularity.

@cache.end: Last index + 1 of @sg. PAGE_SIZE granularity.

@cache.offs: First offset into @iomap of @sg. PAGE_SIZE granularity.

*/

+struct ttm_kmap_iter_iomap {
struct ttm_kmap_iter base;

struct io_mapping *iomap;

struct sg_table *st;

resource_size_t start;

struct {
struct scatterlist *sg;
pgoff_t i;
pgoff_t end;
pgoff_t offs;
} cache;
+};

+void ttm_move_memcpy(struct ttm_buffer_object *bo,
     struct ttm_resource *new_mem,
     struct ttm_kmap_iter *new_iter,
     struct ttm_kmap_iter *old_iter);
+struct ttm_kmap_iter * +ttm_kmap_iter_tt_init(struct ttm_kmap_iter_tt *iter_tt,
      struct ttm_tt *tt);
+struct ttm_kmap_iter * +ttm_kmap_iter_iomap_init(struct ttm_kmap_iter_iomap *iter_io,
	 struct io_mapping *iomap,
	 struct sg_table *st,
	 resource_size_t start);
#endif

kernel test robot

27 May 27 May

5:07 a.m.

New subject: [drm/ttm] 0c13ca6d7f: WARNING:at_drivers/gpu/drm/drm_fb_helper.c:#drm_fb_helper_damage_work

Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: 0c13ca6d7fbaaf4cc0cfccd94f0ab8ca9af9e81a ("[Intel-gfx] [RFC PATCH 1/5] drm/ttm: Add a generic TTM memcpy move for page-based iomem") url: https://github.com/0day-ci/linux/commits/Thomas-Hellstr-m/Core-TTM-changes-f...

in testcase: trinity version: trinity-i386-4d2343bd-1_20200320 with following parameters:

number: 99999 group: group-03

test-description: Trinity is a linux system call fuzz tester. test-url: http://codemonkey.org.uk/projects/trinity/

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

+-----------------------------------------------------------------------+---------------+------------+ | | next-20210521 | 0c13ca6d7f | +-----------------------------------------------------------------------+---------------+------------+ | WARNING:at_drivers/gpu/drm/drm_fb_helper.c:#drm_fb_helper_damage_work | 0 | 8 | | RIP:drm_fb_helper_damage_work | 0 | 8 | +-----------------------------------------------------------------------+---------------+------------+

If you fix the issue, kindly add following tag Reported-by: kernel test robot oliver.sang@intel.com

[ 23.128870] WARNING: CPU: 1 PID: 20 at drivers/gpu/drm/drm_fb_helper.c:451 drm_fb_helper_damage_work (kbuild/src/consumer/drivers/gpu/drm/drm_fb_helper.c:451) [ 23.128883] Modules linked in: intel_rapl_common crct10dif_pclmul ata_piix bochs_drm(+) crc32_pclmul crc32c_intel libata ghash_clmulni_intel rapl drm_vram_helper drm_ttm_helper joydev ttm i2c_piix4 serio_raw parport_pc parport ip_tables [ 23.128925] CPU: 1 PID: 20 Comm: kworker/1:0 Not tainted 5.13.0-rc2-next-20210521-00001-g0c13ca6d7fba #1 [ 23.128931] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 23.128934] Workqueue: events drm_fb_helper_damage_work [ 23.128941] RIP: 0010:drm_fb_helper_damage_work (kbuild/src/consumer/drivers/gpu/drm/drm_fb_helper.c:451) [ 23.128946] Code: 48 8b 78 08 4c 8b 6f 50 4d 85 ed 75 03 4c 8b 2f e8 70 6e 04 00 44 89 e1 4c 89 ea 48 c7 c7 48 f8 c5 8c 48 89 c6 e8 cb 63 60 00 <0f> 0b e9 02 fe ff ff e8 8f a5 66 00 66 66 2e 0f 1f 84 00 00 00 00 All code ======== 0: 48 8b 78 08 mov 0x8(%rax),%rdi 4: 4c 8b 6f 50 mov 0x50(%rdi),%r13 8: 4d 85 ed test %r13,%r13 b: 75 03 jne 0x10 d: 4c 8b 2f mov (%rdi),%r13 10: e8 70 6e 04 00 callq 0x46e85 15: 44 89 e1 mov %r12d,%ecx 18: 4c 89 ea mov %r13,%rdx 1b: 48 c7 c7 48 f8 c5 8c mov $0xffffffff8cc5f848,%rdi 22: 48 89 c6 mov %rax,%rsi 25: e8 cb 63 60 00 callq 0x6063f5 2a:* 0f 0b ud2 <-- trapping instruction 2c: e9 02 fe ff ff jmpq 0xfffffffffffffe33 31: e8 8f a5 66 00 callq 0x66a5c5 36: 66 data16 37: 66 data16 38: 2e cs 39: 0f .byte 0xf 3a: 1f (bad) 3b: 84 00 test %al,(%rax) 3d: 00 00 add %al,(%rax) ...

Code starting with the faulting instruction =========================================== 0: 0f 0b ud2 2: e9 02 fe ff ff jmpq 0xfffffffffffffe09 7: e8 8f a5 66 00 callq 0x66a59b c: 66 data16 d: 66 data16 e: 2e cs f: 0f .byte 0xf 10: 1f (bad) 11: 84 00 test %al,(%rax) 13: 00 00 add %al,(%rax) ... [ 23.128950] RSP: 0018:ffffb685800b3dd0 EFLAGS: 00010286 [ 23.128956] RAX: 0000000000000000 RBX: ffff98e43286bb40 RCX: 0000000000000000 [ 23.128959] RDX: 0000000000000001 RSI: ffffffff8b6349cf RDI: ffffffff8b6349cf [ 23.128962] RBP: ffff98e46c59a940 R08: 0000000000000001 R09: 0000000000000001 [ 23.128965] R10: ffff98e3c0368000 R11: 0000000000000000 R12: 00000000ffffffea [ 23.128968] R13: ffff98e5057936b0 R14: 0000000000000000 R15: ffff98e46c59a980 [ 23.128972] FS: 0000000000000000(0000) GS:ffff98e6efd00000(0000) knlGS:0000000000000000 [ 23.128976] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 23.128979] CR2: 0000000057d64c80 CR3: 00000002444e4000 CR4: 00000000000406e0 [ 23.128986] Call Trace: [ 23.128990] ? lock_acquire (kbuild/src/consumer/kernel/locking/lockdep.c:438 kbuild/src/consumer/kernel/locking/lockdep.c:5514 kbuild/src/consumer/kernel/locking/lockdep.c:5477) [ 23.129014] process_one_work (kbuild/src/consumer/arch/x86/include/asm/jump_label.h:27 kbuild/src/consumer/include/linux/jump_label.h:212 kbuild/src/consumer/include/trace/events/workqueue.h:108 kbuild/src/consumer/kernel/workqueue.c:2281) [ 23.129035] worker_thread (kbuild/src/consumer/include/linux/list.h:282 kbuild/src/consumer/kernel/workqueue.c:2423) [ 23.129043] ? process_one_work (kbuild/src/consumer/kernel/workqueue.c:2365) [ 23.129053] kthread (kbuild/src/consumer/kernel/kthread.c:319) [ 23.129057] ? set_kthread_struct (kbuild/src/consumer/kernel/kthread.c:272) [ 23.129067] ret_from_fork (kbuild/src/consumer/arch/x86/entry/entry_64.S:300) [ 23.129097] irq event stamp: 4449 [ 23.129099] hardirqs last enabled at (4455): vprintk_emit (kbuild/src/consumer/arch/x86/include/asm/irqflags.h:45 kbuild/src/consumer/arch/x86/include/asm/irqflags.h:80 kbuild/src/consumer/arch/x86/include/asm/irqflags.h:140 kbuild/src/consumer/kernel/printk/printk.c:1877 kbuild/src/consumer/kernel/printk/printk.c:2174) [ 23.129104] hardirqs last disabled at (4460): vprintk_emit (kbuild/src/consumer/kernel/printk/printk.c:1856 kbuild/src/consumer/kernel/printk/printk.c:2174) [ 23.129108] softirqs last enabled at (4080): __do_softirq (kbuild/src/consumer/arch/x86/include/asm/preempt.h:27 kbuild/src/consumer/kernel/softirq.c:403 kbuild/src/consumer/kernel/softirq.c:588) [ 23.129115] softirqs last disabled at (4069): do_softirq (kbuild/src/consumer/kernel/softirq.c:460 kbuild/src/consumer/kernel/softirq.c:447) [ 23.129121] ---[ end trace 8fffdf7b74be67dd ]--- [ 23.169660] Console: switching to colour frame buffer device 128x48 [ 23.204582] ata2.01: NODEV after polling detection [ 23.205027] ata2.00: ATAPI: QEMU DVD-ROM, 2.5+, max UDMA/100 [ 23.209001] scsi 1:0:0:0: CD-ROM QEMU QEMU DVD-ROM 2.5+ PQ: 0 ANSI: 5 [ 23.454548] bochs-drm 0000:00:02.0: [drm] fb0: bochs-drmdrmfb frame buffer device [ 23.558522] ppdev: user-space parallel port driver [ 23.639513] scsi 1:0:0:0: Attached scsi generic sg0 type 5 [ 23.711615] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray [ 23.721907] cdrom: Uniform CD-ROM driver Revision: 3.20 [ 23.760668] sr 1:0:0:0: Attached scsi CD-ROM sr0 [ 32.345935] Kernel tests: Boot OK! [ 32.345967] [ 33.553631] sctp: Hash tables configured (bind 32/56) [ 33.640389] NET: Registered protocol family 8 [ 33.642951] NET: Registered protocol family 20 [ 33.701683] Loading iSCSI transport class v2.0-870. [ 33.907649] can: controller area network core [ 33.910036] NET: Registered protocol family 29 [ 33.927145] can: raw protocol [ 33.945312] can: broadcast manager protocol [ OK ] Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch. [ 34.146981] Bluetooth: Core ver 2.22 [ 34.149139] NET: Registered protocol family 31 [ 34.151387] Bluetooth: HCI device and connection manager initialized [ 34.154110] Bluetooth: HCI socket layer initialized [ 34.157426] Bluetooth: L2CAP socket layer initialized [ 34.160803] Bluetooth: SCO socket layer initialized [ 34.197826] Bluetooth: RFCOMM TTY layer initialized [ 34.200118] Bluetooth: RFCOMM socket layer initialized [ 34.204803] Bluetooth: RFCOMM ver 1.11 [ 34.229098] Bluetooth: BNEP (Ethernet Emulation) ver 1.3 [ 34.231559] Bluetooth: BNEP filters: protocol multicast [ 34.236340] Bluetooth: BNEP socket layer initialized [ 34.262539] Bluetooth: HIDP (Human Interface Emulation) ver 1.2 [ 34.265008] Bluetooth: HIDP socket layer initialized [ 34.417337] VFS: Warning: trinity-c4 using old stat() call. Recompile your binary. [ 34.423816] VFS: Warning: trinity-c4 using old stat() call. Recompile your binary. [ 34.491288] uffd: Set unprivileged_userfaultfd sysctl knob to 1 if kernel faults must be handled without obtaining CAP_SYS_PTRACE capability [ 34.495709] audit: type=1326 audit(1622057165.671:2): auid=4294967295 uid=65534 gid=65534 ses=4294967295 pid=1418 comm="trinity-c0" exe="/bin/trinity" sig=9 arch=40000003 syscall=19 compat=1 ip=0xf7f9a549 code=0x0 [ 34.510047] mmap: trinity-c7 (1425) uses deprecated remap_file_pages() syscall. See Documentation/vm/remap_file_pages.rst. [ 34.516796] VFS: Warning: trinity-c7 using old stat() call. Recompile your binary. [ 34.516841] ptrace attach of "trinity -q -q -l off -s 364045467 -N 99999"[1424] was attempted by "trinity -q -q -l off -s 364045467 -N 99999"[1426] [ 34.544419] VFS: Warning: trinity-c7 using old stat() call. Recompile your binary. [ 34.554532] audit: type=1326 audit(1622057165.731:3): auid=4294967295 uid=65534 gid=65534 ses=4294967295 pid=1426 comm="trinity-c0" exe="/bin/trinity" sig=9 arch=40000003 syscall=19 compat=1 ip=0xf7f9a549 code=0x0 [ 34.594140] VFS: Warning: trinity-c0 using old stat() call. Recompile your binary. [ 34.661794] NET: Registered protocol family 36 [ 34.687566] ptrace attach of "trinity -q -q -l off -s 364045467 -N 99999"[1419] was attempted by "trinity -q -q -l off -s 364045467 -N 99999"[1428] [ 35.512829] audit: type=1326 audit(1622057166.690:4): auid=4294967295 uid=65534 gid=65534 ses=4294967295 pid=1422 comm="trinity-c4" exe="/bin/trinity" sig=9 arch=40000003 syscall=19 compat=1 ip=0xf7f9a549 code=0x0 [ 35.553632] audit: type=1326 audit(1622057166.725:5): auid=4294967295 uid=65534 gid=65534 ses=4294967295 pid=1420 comm="trinity-c2" exe="/bin/trinity" sig=9 arch=40000003 syscall=19 compat=1 ip=0xf7f9a549 code=0x0 [ 35.937104] random: crng init done [ 35.939376] random: 7 urandom warning(s) missed due to ratelimiting [ 35.997121] scsi_nl_rcv_msg: discarding partial skb [ 36.025621] trinity-c5 (1423): attempted to duplicate a private mapping with mremap. This is not supported. [ 36.116632] Guest personality initialized and is inactive [ 36.124783] VMCI host device registered (name=vmci, major=10, minor=125) [ 36.128992] Initialized host personality [ 36.158378] NET: Registered protocol family 40 [ 36.331855] NET: Registered protocol family 15 [ 36.509084] audit: type=1326 audit(1622057167.686:6): auid=4294967295 uid=65534 gid=65534 ses=4294967295 pid=1423 comm="trinity-c5" exe="/bin/trinity" sig=9 arch=40000003 syscall=19 compat=1 ip=0xf7f9a549 code=0x0 [ 36.536011] audit: type=1326 audit(1622057167.713:7): auid=4294967295 uid=65534 gid=65534 ses=4294967295 pid=1482 comm="trinity-c5" exe="/bin/trinity" sig=9 arch=40000003 syscall=19 compat=1 ip=0xf7f9a549 code=0x0 [ 36.776774] audit: type=1326 audit(1622057167.954:8): auid=4294967295 uid=65534 gid=65534 ses=4294967295 pid=1483 comm="trinity-c5" exe="/bin/trinity" sig=9 arch=40000003 syscall=19 compat=1 ip=0xf7f9a549 code=0x0 [ 37.204551] audit: type=1326 audit(1622057168.382:9): auid=4294967295 uid=65534 gid=65534 ses=4294967295 pid=1447 comm="trinity-c2" exe="/bin/trinity" sig=9 arch=40000003 syscall=19 compat=1 ip=0xf7f9a549 code=0x0 [ 37.870830] audit: type=1326 audit(1622057169.048:10): auid=4294967295 uid=65534 gid=65534 ses=4294967295 pid=1495 comm="trinity-c5" exe="/bin/trinity" sig=9 arch=40000003 syscall=19 compat=1 ip=0xf7f9a549 code=0x0 [ 37.933295] audit: type=1326 audit(1622057169.110:11): auid=4294967295 uid=65534 gid=65534 ses=4294967295 pid=1516 comm="trinity-c5" exe="/bin/trinity" sig=9 arch=40000003 syscall=19 compat=1 ip=0xf7f9a549 code=0x0 [ 39.032389] install debs round one: dpkg -i --force-confdef --force-depends /opt/deb/gawk_1%3a4.1.4+dfsg-1_i386.deb [ 39.032434] [ 39.046183] Selecting previously unselected package gawk. [ 39.046203] [ 39.056645] (Reading database ... 16210 files and directories currently installed.) [ 39.056658] [ 39.067990] Preparing to unpack .../gawk_1%3a4.1.4+dfsg-1_i386.deb ... [ 39.068002] [ 39.077038] Unpacking gawk (1:4.1.4+dfsg-1) ... [ 39.077050] [ 39.085224] Setting up gawk (1:4.1.4+dfsg-1) ... [ 39.085237] [ 39.092668] /lkp/lkp/src/bin/run-lkp [ 39.092678] [ 39.650207] kauditd_printk_skb: 1 callbacks suppressed [ 39.650213] audit: type=1326 audit(1622057170.827:13): auid=4294967295 uid=65534 gid=65534 ses=4294967295 pid=1502 comm="trinity-c2" exe="/bin/trinity" sig=9 arch=40000003 syscall=19 compat=1 ip=0xf7f9a549 code=0x0 [ 41.150540] audit: type=1326 audit(1622057172.327:14): auid=4294967295 uid=65534 gid=65534 ses=4294967295 pid=1424 comm="trinity-c6" exe="/bin/trinity" sig=9 arch=40000003 syscall=19 compat=1 ip=0xf7f9a549 code=0x0 [ 41.262543] audit: type=1326 audit(1622057172.439:15): auid=4294967295 uid=65534 gid=65534 ses=4294967295 pid=1545 comm="trinity-c6" exe="/bin/trinity" sig=9 arch=40000003 syscall=19 compat=1 ip=0xf7f9a549 code=0x0 [ 41.925799] RESULT_ROOT=/result/trinity/group-03-99999/vm-snb/debian-i386-20191205.cgz/x86_64-rhel-8.3-kselftests/gcc-9/0c13ca6d7fbaaf4cc0cfccd94f0ab8ca9af9e81a/3 [ 41.925827]

To reproduce:

# build kernel cd linux cp config-5.13.0-rc2-next-20210521-00001-g0c13ca6d7fba .config make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git cd lkp-tests bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email

--- 0DAY/LKP+ Test Infrastructure Open Source Technology Center https://lists.01.org/hyperkitty/list/lkp@lists.01.org Intel Corporation

Thanks, Oliver Sang

Thomas Hellström

20 May 20 May

3:09 p.m.

New subject: [RFC PATCH 2/5] drm, drm/i915: Move the memcpy_from_wc functionality to core drm

Memcpy from wc will be used as well by TTM memcpy. Move it to core drm, and make the interface do the right thing even on !X86.

Cc: Christian König christian.koenig@amd.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Dave Airlie airlied@gmail.com Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com --- drivers/gpu/drm/Makefile | 2 +- drivers/gpu/drm/drm_drv.c | 2 + .../drm/{i915/i915_memcpy.c => drm_memcpy.c} | 31 +++++++------- drivers/gpu/drm/i915/Makefile | 1 - .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 4 +- drivers/gpu/drm/i915/gem/i915_gem_object.c | 5 ++- drivers/gpu/drm/i915/gt/selftest_reset.c | 7 ++-- drivers/gpu/drm/i915/gt/uc/intel_guc_log.c | 11 ++--- drivers/gpu/drm/i915/i915_cmd_parser.c | 4 +- drivers/gpu/drm/i915/i915_drv.c | 2 - drivers/gpu/drm/i915/i915_gpu_error.c | 8 ++-- drivers/gpu/drm/i915/i915_memcpy.h | 34 --------------- .../drm/i915/selftests/intel_memory_region.c | 7 ++-- include/drm/drm_memcpy.h | 41 +++++++++++++++++++ 14 files changed, 83 insertions(+), 76 deletions(-) rename drivers/gpu/drm/{i915/i915_memcpy.c => drm_memcpy.c} (84%) delete mode 100644 drivers/gpu/drm/i915/i915_memcpy.h create mode 100644 include/drm/drm_memcpy.h

diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile index a91cc7684904..f3ab8586c3d7 100644 --- a/drivers/gpu/drm/Makefile +++ b/drivers/gpu/drm/Makefile @@ -18,7 +18,7 @@ drm-y := drm_aperture.o drm_auth.o drm_cache.o \ drm_dumb_buffers.o drm_mode_config.o drm_vblank.o \ drm_syncobj.o drm_lease.o drm_writeback.o drm_client.o \ drm_client_modeset.o drm_atomic_uapi.o drm_hdcp.o \ - drm_managed.o drm_vblank_work.o + drm_managed.o drm_vblank_work.o drm_memcpy.o \

drm-$(CONFIG_DRM_LEGACY) += drm_agpsupport.o drm_bufs.o drm_context.o drm_dma.o \ drm_legacy_misc.o drm_lock.o drm_memory.o drm_scatter.o \ diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c index 3d8d68a98b95..351cc2900cf1 100644 --- a/drivers/gpu/drm/drm_drv.c +++ b/drivers/gpu/drm/drm_drv.c @@ -40,6 +40,7 @@ #include <drm/drm_drv.h> #include <drm/drm_file.h> #include <drm/drm_managed.h> +#include <drm/drm_memcpy.h> #include <drm/drm_mode_object.h> #include <drm/drm_print.h>

@@ -1041,6 +1042,7 @@ static int __init drm_core_init(void)

drm_connector_ida_init(); idr_init(&drm_minors_idr); + drm_memcpy_init_early();

ret = drm_sysfs_init(); if (ret < 0) { diff --git a/drivers/gpu/drm/i915/i915_memcpy.c b/drivers/gpu/drm/drm_memcpy.c similarity index 84% rename from drivers/gpu/drm/i915/i915_memcpy.c rename to drivers/gpu/drm/drm_memcpy.c index 1b021a4902de..03688425a096 100644 --- a/drivers/gpu/drm/i915/i915_memcpy.c +++ b/drivers/gpu/drm/drm_memcpy.c @@ -1,3 +1,4 @@ +// SPDX-License-Identifier: MIT /* * Copyright © 2016 Intel Corporation * @@ -22,16 +23,11 @@ * */

+#ifdef CONFIG_X86 #include <linux/kernel.h> #include <asm/fpu/api.h>

-#include "i915_memcpy.h" - -#if IS_ENABLED(CONFIG_DRM_I915_DEBUG) -#define CI_BUG_ON(expr) BUG_ON(expr) -#else -#define CI_BUG_ON(expr) BUILD_BUG_ON_INVALID(expr) -#endif +#include "drm/drm_memcpy.h"

static DEFINE_STATIC_KEY_FALSE(has_movntdqa);

@@ -94,23 +90,23 @@ static void __memcpy_ntdqu(void *dst, const void *src, unsigned long len) }

/** - * i915_memcpy_from_wc: perform an accelerated *aligned* read from WC + * drm_memcpy_from_wc: perform an accelerated *aligned* read from WC * @dst: destination pointer * @src: source pointer * @len: how many bytes to copy * - * i915_memcpy_from_wc copies @len bytes from @src to @dst using + * drm_memcpy_from_wc copies @len bytes from @src to @dst using * non-temporal instructions where available. Note that all arguments * (@src, @dst) must be aligned to 16 bytes and @len must be a multiple * of 16. * * To test whether accelerated reads from WC are supported, use - * i915_memcpy_from_wc(NULL, NULL, 0); + * drm_memcpy_from_wc(NULL, NULL, 0); * * Returns true if the copy was successful, false if the preconditions * are not met. */ -bool i915_memcpy_from_wc(void *dst, const void *src, unsigned long len) +bool drm_memcpy_from_wc(void *dst, const void *src, unsigned long len) { if (unlikely(((unsigned long)dst | (unsigned long)src | len) & 15)) return false; @@ -123,24 +119,23 @@ bool i915_memcpy_from_wc(void *dst, const void *src, unsigned long len)

return false; } +EXPORT_SYMBOL(drm_memcpy_from_wc);

/** - * i915_unaligned_memcpy_from_wc: perform a mostly accelerated read from WC + * drm_unaligned_memcpy_from_wc: perform a mostly accelerated read from WC * @dst: destination pointer * @src: source pointer * @len: how many bytes to copy * - * Like i915_memcpy_from_wc(), the unaligned variant copies @len bytes from + * Like drm_memcpy_from_wc(), the unaligned variant copies @len bytes from * @src to @dst using * non-temporal instructions where available, but * accepts that its arguments may not be aligned, but are valid for the * potential 16-byte read past the end. */ -void i915_unaligned_memcpy_from_wc(void *dst, const void *src, unsigned long len) +void drm_unaligned_memcpy_from_wc(void *dst, const void *src, unsigned long len) { unsigned long addr;

- CI_BUG_ON(!i915_has_memcpy_from_wc()); - addr = (unsigned long)src; if (!IS_ALIGNED(addr, 16)) { unsigned long x = min(ALIGN(addr, 16) - addr, len); @@ -155,8 +150,9 @@ void i915_unaligned_memcpy_from_wc(void *dst, const void *src, unsigned long len if (likely(len)) __memcpy_ntdqu(dst, src, DIV_ROUND_UP(len, 16)); } +EXPORT_SYMBOL(drm_unaligned_memcpy_from_wc);

-void i915_memcpy_init_early(struct drm_i915_private *dev_priv) +void drm_memcpy_init_early(void) { /* * Some hypervisors (e.g. KVM) don't support VEX-prefix instructions @@ -166,3 +162,4 @@ void i915_memcpy_init_early(struct drm_i915_private *dev_priv) !boot_cpu_has(X86_FEATURE_HYPERVISOR)) static_branch_enable(&has_movntdqa); } +#endif diff --git a/drivers/gpu/drm/i915/Makefile b/drivers/gpu/drm/i915/Makefile index cb8823570996..998606b7f49f 100644 --- a/drivers/gpu/drm/i915/Makefile +++ b/drivers/gpu/drm/i915/Makefile @@ -61,7 +61,6 @@ i915-y += i915_drv.o \ # core library code i915-y += \ dma_resv_utils.o \ - i915_memcpy.o \ i915_mm.o \ i915_sw_fence.o \ i915_sw_fence_work.o \ diff --git a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c index 297143511f99..77285e421fb8 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c @@ -10,6 +10,7 @@ #include <linux/uaccess.h>

#include <drm/drm_syncobj.h> +#include <drm/drm_memcpy.h>

#include "display/intel_frontbuffer.h"

@@ -28,7 +29,6 @@ #include "i915_sw_fence_work.h" #include "i915_trace.h" #include "i915_user_extensions.h" -#include "i915_memcpy.h"

struct eb_vma { struct i915_vma *vma; @@ -2503,7 +2503,7 @@ static int eb_parse_pipeline(struct i915_execbuffer *eb, !(batch->cache_coherent & I915_BO_CACHE_COHERENT_FOR_READ);

pw->batch_map = ERR_PTR(-ENODEV); - if (needs_clflush && i915_has_memcpy_from_wc()) + if (needs_clflush && drm_has_memcpy_from_wc()) pw->batch_map = i915_gem_object_pin_map(batch, I915_MAP_WC);

if (IS_ERR(pw->batch_map)) { diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c index c8953e3f5c70..f1eb857fa172 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_object.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c @@ -24,6 +24,8 @@

#include <linux/sched/mm.h>

+#include <drm/drm_memcpy.h> + #include "display/intel_frontbuffer.h" #include "i915_drv.h" #include "i915_gem_clflush.h" @@ -31,7 +33,6 @@ #include "i915_gem_mman.h" #include "i915_gem_object.h" #include "i915_globals.h" -#include "i915_memcpy.h" #include "i915_trace.h"

static struct i915_global_object { @@ -374,7 +375,7 @@ i915_gem_object_read_from_page_iomap(struct drm_i915_gem_object *obj, u64 offset PAGE_SIZE);

src_ptr = src_map + offset_in_page(offset); - if (!i915_memcpy_from_wc(dst, (void __force *)src_ptr, size)) + if (!drm_memcpy_from_wc(dst, (void __force *)src_ptr, size)) memcpy_fromio(dst, src_ptr, size);

io_mapping_unmap(src_map); diff --git a/drivers/gpu/drm/i915/gt/selftest_reset.c b/drivers/gpu/drm/i915/gt/selftest_reset.c index 8784257ec808..92ada67a3835 100644 --- a/drivers/gpu/drm/i915/gt/selftest_reset.c +++ b/drivers/gpu/drm/i915/gt/selftest_reset.c @@ -5,9 +5,10 @@

#include <linux/crc32.h>

+#include <drm/drm_memcpy.h> + #include "gem/i915_gem_stolen.h"

-#include "i915_memcpy.h" #include "i915_selftest.h" #include "intel_gpu_commands.h" #include "selftests/igt_reset.h" @@ -99,7 +100,7 @@ __igt_reset_stolen(struct intel_gt *gt, memset_io(s, STACK_MAGIC, PAGE_SIZE);

in = (void __force *)s; - if (i915_memcpy_from_wc(tmp, in, PAGE_SIZE)) + if (drm_memcpy_from_wc(tmp, in, PAGE_SIZE)) in = tmp; crc[page] = crc32_le(0, in, PAGE_SIZE);

@@ -135,7 +136,7 @@ __igt_reset_stolen(struct intel_gt *gt, PAGE_SIZE);

in = (void __force *)s; - if (i915_memcpy_from_wc(tmp, in, PAGE_SIZE)) + if (drm_memcpy_from_wc(tmp, in, PAGE_SIZE)) in = tmp; x = crc32_le(0, in, PAGE_SIZE);

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c index c36d5eb5bbb9..f045e42be6ca 100644 --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_log.c @@ -5,9 +5,10 @@

#include <linux/debugfs.h>

+#include <drm/drm_memcpy.h> + #include "gt/intel_gt.h" #include "i915_drv.h" -#include "i915_memcpy.h" #include "intel_guc_log.h"

static void guc_log_capture_logs(struct intel_guc_log *log); @@ -295,13 +296,13 @@ static void guc_read_update_log_buffer(struct intel_guc_log *log)

/* Just copy the newly written data */ if (read_offset > write_offset) { - i915_memcpy_from_wc(dst_data, src_data, write_offset); + drm_memcpy_from_wc(dst_data, src_data, write_offset); bytes_to_copy = buffer_size - read_offset; } else { bytes_to_copy = write_offset - read_offset; } - i915_memcpy_from_wc(dst_data + read_offset, - src_data + read_offset, bytes_to_copy); + drm_memcpy_from_wc(dst_data + read_offset, + src_data + read_offset, bytes_to_copy);

src_data += buffer_size; dst_data += buffer_size; @@ -569,7 +570,7 @@ int intel_guc_log_relay_open(struct intel_guc_log *log) * it should be present on the chipsets supporting GuC based * submisssions. */ - if (!i915_has_memcpy_from_wc()) { + if (!drm_has_memcpy_from_wc()) { ret = -ENXIO; goto out_unlock; } diff --git a/drivers/gpu/drm/i915/i915_cmd_parser.c b/drivers/gpu/drm/i915/i915_cmd_parser.c index 5b4b2bd46e7c..98653f1a2b1d 100644 --- a/drivers/gpu/drm/i915/i915_cmd_parser.c +++ b/drivers/gpu/drm/i915/i915_cmd_parser.c @@ -24,12 +24,12 @@ * Brad Volkin bradley.d.volkin@intel.com * */ +#include <drm/drm_memcpy.h>

#include "gt/intel_engine.h" #include "gt/intel_gpu_commands.h"

#include "i915_drv.h" -#include "i915_memcpy.h"

/** * DOC: batch buffer command parser @@ -1152,7 +1152,7 @@ static u32 *copy_batch(struct drm_i915_gem_object *dst_obj,

if (src) { GEM_BUG_ON(!needs_clflush); - i915_unaligned_memcpy_from_wc(dst, src + offset, length); + drm_unaligned_memcpy_from_wc(dst, src + offset, length); } else { struct scatterlist *sg; void *ptr; diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index 122dd297b6af..0df9dd62c717 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -72,7 +72,6 @@ #include "i915_drv.h" #include "i915_ioc32.h" #include "i915_irq.h" -#include "i915_memcpy.h" #include "i915_perf.h" #include "i915_query.h" #include "i915_suspend.h" @@ -325,7 +324,6 @@ static int i915_driver_early_probe(struct drm_i915_private *dev_priv) mutex_init(&dev_priv->pps_mutex); mutex_init(&dev_priv->hdcp_comp_mutex);

- i915_memcpy_init_early(dev_priv); intel_runtime_pm_init_early(&dev_priv->runtime_pm);

ret = i915_workqueues_init(dev_priv); diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c index 99ca242ec13b..ee11920fbea5 100644 --- a/drivers/gpu/drm/i915/i915_gpu_error.c +++ b/drivers/gpu/drm/i915/i915_gpu_error.c @@ -34,6 +34,7 @@ #include <linux/utsname.h> #include <linux/zlib.h>

+#include <drm/drm_memcpy.h> #include <drm/drm_print.h>

#include "display/intel_csr.h" @@ -46,7 +47,6 @@

#include "i915_drv.h" #include "i915_gpu_error.h" -#include "i915_memcpy.h" #include "i915_scatterlist.h"

#define ALLOW_FAIL (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN) @@ -255,7 +255,7 @@ static bool compress_init(struct i915_vma_compress *c) }

c->tmp = NULL; - if (i915_has_memcpy_from_wc()) + if (drm_has_memcpy_from_wc()) c->tmp = pool_alloc(&c->pool, ALLOW_FAIL);

return true; @@ -295,7 +295,7 @@ static int compress_page(struct i915_vma_compress *c, struct z_stream_s *zstream = &c->zstream;

zstream->next_in = src; - if (wc && c->tmp && i915_memcpy_from_wc(c->tmp, src, PAGE_SIZE)) + if (wc && c->tmp && drm_memcpy_from_wc(c->tmp, src, PAGE_SIZE)) zstream->next_in = c->tmp; zstream->avail_in = PAGE_SIZE;

@@ -395,7 +395,7 @@ static int compress_page(struct i915_vma_compress *c, if (!ptr) return -ENOMEM;

- if (!(wc && i915_memcpy_from_wc(ptr, src, PAGE_SIZE))) + if (!(wc && drm_memcpy_from_wc(ptr, src, PAGE_SIZE))) memcpy(ptr, src, PAGE_SIZE); dst->pages[dst->page_count++] = ptr; cond_resched(); diff --git a/drivers/gpu/drm/i915/i915_memcpy.h b/drivers/gpu/drm/i915/i915_memcpy.h deleted file mode 100644 index 3df063a3293b..000000000000 --- a/drivers/gpu/drm/i915/i915_memcpy.h +++ /dev/null @@ -1,34 +0,0 @@ -/* SPDX-License-Identifier: MIT */ -/* - * Copyright © 2019 Intel Corporation - */ - -#ifndef __I915_MEMCPY_H__ -#define __I915_MEMCPY_H__ - -#include <linux/types.h> - -struct drm_i915_private; - -void i915_memcpy_init_early(struct drm_i915_private *i915); - -bool i915_memcpy_from_wc(void *dst, const void *src, unsigned long len); -void i915_unaligned_memcpy_from_wc(void *dst, const void *src, unsigned long len); - -/* The movntdqa instructions used for memcpy-from-wc require 16-byte alignment, - * as well as SSE4.1 support. i915_memcpy_from_wc() will report if it cannot - * perform the operation. To check beforehand, pass in the parameters to - * to i915_can_memcpy_from_wc() - since we only care about the low 4 bits, - * you only need to pass in the minor offsets, page-aligned pointers are - * always valid. - * - * For just checking for SSE4.1, in the foreknowledge that the future use - * will be correctly aligned, just use i915_has_memcpy_from_wc(). - */ -#define i915_can_memcpy_from_wc(dst, src, len) \ - i915_memcpy_from_wc((void *)((unsigned long)(dst) | (unsigned long)(src) | (len)), NULL, 0) - -#define i915_has_memcpy_from_wc() \ - i915_memcpy_from_wc(NULL, NULL, 0) - -#endif /* __I915_MEMCPY_H__ */ diff --git a/drivers/gpu/drm/i915/selftests/intel_memory_region.c b/drivers/gpu/drm/i915/selftests/intel_memory_region.c index c85d516b85cd..6bb399e9be78 100644 --- a/drivers/gpu/drm/i915/selftests/intel_memory_region.c +++ b/drivers/gpu/drm/i915/selftests/intel_memory_region.c @@ -6,6 +6,8 @@ #include <linux/prime_numbers.h> #include <linux/sort.h>

+#include <drm/drm_memcpy.h> + #include "../i915_selftest.h"

#include "mock_drm.h" @@ -20,7 +22,6 @@ #include "gem/selftests/mock_context.h" #include "gt/intel_engine_user.h" #include "gt/intel_gt.h" -#include "i915_memcpy.h" #include "selftests/igt_flush_test.h" #include "selftests/i915_random.h"

@@ -901,7 +902,7 @@ static inline void igt_memcpy(void *dst, const void *src, size_t size)

static inline void igt_memcpy_from_wc(void *dst, const void *src, size_t size) { - i915_memcpy_from_wc(dst, src, size); + drm_memcpy_from_wc(dst, src, size); }

static int _perf_memcpy(struct intel_memory_region *src_mr, @@ -925,7 +926,7 @@ static int _perf_memcpy(struct intel_memory_region *src_mr, { "memcpy_from_wc", igt_memcpy_from_wc, - !i915_has_memcpy_from_wc(), + !drm_has_memcpy_from_wc(), }, }; struct drm_i915_gem_object *src, *dst; diff --git a/include/drm/drm_memcpy.h b/include/drm/drm_memcpy.h new file mode 100644 index 000000000000..dbf52cd43382 --- /dev/null +++ b/include/drm/drm_memcpy.h @@ -0,0 +1,41 @@ +/* SPDX-License-Identifier: MIT */ +/* + * Copyright © 2019 Intel Corporation + */ + +#ifndef __DRM_MEMCPY_H__ +#define __DRM_MEMCPY_H__ + +#include <linux/types.h> + +#ifdef CONFIG_X86 +bool drm_memcpy_from_wc(void *dst, const void *src, unsigned long len); +void drm_unaligned_memcpy_from_wc(void *dst, const void *src, unsigned long len); + +/* The movntdqa instructions used for memcpy-from-wc require 16-byte alignment, + * as well as SSE4.1 support. drm_memcpy_from_wc() will report if it cannot + * perform the operation. To check beforehand, pass in the parameters to + * drm_can_memcpy_from_wc() - since we only care about the low 4 bits, + * you only need to pass in the minor offsets, page-aligned pointers are + * always valid. + * + * For just checking for SSE4.1, in the foreknowledge that the future use + * will be correctly aligned, just use drm_has_memcpy_from_wc(). + */ +#define drm_can_memcpy_from_wc(dst, src, len) \ + drm_memcpy_from_wc((void *)((unsigned long)(dst) | (unsigned long)(src) | (len)), NULL, 0) + +#define drm_has_memcpy_from_wc() \ + drm_memcpy_from_wc(NULL, NULL, 0) + +void drm_memcpy_init_early(void); + +#else + +#define drm_memcpy_from_wc(_dst, _src, _len) (false) +#define drm_can_memcpy_from_wc(_dst, _src, _len) (false) +#define drm_has_memcpy_from_wc() (false) +#define drm_unaligned_memcpy_from_wc(_dst, _src, _len) WARN_ON(1) +#define drm_memcpy_init_early() do {} while (0) +#endif /* CONFIG_X86 */ +#endif /* __DRM_MEMCPY_H__ */

-- 2.31.1

Thomas Hellström

3:09 p.m.

New subject: [RFC PATCH 3/5] drm/ttm: Use drm_memcpy_from_wc for TTM bo moves

Use fast wc memcpy for reading out of wc memory for TTM bo moves.

Cc: Dave Airlie airlied@gmail.com Cc: Christian König christian.koenig@amd.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com --- drivers/gpu/drm/ttm/ttm_bo_util.c | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index bad9b16e96ba..919ee03f7eb3 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -31,6 +31,7 @@

#include <drm/ttm/ttm_bo_driver.h> #include <drm/ttm/ttm_placement.h> +#include <drm/drm_memcpy.h> #include <drm/drm_vma_manager.h> #include <linux/dma-buf-map.h> #include <linux/io.h> @@ -185,6 +186,7 @@ void ttm_move_memcpy(struct ttm_buffer_object *bo, struct ttm_resource *old_mem = &bo->mem; struct ttm_resource_manager *old_man = ttm_manager_type(bdev, old_mem->mem_type); struct dma_buf_map old_map, new_map; + bool wc_memcpy; pgoff_t i;

/* Single TTM move. NOP */ @@ -208,11 +210,25 @@ void ttm_move_memcpy(struct ttm_buffer_object *bo, return; }

+ wc_memcpy = ((!old_man->use_tt || bo->ttm->caching != ttm_cached) && + drm_has_memcpy_from_wc()); + + /* + * We use some nasty aliasing for drm_memcpy_from_wc, but assuming + * that we can move to memremapping in the not too distant future, + * reduce the fragility for now with a build assert. + */ + BUILD_BUG_ON(offsetof(typeof(old_map), vaddr) != + offsetof(typeof(old_map), vaddr_iomem)); + for (i = 0; i < new_mem->num_pages; ++i) { new_iter->ops->kmap_local(new_iter, &new_map, i); old_iter->ops->kmap_local(old_iter, &old_map, i);

- if (!old_map.is_iomem && !new_map.is_iomem) { + if (wc_memcpy) { + drm_memcpy_from_wc(new_map.vaddr, old_map.vaddr, + PAGE_SIZE); + } else if (!old_map.is_iomem && !new_map.is_iomem) { memcpy(new_map.vaddr, old_map.vaddr, PAGE_SIZE); } else if (!old_map.is_iomem) { dma_buf_map_memcpy_to(&new_map, old_map.vaddr,

-- 2.31.1

Christian König

21 May 21 May

8:10 a.m.

New subject: [RFC PATCH 3/5] drm/ttm: Use drm_memcpy_from_wc for TTM bo moves

Am 20.05.21 um 17:09 schrieb Thomas Hellström:

...

Use fast wc memcpy for reading out of wc memory for TTM bo moves.

Cc: Dave Airlie airlied@gmail.com Cc: Christian König christian.koenig@amd.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com

Oh, yes I really wanted to have that in TTM for quite some time.

But I'm wondering if we shouldn't fix the memremap stuff first.

Christian.

...

drivers/gpu/drm/ttm/ttm_bo_util.c | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index bad9b16e96ba..919ee03f7eb3 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -31,6 +31,7 @@

#include <drm/ttm/ttm_bo_driver.h> #include <drm/ttm/ttm_placement.h> +#include <drm/drm_memcpy.h> #include <drm/drm_vma_manager.h> #include <linux/dma-buf-map.h> #include <linux/io.h> @@ -185,6 +186,7 @@ void ttm_move_memcpy(struct ttm_buffer_object *bo, struct ttm_resource *old_mem = &bo->mem; struct ttm_resource_manager *old_man = ttm_manager_type(bdev, old_mem->mem_type); struct dma_buf_map old_map, new_map;

bool wc_memcpy; pgoff_t i;

/* Single TTM move. NOP */

@@ -208,11 +210,25 @@ void ttm_move_memcpy(struct ttm_buffer_object *bo, return; }
wc_memcpy = ((!old_man->use_tt || bo->ttm->caching != ttm_cached) &&
     drm_has_memcpy_from_wc());
/*
* We use some nasty aliasing for drm_memcpy_from_wc, but assuming
* that we can move to memremapping in the not too distant future,
* reduce the fragility for now with a build assert.
*/
BUILD_BUG_ON(offsetof(typeof(old_map), vaddr) !=
     offsetof(typeof(old_map), vaddr_iomem));
for (i = 0; i < new_mem->num_pages; ++i) { new_iter->ops->kmap_local(new_iter, &new_map, i); old_iter->ops->kmap_local(old_iter, &old_map, i);
if (!old_map.is_iomem && !new_map.is_iomem) {
if (wc_memcpy) {
	drm_memcpy_from_wc(new_map.vaddr, old_map.vaddr,
			   PAGE_SIZE);
} else if (!old_map.is_iomem && !new_map.is_iomem) {
memcpy(new_map.vaddr, old_map.vaddr, PAGE_SIZE);
} else if (!old_map.is_iomem) { dma_buf_map_memcpy_to(&new_map, old_map.vaddr,

Thomas Hellström

8:30 a.m.

New subject: [RFC PATCH 3/5] drm/ttm: Use drm_memcpy_from_wc for TTM bo moves

On 5/21/21 10:10 AM, Christian König wrote:

...

Am 20.05.21 um 17:09 schrieb Thomas Hellström:

...
Use fast wc memcpy for reading out of wc memory for TTM bo moves.

Cc: Dave Airlie airlied@gmail.com Cc: Christian König christian.koenig@amd.com Cc: Daniel Vetter daniel.vetter@ffwll.ch Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com

Oh, yes I really wanted to have that in TTM for quite some time.

We should use it for swap copy from WC as well IMO. A todo-task for somebody.

...

But I'm wondering if we shouldn't fix the memremap stuff first.

Using memremap all over is a fairly big change probably with lots of opinions involved all over the place. What I can do for now is to add a dma_buf_map interface to the memcpy itself, to move the aliasing out of TTM to the arch specific code that knows what it's doing?

/Thomas

...

Christian.

...

drivers/gpu/drm/ttm/ttm_bo_util.c | 18 +++++++++++++++++- 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index bad9b16e96ba..919ee03f7eb3 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -31,6 +31,7 @@ #include <drm/ttm/ttm_bo_driver.h> #include <drm/ttm/ttm_placement.h> +#include <drm/drm_memcpy.h> #include <drm/drm_vma_manager.h> #include <linux/dma-buf-map.h> #include <linux/io.h> @@ -185,6 +186,7 @@ void ttm_move_memcpy(struct ttm_buffer_object *bo,       struct ttm_resource *old_mem = &bo->mem;       struct ttm_resource_manager *old_man = ttm_manager_type(bdev, old_mem->mem_type);       struct dma_buf_map old_map, new_map; +    bool wc_memcpy;       pgoff_t i;       /* Single TTM move. NOP */ @@ -208,11 +210,25 @@ void ttm_move_memcpy(struct ttm_buffer_object *bo,           return;       } +    wc_memcpy = ((!old_man->use_tt || bo->ttm->caching != ttm_cached) && +             drm_has_memcpy_from_wc());

+    /* +     * We use some nasty aliasing for drm_memcpy_from_wc, but assuming +     * that we can move to memremapping in the not too distant future, +     * reduce the fragility for now with a build assert. +     */ +    BUILD_BUG_ON(offsetof(typeof(old_map), vaddr) != +             offsetof(typeof(old_map), vaddr_iomem));

for (i = 0; i < new_mem->num_pages; ++i) {           new_iter->ops->kmap_local(new_iter, &new_map, i);           old_iter->ops->kmap_local(old_iter, &old_map, i); -        if (!old_map.is_iomem && !new_map.is_iomem) { +        if (wc_memcpy) { +            drm_memcpy_from_wc(new_map.vaddr, old_map.vaddr, +                       PAGE_SIZE); +        } else if (!old_map.is_iomem && !new_map.is_iomem) {               memcpy(new_map.vaddr, old_map.vaddr, PAGE_SIZE);           } else if (!old_map.is_iomem) {               dma_buf_map_memcpy_to(&new_map, old_map.vaddr,

Thomas Hellström

20 May 20 May

3:09 p.m.

New subject: [RFC PATCH 4/5] drm/ttm: Document and optimize ttm_bo_pipeline_gutting()

If the bo is idle when calling ttm_bo_pipeline_gutting(), we unnecessarily create a ghost object and push it out to delayed destroy. Fix this by adding a path for idle, and document the function.

Also avoid having the bo end up in a bad state vulnerable to user-space triggered kernel BUGs if the call to ttm_tt_create() fails.

Finally reuse ttm_bo_pipeline_gutting() in ttm_bo_evict().

Cc: Christian König christian.koenig@amd.com Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com --- drivers/gpu/drm/ttm/ttm_bo.c | 20 +++++----- drivers/gpu/drm/ttm/ttm_bo_util.c | 63 ++++++++++++++++++++++++------- drivers/gpu/drm/ttm/ttm_tt.c | 5 +++ include/drm/ttm/ttm_tt.h | 10 +++++ 4 files changed, 75 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index ca1b098b6a56..a8fa3375b8aa 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -501,10 +501,15 @@ static int ttm_bo_evict(struct ttm_buffer_object *bo, bdev->funcs->evict_flags(bo, &placement);

if (!placement.num_placement && !placement.num_busy_placement) { - ttm_bo_wait(bo, false, false); + ret = ttm_bo_wait(bo, true, false); + if (ret) + return ret;

- ttm_bo_cleanup_memtype_use(bo); - return ttm_tt_create(bo, false); + /* + * Since we've already synced, this frees backing store + * immediately. + */ + return ttm_bo_pipeline_gutting(bo); }

ret = ttm_bo_mem_space(bo, &placement, &evict_mem, ctx); @@ -974,13 +979,8 @@ int ttm_bo_validate(struct ttm_buffer_object *bo, /* * Remove the backing store if no placement is given. */ - if (!placement->num_placement && !placement->num_busy_placement) { - ret = ttm_bo_pipeline_gutting(bo); - if (ret) - return ret; - - return ttm_tt_create(bo, false); - } + if (!placement->num_placement && !placement->num_busy_placement) + return ttm_bo_pipeline_gutting(bo);

/* * Check whether we need to move buffer. diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index 919ee03f7eb3..1860e2e7563f 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -479,7 +479,8 @@ static void ttm_transfered_destroy(struct ttm_buffer_object *bo) */

static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo, - struct ttm_buffer_object **new_obj) + struct ttm_buffer_object **new_obj, + bool realloc_tt) { struct ttm_transfer_obj *fbo; int ret; @@ -493,6 +494,17 @@ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo, ttm_bo_get(bo); fbo->bo = bo;

+ if (realloc_tt) { + bo->ttm = NULL; + ret = ttm_tt_create(bo, true); + if (ret) { + bo->ttm = fbo->base.ttm; + kfree(fbo); + ttm_bo_put(bo); + return ret; + } + } + /** * Fix up members that we shouldn't copy directly: * TODO: Explicit member copy would probably be better here. @@ -763,7 +775,7 @@ static int ttm_bo_move_to_ghost(struct ttm_buffer_object *bo, dma_fence_put(bo->moving); bo->moving = dma_fence_get(fence);

- ret = ttm_buffer_object_transfer(bo, &ghost_obj); + ret = ttm_buffer_object_transfer(bo, &ghost_obj, false); if (ret) return ret;

@@ -836,26 +848,51 @@ int ttm_bo_move_accel_cleanup(struct ttm_buffer_object *bo, } EXPORT_SYMBOL(ttm_bo_move_accel_cleanup);

+/** + * ttm_bo_pipeline_gutting - purge the contents of a bo + * @bo: The buffer object + * + * Purge the contents of a bo, async if the bo is not idle. + * After a successful call, the bo is left unpopulated in + * system placement. The function may wait uninterruptible + * for idle on OOM. + * + * Return: 0 if successful, negative error code on failure. + */ int ttm_bo_pipeline_gutting(struct ttm_buffer_object *bo) { static const struct ttm_place sys_mem = { .mem_type = TTM_PL_SYSTEM }; struct ttm_buffer_object *ghost; int ret;

- ret = ttm_buffer_object_transfer(bo, &ghost); - if (ret) - return ret; + /* If already idle, no need for ghost object dance. */ + ret = ttm_bo_wait(bo, false, true); + if (ret == -EBUSY) { + ret = ttm_buffer_object_transfer(bo, &ghost, true); + if (ret) + return ret;

- ret = dma_resv_copy_fences(&ghost->base._resv, bo->base.resv); - /* Last resort, wait for the BO to be idle when we are OOM */ - if (ret) - ttm_bo_wait(bo, false, false); + ret = dma_resv_copy_fences(&ghost->base._resv, bo->base.resv); + /* Last resort, wait for the BO to be idle when we are OOM */ + if (ret) + ttm_bo_wait(bo, false, false);

- ttm_resource_alloc(bo, &sys_mem, &bo->mem); - bo->ttm = NULL; + dma_resv_unlock(&ghost->base._resv); + ttm_bo_put(ghost); + } else { + if (!bo->ttm) { + ret = ttm_tt_create(bo, true); + if (ret) + return ret; + } else { + ttm_tt_unpopulate(bo->bdev, bo->ttm); + if (bo->type == ttm_bo_type_device) + ttm_tt_mark_for_clear(bo->ttm); + } + ttm_resource_free(bo, &bo->mem); + }

- dma_resv_unlock(&ghost->base._resv); - ttm_bo_put(ghost); + ttm_resource_alloc(bo, &sys_mem, &bo->mem);

return 0; } diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c index 539e0232cb3b..0b1053e93db2 100644 --- a/drivers/gpu/drm/ttm/ttm_tt.c +++ b/drivers/gpu/drm/ttm/ttm_tt.c @@ -134,6 +134,11 @@ void ttm_tt_destroy_common(struct ttm_device *bdev, struct ttm_tt *ttm) } EXPORT_SYMBOL(ttm_tt_destroy_common);

+void ttm_tt_mark_for_clear(struct ttm_tt *ttm) +{ + ttm->page_flags |= TTM_PAGE_FLAG_ZERO_ALLOC; +} + void ttm_tt_destroy(struct ttm_device *bdev, struct ttm_tt *ttm) { bdev->funcs->ttm_tt_destroy(bdev, ttm); diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h index 134d09ef7766..91552c83ac79 100644 --- a/include/drm/ttm/ttm_tt.h +++ b/include/drm/ttm/ttm_tt.h @@ -157,6 +157,16 @@ int ttm_tt_populate(struct ttm_device *bdev, struct ttm_tt *ttm, struct ttm_oper */ void ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm);

+/** + * ttm_tt_mark_for_clear - Mark pages for clearing on populate. + * + * @ttm: Pointer to the ttm_tt structure + * + * Marks pages for clearing so that the next time the page vector is + * populated, the pages will be cleared. + */ +void ttm_tt_mark_for_clear(struct ttm_tt *ttm); + void ttm_tt_mgr_init(unsigned long num_pages, unsigned long num_dma32_pages);

#if IS_ENABLED(CONFIG_AGP)

-- 2.31.1

Christian König

21 May 21 May

8:21 a.m.

New subject: [RFC PATCH 4/5] drm/ttm: Document and optimize ttm_bo_pipeline_gutting()

Am 20.05.21 um 17:09 schrieb Thomas Hellström:

...

If the bo is idle when calling ttm_bo_pipeline_gutting(), we unnecessarily create a ghost object and push it out to delayed destroy. Fix this by adding a path for idle, and document the function.

Also avoid having the bo end up in a bad state vulnerable to user-space triggered kernel BUGs if the call to ttm_tt_create() fails.

Finally reuse ttm_bo_pipeline_gutting() in ttm_bo_evict().

Cc: Christian König christian.koenig@amd.com Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com

drivers/gpu/drm/ttm/ttm_bo.c | 20 +++++----- drivers/gpu/drm/ttm/ttm_bo_util.c | 63 ++++++++++++++++++++++++------- drivers/gpu/drm/ttm/ttm_tt.c | 5 +++ include/drm/ttm/ttm_tt.h | 10 +++++ 4 files changed, 75 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index ca1b098b6a56..a8fa3375b8aa 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -501,10 +501,15 @@ static int ttm_bo_evict(struct ttm_buffer_object *bo, bdev->funcs->evict_flags(bo, &placement);

if (!placement.num_placement && !placement.num_busy_placement) {
ttm_bo_wait(bo, false, false);
ret = ttm_bo_wait(bo, true, false);
if (ret)
	return ret;
ttm_bo_cleanup_memtype_use(bo);
return ttm_tt_create(bo, false);
/*
 * Since we've already synced, this frees backing store
 * immediately.
 */
return ttm_bo_pipeline_gutting(bo);

Yeah, we tried to avoid pipeline_gutting here because of eviction. But I think when you wait before that should work.

...

}

ret = ttm_bo_mem_space(bo, &placement, &evict_mem, ctx); @@ -974,13 +979,8 @@ int ttm_bo_validate(struct ttm_buffer_object *bo, /* * Remove the backing store if no placement is given. */
if (!placement->num_placement && !placement->num_busy_placement) {
ret = ttm_bo_pipeline_gutting(bo);
if (ret)
	return ret;
return ttm_tt_create(bo, false);
}
if (!placement->num_placement && !placement->num_busy_placement)
return ttm_bo_pipeline_gutting(bo);
/*

Check whether we need to move buffer.
diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index 919ee03f7eb3..1860e2e7563f 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -479,7 +479,8 @@ static void ttm_transfered_destroy(struct ttm_buffer_object *bo) */

static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo,
		      struct ttm_buffer_object **new_obj)
		      struct ttm_buffer_object **new_obj,
		      bool realloc_tt)
{ struct ttm_transfer_obj *fbo; int ret;
@@ -493,6 +494,17 @@ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo, ttm_bo_get(bo); fbo->bo = bo;
if (realloc_tt) {
bo->ttm = NULL;
ret = ttm_tt_create(bo, true);
if (ret) {
	bo->ttm = fbo->base.ttm;
	kfree(fbo);
	ttm_bo_put(bo);
	return ret;
}
}

Can't we keep that logic in the caller? I think that would be cleaner.

...

/** * Fix up members that we shouldn't copy directly: * TODO: Explicit member copy would probably be better here. @@ -763,7 +775,7 @@ static int ttm_bo_move_to_ghost(struct ttm_buffer_object *bo, dma_fence_put(bo->moving); bo->moving = dma_fence_get(fence);

ret = ttm_buffer_object_transfer(bo, &ghost_obj);

ret = ttm_buffer_object_transfer(bo, &ghost_obj, false); if (ret) return ret;

@@ -836,26 +848,51 @@ int ttm_bo_move_accel_cleanup(struct ttm_buffer_object *bo, } EXPORT_SYMBOL(ttm_bo_move_accel_cleanup);

+/**

ttm_bo_pipeline_gutting - purge the contents of a bo

@bo: The buffer object

Purge the contents of a bo, async if the bo is not idle.

After a successful call, the bo is left unpopulated in

system placement. The function may wait uninterruptible

for idle on OOM.

Return: 0 if successful, negative error code on failure.

*/ int ttm_bo_pipeline_gutting(struct ttm_buffer_object *bo) { static const struct ttm_place sys_mem = { .mem_type = TTM_PL_SYSTEM }; struct ttm_buffer_object *ghost; int ret;
ret = ttm_buffer_object_transfer(bo, &ghost);

if (ret)
return ret;
/* If already idle, no need for ghost object dance. */

ret = ttm_bo_wait(bo, false, true);

if (ret == -EBUSY) {
ret = ttm_buffer_object_transfer(bo, &ghost, true);
if (ret)
	return ret;

When this is a shortcout to avoid work we should rather use the inverse notation.

In other words something like that:

if (ret != -EBUSY) { ttm_resource_free(bo, &bo->mem); ttm_resource_alloc(bo, &sys_mem, &bo->mem); ttm_tt_create()... return ret; }

...

ret = dma_resv_copy_fences(&ghost->base._resv, bo->base.resv);
/* Last resort, wait for the BO to be idle when we are OOM */
if (ret)
```
ttm_bo_wait(bo, false, false);
```

ret = dma_resv_copy_fences(&ghost->base._resv, bo->base.resv);

/* Last resort, wait for the BO to be idle when we are OOM */

```
if (ret)
```
```
	ttm_bo_wait(bo, false, false);
```

ttm_resource_alloc(bo, &sys_mem, &bo->mem);
bo->ttm = NULL;

```
dma_resv_unlock(&ghost->base._resv);
```
```
ttm_bo_put(ghost);
```
} else {
```
if (!bo->ttm) {
```
```
	ret = ttm_tt_create(bo, true);
```
```
	if (ret)
```
```
		return ret;
```
```
} else {
```
```
	ttm_tt_unpopulate(bo->bdev, bo->ttm);
```
```
	if (bo->type == ttm_bo_type_device)
```
```
		ttm_tt_mark_for_clear(bo->ttm);
```

That's not legal, you can't unpopulate it when the BO is busy.

Instead the TT object must be destroyed with the ghost and a new one created.

Christian.

...

}
ttm_resource_free(bo, &bo->mem);
}
dma_resv_unlock(&ghost->base._resv);

ttm_bo_put(ghost);

ttm_resource_alloc(bo, &sys_mem, &bo->mem);

return 0; }

diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c index 539e0232cb3b..0b1053e93db2 100644 --- a/drivers/gpu/drm/ttm/ttm_tt.c +++ b/drivers/gpu/drm/ttm/ttm_tt.c @@ -134,6 +134,11 @@ void ttm_tt_destroy_common(struct ttm_device *bdev, struct ttm_tt *ttm) } EXPORT_SYMBOL(ttm_tt_destroy_common);

+void ttm_tt_mark_for_clear(struct ttm_tt *ttm) +{

ttm->page_flags |= TTM_PAGE_FLAG_ZERO_ALLOC;

+}

void ttm_tt_destroy(struct ttm_device *bdev, struct ttm_tt *ttm) { bdev->funcs->ttm_tt_destroy(bdev, ttm);

diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h index 134d09ef7766..91552c83ac79 100644 --- a/include/drm/ttm/ttm_tt.h +++ b/include/drm/ttm/ttm_tt.h @@ -157,6 +157,16 @@ int ttm_tt_populate(struct ttm_device *bdev, struct ttm_tt *ttm, struct ttm_oper */ void ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm);

+/**

ttm_tt_mark_for_clear - Mark pages for clearing on populate.

@ttm: Pointer to the ttm_tt structure

Marks pages for clearing so that the next time the page vector is

populated, the pages will be cleared.

*/

+void ttm_tt_mark_for_clear(struct ttm_tt *ttm);

void ttm_tt_mgr_init(unsigned long num_pages, unsigned long num_dma32_pages);

#if IS_ENABLED(CONFIG_AGP)

Thomas Hellström

8:43 a.m.

New subject: [RFC PATCH 4/5] drm/ttm: Document and optimize ttm_bo_pipeline_gutting()

On 5/21/21 10:21 AM, Christian König wrote:

...

Am 20.05.21 um 17:09 schrieb Thomas Hellström:

...
If the bo is idle when calling ttm_bo_pipeline_gutting(), we unnecessarily create a ghost object and push it out to delayed destroy. Fix this by adding a path for idle, and document the function.

Also avoid having the bo end up in a bad state vulnerable to user-space triggered kernel BUGs if the call to ttm_tt_create() fails.

Finally reuse ttm_bo_pipeline_gutting() in ttm_bo_evict().

Cc: Christian König christian.koenig@amd.com Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com

drivers/gpu/drm/ttm/ttm_bo.c      | 20 +++++----- drivers/gpu/drm/ttm/ttm_bo_util.c | 63 ++++++++++++++++++++++++------- drivers/gpu/drm/ttm/ttm_tt.c      | 5 +++ include/drm/ttm/ttm_tt.h          | 10 +++++ 4 files changed, 75 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index ca1b098b6a56..a8fa3375b8aa 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -501,10 +501,15 @@ static int ttm_bo_evict(struct ttm_buffer_object *bo,       bdev->funcs->evict_flags(bo, &placement);       if (!placement.num_placement && !placement.num_busy_placement) { -        ttm_bo_wait(bo, false, false); +        ret = ttm_bo_wait(bo, true, false); +        if (ret) +            return ret; -        ttm_bo_cleanup_memtype_use(bo); -        return ttm_tt_create(bo, false); +        /* +         * Since we've already synced, this frees backing store +         * immediately. +         */ +        return ttm_bo_pipeline_gutting(bo);

Yeah, we tried to avoid pipeline_gutting here because of eviction. But I think when you wait before that should work.

...
}       ret = ttm_bo_mem_space(bo, &placement, &evict_mem, ctx); @@ -974,13 +979,8 @@ int ttm_bo_validate(struct ttm_buffer_object *bo,       /*        * Remove the backing store if no placement is given.        */ -    if (!placement->num_placement && !placement->num_busy_placement) { -        ret = ttm_bo_pipeline_gutting(bo); -        if (ret) -            return ret;

-        return ttm_tt_create(bo, false); -    } +    if (!placement->num_placement && !placement->num_busy_placement) +        return ttm_bo_pipeline_gutting(bo);       /*        * Check whether we need to move buffer. diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c index 919ee03f7eb3..1860e2e7563f 100644 --- a/drivers/gpu/drm/ttm/ttm_bo_util.c +++ b/drivers/gpu/drm/ttm/ttm_bo_util.c @@ -479,7 +479,8 @@ static void ttm_transfered_destroy(struct ttm_buffer_object *bo)    */ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo, -                      struct ttm_buffer_object **new_obj) +                      struct ttm_buffer_object **new_obj, +                      bool realloc_tt) {       struct ttm_transfer_obj *fbo;       int ret; @@ -493,6 +494,17 @@ static int ttm_buffer_object_transfer(struct ttm_buffer_object *bo,       ttm_bo_get(bo);       fbo->bo = bo; +    if (realloc_tt) { +        bo->ttm = NULL; +        ret = ttm_tt_create(bo, true); +        if (ret) { +            bo->ttm = fbo->base.ttm; +            kfree(fbo); +            ttm_bo_put(bo); +            return ret; +        } +    }

Can't we keep that logic in the caller? I think that would be cleaner.

Indeed, let me see if we can do that without breaking anything.

...

...
/**        * Fix up members that we shouldn't copy directly:        * TODO: Explicit member copy would probably be better here. @@ -763,7 +775,7 @@ static int ttm_bo_move_to_ghost(struct ttm_buffer_object *bo,       dma_fence_put(bo->moving);       bo->moving = dma_fence_get(fence); -    ret = ttm_buffer_object_transfer(bo, &ghost_obj); +    ret = ttm_buffer_object_transfer(bo, &ghost_obj, false);       if (ret)           return ret; @@ -836,26 +848,51 @@ int ttm_bo_move_accel_cleanup(struct ttm_buffer_object *bo, } EXPORT_SYMBOL(ttm_bo_move_accel_cleanup); +/**

ttm_bo_pipeline_gutting - purge the contents of a bo

@bo: The buffer object

Purge the contents of a bo, async if the bo is not idle.

After a successful call, the bo is left unpopulated in

system placement. The function may wait uninterruptible

for idle on OOM.

Return: 0 if successful, negative error code on failure.

*/

int ttm_bo_pipeline_gutting(struct ttm_buffer_object *bo) {       static const struct ttm_place sys_mem = { .mem_type = TTM_PL_SYSTEM };       struct ttm_buffer_object *ghost;       int ret; -    ret = ttm_buffer_object_transfer(bo, &ghost); -    if (ret) -        return ret; +    /* If already idle, no need for ghost object dance. */ +    ret = ttm_bo_wait(bo, false, true); +    if (ret == -EBUSY) { +        ret = ttm_buffer_object_transfer(bo, &ghost, true); +        if (ret) +            return ret;

When this is a shortcout to avoid work we should rather use the inverse notation.

In other words something like that:

if (ret != -EBUSY) {     ttm_resource_free(bo, &bo->mem);     ttm_resource_alloc(bo, &sys_mem, &bo->mem);     ttm_tt_create()...     return ret; }

OK.

...

...
- ret = dma_resv_copy_fences(&ghost->base._resv, bo->base.resv); - /* Last resort, wait for the BO to be idle when we are OOM */ - if (ret) - ttm_bo_wait(bo, false, false); + ret = dma_resv_copy_fences(&ghost->base._resv, bo->base.resv); + /* Last resort, wait for the BO to be idle when we are OOM */ + if (ret) + ttm_bo_wait(bo, false, false); - ttm_resource_alloc(bo, &sys_mem, &bo->mem); - bo->ttm = NULL; + dma_resv_unlock(&ghost->base._resv); + ttm_bo_put(ghost); + } else { + if (!bo->ttm) { + ret = ttm_tt_create(bo, true); + if (ret) + return ret; + } else { + ttm_tt_unpopulate(bo->bdev, bo->ttm); + if (bo->type == ttm_bo_type_device) + ttm_tt_mark_for_clear(bo->ttm);

That's not legal, you can't unpopulate it when the BO is busy.

Instead the TT object must be destroyed with the ghost and a new one created.

We've already verified that the bo is idle here, so we should be fine.

/Thomas

...

Christian.

...
+ } + ttm_resource_free(bo, &bo->mem); + } - dma_resv_unlock(&ghost->base._resv); - ttm_bo_put(ghost); + ttm_resource_alloc(bo, &sys_mem, &bo->mem); return 0; } diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c index 539e0232cb3b..0b1053e93db2 100644 --- a/drivers/gpu/drm/ttm/ttm_tt.c +++ b/drivers/gpu/drm/ttm/ttm_tt.c @@ -134,6 +134,11 @@ void ttm_tt_destroy_common(struct ttm_device *bdev, struct ttm_tt *ttm) } EXPORT_SYMBOL(ttm_tt_destroy_common); +void ttm_tt_mark_for_clear(struct ttm_tt *ttm) +{ + ttm->page_flags |= TTM_PAGE_FLAG_ZERO_ALLOC; +}

void ttm_tt_destroy(struct ttm_device *bdev, struct ttm_tt *ttm) { bdev->funcs->ttm_tt_destroy(bdev, ttm); diff --git a/include/drm/ttm/ttm_tt.h b/include/drm/ttm/ttm_tt.h index 134d09ef7766..91552c83ac79 100644 --- a/include/drm/ttm/ttm_tt.h +++ b/include/drm/ttm/ttm_tt.h @@ -157,6 +157,16 @@ int ttm_tt_populate(struct ttm_device *bdev, struct ttm_tt *ttm, struct ttm_oper */ void ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm); +/**

ttm_tt_mark_for_clear - Mark pages for clearing on populate.

@ttm: Pointer to the ttm_tt structure

Marks pages for clearing so that the next time the page vector is

populated, the pages will be cleared.

*/

+void ttm_tt_mark_for_clear(struct ttm_tt *ttm);

void ttm_tt_mgr_init(unsigned long num_pages, unsigned long num_dma32_pages); #if IS_ENABLED(CONFIG_AGP)

Thomas Hellström

20 May 20 May

3:09 p.m.

New subject: [RFC PATCH 5/5] drm/ttm, drm/amdgpu: Allow the driver some control over swapping

We are calling the eviction_valuable driver callback at eviction time to determine whether we actually can evict a buffer object. The upcoming i915 TTM backend needs the same functionality for swapout, and that might actually be beneficial to other drivers as well.

Add an eviction_valuable call also in the swapout path. Try to keep the current behaviour for all drivers by returning true if the buffer object is already in the TTM_PL_SYSTEM placement. We change behaviour for the case where a buffer object is in a TT backed placement when swapped out, in which case the drivers normal eviction_valuable path is run.

Finally make sure we don't try to swapout a bo that was recently purged and therefore unpopulated.

Cc: Christian König christian.koenig@amd.com Signed-off-by: Thomas Hellström thomas.hellstrom@linux.intel.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 4 +++ drivers/gpu/drm/ttm/ttm_bo.c | 41 +++++++++++++++---------- drivers/gpu/drm/ttm/ttm_tt.c | 4 +++ 3 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index 8c7ec09eb1a4..d5a9d7a88315 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -1399,6 +1399,10 @@ static bool amdgpu_ttm_bo_eviction_valuable(struct ttm_buffer_object *bo, struct dma_fence *f; int i;

+ /* Swapout? */ + if (bo->mem.mem_type == TTM_PL_SYSTEM) + return true; + if (bo->type == ttm_bo_type_kernel && !amdgpu_vm_evictable(ttm_to_amdgpu_bo(bo))) return false; diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c index a8fa3375b8aa..3f7c64a1cda1 100644 --- a/drivers/gpu/drm/ttm/ttm_bo.c +++ b/drivers/gpu/drm/ttm/ttm_bo.c @@ -536,6 +536,10 @@ static int ttm_bo_evict(struct ttm_buffer_object *bo, bool ttm_bo_eviction_valuable(struct ttm_buffer_object *bo, const struct ttm_place *place) { + dma_resv_assert_held(bo->base.resv); + if (bo->mem.mem_type == TTM_PL_SYSTEM) + return true; + /* Don't evict this BO if it's outside of the * requested placement range */ @@ -558,7 +562,9 @@ EXPORT_SYMBOL(ttm_bo_eviction_valuable); * b. Otherwise, trylock it. */ static bool ttm_bo_evict_swapout_allowable(struct ttm_buffer_object *bo, - struct ttm_operation_ctx *ctx, bool *locked, bool *busy) + struct ttm_operation_ctx *ctx, + const struct ttm_place *place, + bool *locked, bool *busy) { bool ret = false;

@@ -576,6 +582,12 @@ static bool ttm_bo_evict_swapout_allowable(struct ttm_buffer_object *bo, *busy = !ret; }

+ if (ret && place && !bo->bdev->funcs->eviction_valuable(bo, place)) { + ret = false; + if (locked) + dma_resv_unlock(bo->base.resv); + } + return ret; }

@@ -630,20 +642,14 @@ int ttm_mem_evict_first(struct ttm_device *bdev, list_for_each_entry(bo, &man->lru[i], lru) { bool busy;

- if (!ttm_bo_evict_swapout_allowable(bo, ctx, &locked, - &busy)) { + if (!ttm_bo_evict_swapout_allowable(bo, ctx, place, + &locked, &busy)) { if (busy && !busy_bo && ticket != dma_resv_locking_ctx(bo->base.resv)) busy_bo = bo; continue; }

- if (place && !bdev->funcs->eviction_valuable(bo, - place)) { - if (locked) - dma_resv_unlock(bo->base.resv); - continue; - } if (!ttm_bo_get_unless_zero(bo)) { if (locked) dma_resv_unlock(bo->base.resv); @@ -1138,10 +1144,18 @@ EXPORT_SYMBOL(ttm_bo_wait); int ttm_bo_swapout(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx, gfp_t gfp_flags) { + struct ttm_place place = {}; bool locked; int ret;

- if (!ttm_bo_evict_swapout_allowable(bo, ctx, &locked, NULL)) + /* + * While the bo may already reside in SYSTEM placement, set + * SYSTEM as new placement to cover also the move further below. + * The driver may use the fact that we're moving from SYSTEM + * as an indication that we're about to swap out. + */ + place.mem_type = TTM_PL_SYSTEM; + if (!ttm_bo_evict_swapout_allowable(bo, ctx, &place, &locked, NULL)) return -EBUSY;

if (!ttm_bo_get_unless_zero(bo)) { @@ -1166,12 +1180,7 @@ int ttm_bo_swapout(struct ttm_buffer_object *bo, struct ttm_operation_ctx *ctx, if (bo->mem.mem_type != TTM_PL_SYSTEM) { struct ttm_operation_ctx ctx = { false, false }; struct ttm_resource evict_mem; - struct ttm_place place, hop; - - memset(&place, 0, sizeof(place)); - memset(&hop, 0, sizeof(hop)); - - place.mem_type = TTM_PL_SYSTEM; + struct ttm_place hop = {};

ret = ttm_resource_alloc(bo, &place, &evict_mem); if (unlikely(ret)) diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c index 0b1053e93db2..d46c702ce0ca 100644 --- a/drivers/gpu/drm/ttm/ttm_tt.c +++ b/drivers/gpu/drm/ttm/ttm_tt.c @@ -263,6 +263,9 @@ int ttm_tt_swapout(struct ttm_device *bdev, struct ttm_tt *ttm, struct page *to_page; int i, ret;

+ if (!ttm_tt_is_populated(ttm)) + return 0; + swap_storage = shmem_file_setup("ttm swap", size, 0); if (IS_ERR(swap_storage)) { pr_err("Failed allocating swap storage\n"); @@ -404,6 +407,7 @@ void ttm_tt_unpopulate(struct ttm_device *bdev, struct ttm_tt *ttm)

ttm->page_flags &= ~TTM_PAGE_FLAG_PRIV_POPULATED; } +EXPORT_SYMBOL(ttm_tt_unpopulate);

#ifdef CONFIG_DEBUG_FS

-- 2.31.1

Thomas Hellström

3:17 p.m.

New subject: [RFC PATCH 5/5] drm/ttm, drm/amdgpu: Allow the driver some control over swapping

On Thu, 2021-05-20 at 17:09 +0200, Thomas Hellström wrote:

...

+EXPORT_SYMBOL(ttm_tt_unpopulate);

Oh, this one was a leftover. It's not meant to be included anymore.

/Thomas

...

#ifdef CONFIG_DEBUG_FS

1455

Age (days ago)

1462

Last active (days ago)

dri-devel@lists.freedesktop.org

12 comments

3 participants

tags (0)

participants (3)

Christian König
kernel test robot
Thomas Hellström