-----Original Message-----
From: Daniel Vetter daniel@ffwll.ch
Sent: Tuesday, October 06, 2020 2:22 AM
To: Xiong, Jianxin jianxin.xiong@intel.com
Cc: Jason Gunthorpe jgg@ziepe.ca; Leon Romanovsky leon@kernel.org; linux-rdma@vger.kernel.org; dri-devel@lists.freedesktop.org; Doug Ledford dledford@redhat.com; Vetter, Daniel daniel.vetter@intel.com; Christian Koenig christian.koenig@amd.com
Subject: Re: [RFC PATCH v3 1/4] RDMA/umem: Support importing dma-buf as user memory region
On Mon, Oct 05, 2020 at 04:18:11PM +0000, Xiong, Jianxin wrote:
-----Original Message-----
From: Jason Gunthorpe jgg@ziepe.ca
Sent: Monday, October 05, 2020 6:13 AM
To: Xiong, Jianxin jianxin.xiong@intel.com
Cc: linux-rdma@vger.kernel.org; dri-devel@lists.freedesktop.org; Doug Ledford dledford@redhat.com; Leon Romanovsky leon@kernel.org; Sumit Semwal sumit.semwal@linaro.org; Christian Koenig christian.koenig@amd.com; Vetter, Daniel daniel.vetter@intel.com
Subject: Re: [RFC PATCH v3 1/4] RDMA/umem: Support importing dma-buf as user memory region
On Sun, Oct 04, 2020 at 12:12:28PM -0700, Jianxin Xiong wrote:
Dma-buf is a standard cross-driver buffer sharing mechanism that can be used to support peer-to-peer access from RDMA devices.
Device memory exported via dma-buf is associated with a file descriptor. This is passed to the user space as a property associated with the buffer allocation. When the buffer is registered as a memory region, the file descriptor is passed to the RDMA driver along with other parameters.
Implement the common code for importing a dma-buf object and mapping dma-buf pages.
Signed-off-by: Jianxin Xiong jianxin.xiong@intel.com
Reviewed-by: Sean Hefty sean.hefty@intel.com
Acked-by: Michael J. Ruhl michael.j.ruhl@intel.com
 drivers/infiniband/core/Makefile      |   2 +-
 drivers/infiniband/core/umem.c        |   4 +
 drivers/infiniband/core/umem_dmabuf.c | 291 ++++++++++++++++++++++++++++++++++
 drivers/infiniband/core/umem_dmabuf.h |  14 ++
 drivers/infiniband/core/umem_odp.c    |  12 ++
 include/rdma/ib_umem.h                |  19 ++-
 6 files changed, 340 insertions(+), 2 deletions(-)
 create mode 100644 drivers/infiniband/core/umem_dmabuf.c
 create mode 100644 drivers/infiniband/core/umem_dmabuf.h
I think this is using ODP too literally, dmabuf isn't going to need fine grained page faults, and I'm not sure this locking scheme is OK - ODP is horrifically complicated.
If this is the approach then I think we should make dmabuf its own standalone API, reg_user_mr_dmabuf().
That's the original approach in the first version. We can go back there.
The implementation in mlx5 will be much more understandable, it would just do dma_buf_dynamic_attach() and program the XLT exactly the same as a normal umem.
The move_notify() simply zaps the XLT and triggers a work item to reload it after the move. Locking is provided by the dma_resv_lock. Only a small disruption to the page fault handler is needed.
We considered such a scheme but didn't go that way because there is no notification when the move is done, so the work wouldn't know when it can reload.
Now that I think about it again, we could probably signal the reload in the page fault handler.
For reinstating the pages you need:
- dma_resv_lock, this prevents anyone else from issuing new moves or anything like that
- dma_resv_get_excl + dma_fence_wait to wait for any pending moves to finish. GPUs generally don't wait on the CPU, but block the dependent DMA operations from being scheduled until that fence has fired. But for RDMA ODP I think you need the CPU wait in your worker here.
- get the new sg list, write it into your ptes
- dma_resv_unlock to make sure you're not racing with a concurrent move_notify
You can also grab multiple dma_resv_locks atomically, but I think the ODP RDMA model doesn't require that (GPUs need that).
Note that you're allowed to allocate memory with GFP_KERNEL while holding dma_resv_lock, so this shouldn't pose any issues. You are otoh not allowed to cause userspace faults (so no gup/pup or copy_*_user with faulting enabled). So all in all this shouldn't be any worse than calling pup for normal umem.
Unlike with mmu notifiers, the caller already holds dma_resv_lock for you around the move_notify callback, so you shouldn't need any additional locking in there (aside from what you need to zap the ptes and flush hw tlbs).
Cheers, Daniel
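To make the ordering above concrete, here is a toy userspace model of the zap/reload sequence. It is only a sketch and not kernel code: every model_* name is hypothetical, the atomic_flag merely stands in for dma_resv_lock, and the atomic_bool stands in for the exclusive fence that dma_resv_get_excl + dma_fence_wait would wait on. The point is just the order of operations: the lock is taken, the pending move is waited for, the new addresses are written into the ptes, and only then is the lock dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NPAGES    4
#define MODEL_PAGE_SIZE 4096u

/* Hypothetical stand-in for a dma-buf plus the importer's page table. */
struct model_buf {
	atomic_flag resv;                /* plays the role of dma_resv_lock */
	atomic_bool move_done;           /* plays the role of the exclusive fence */
	uint64_t backing[MODEL_NPAGES];  /* where the exporter keeps the pages now */
	uint64_t ptes[MODEL_NPAGES];     /* what the importer programmed; 0 = zapped */
};

static void model_resv_lock(struct model_buf *b)
{
	while (atomic_flag_test_and_set(&b->resv))
		; /* spin until the lock is free */
}

static void model_resv_unlock(struct model_buf *b)
{
	atomic_flag_clear(&b->resv);
}

static void model_place(struct model_buf *b, uint64_t base)
{
	for (int i = 0; i < MODEL_NPAGES; i++)
		b->backing[i] = base + (uint64_t)i * MODEL_PAGE_SIZE;
}

static void model_init(struct model_buf *b, uint64_t base)
{
	atomic_flag_clear(&b->resv);
	atomic_init(&b->move_done, true);
	model_place(b, base);
	for (int i = 0; i < MODEL_NPAGES; i++)
		b->ptes[i] = 0;
}

/* Importer callback. The caller already holds the lock, so it only zaps. */
static void model_move_notify(struct model_buf *b)
{
	for (int i = 0; i < MODEL_NPAGES; i++)
		b->ptes[i] = 0;
}

/* Exporter side: start a move and notify the importer under the lock. */
static void model_move_begin(struct model_buf *b, uint64_t new_base)
{
	model_resv_lock(b);
	atomic_store(&b->move_done, false);
	model_move_notify(b);
	model_place(b, new_base);
	model_resv_unlock(b);
}

/* The "fence" fires once the exporter has finished moving the data. */
static void model_move_finish(struct model_buf *b)
{
	atomic_store(&b->move_done, true);
}

/* Importer worker: lock, wait for the move, reprogram ptes, unlock. */
static void model_reload(struct model_buf *b)
{
	model_resv_lock(b);
	while (!atomic_load(&b->move_done))
		; /* CPU wait for the pending move */
	for (int i = 0; i < MODEL_NPAGES; i++)
		b->ptes[i] = b->backing[i];
	model_resv_unlock(b);
}
```

In this model, calling model_move_begin() leaves all ptes zapped, and a model_reload() issued after model_move_finish() sees the new placement. Because the ptes are written before the lock is dropped, a concurrent move_notify cannot slip in between fetching the addresses and programming them.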
Hi Daniel, thanks for providing the details. I would have missed the dma_resv_get_excl + dma_fence_wait part otherwise.
+	dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
+	sgt = dma_buf_map_attachment(umem_dmabuf->attach,
+				     DMA_BIDIRECTIONAL);
+	dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
This doesn't look right; this lock has to be held until the HW is programmed.
The mapping remains valid until it is invalidated again. There is a sequence number check before programming the HW.
The use of atomic looks probably wrong as well.
Do you mean umem_dmabuf->notifier_seq? Could you elaborate on the concern?
+	k = 0;
+	total_pages = ib_umem_odp_num_pages(umem_odp);
+	for_each_sg(umem->sg_head.sgl, sg, umem->sg_head.nents, j) {
+		addr = sg_dma_address(sg);
+		pages = sg_dma_len(sg) >> page_shift;
+		while (pages > 0 && k < total_pages) {
+			umem_odp->dma_list[k++] = addr | access_mask;
+			umem_odp->npages++;
+			addr += page_size;
+			pages--;
This isn't fragmenting the sg into a page list properly; it won't work for unaligned addresses or lengths.
I thought the addresses were aligned, but will add explicit alignment here.
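As an illustration of the alignment concern, below is a minimal userspace sketch of splitting DMA segments into a page list. It is not the patch's code: struct dma_seg and segs_to_pages() are hypothetical names, and the rule that only the first segment may start mid-page and only the last may end mid-page is an assumption about what a flat page list can represent. Unlike a plain len >> page_shift, it rounds the start down and the end up, so partial pages are not dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA-mapped scatterlist entry. */
struct dma_seg {
	uint64_t addr;
	uint64_t len;
};

/*
 * Fill pages[] with the page-aligned address of every page touched by the
 * segments. Returns the number of pages written, or -1 if the layout cannot
 * be expressed as a flat page list or pages[] is too small.
 */
static long segs_to_pages(const struct dma_seg *segs, size_t nsegs,
			  unsigned int page_shift,
			  uint64_t *pages, size_t max_pages)
{
	const uint64_t page_size = 1ULL << page_shift;
	const uint64_t page_mask = page_size - 1;
	size_t k = 0;

	for (size_t i = 0; i < nsegs; i++) {
		uint64_t start = segs[i].addr;
		uint64_t end = start + segs[i].len;

		if (segs[i].len == 0)
			continue;
		/* Interior boundaries must sit on page boundaries. */
		if (i != 0 && (start & page_mask))
			return -1;
		if (i != nsegs - 1 && (end & page_mask))
			return -1;

		/* Round the start down and the end up to whole pages. */
		for (uint64_t a = start & ~page_mask;
		     a < ((end + page_mask) & ~page_mask); a += page_size) {
			if (k == max_pages)
				return -1;
			pages[k++] = a;
		}
	}
	return (long)k;
}
```

For a single segment at 0x1800 of length 0x1000 with 4 KiB pages, this yields two pages (0x1000 and 0x2000), whereas len >> page_shift would count only one and would start from the unaligned address.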
And really we don't need the dma_list for this case, with a fixed whole mapping DMA SGL a normal umem sgl is OK and the normal umem XLT programming in mlx5 is fine.
The dma_list is used by both "populate_mtt()" and "mlx5_ib_invalidate_range", which are used for XLT programming and invalidating (zapping), respectively.
Jason
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch