Hey all,
This is a followup to my previous RFCs [1][2], which now adds a new api to the RDMA subsystem that allows drivers to get a pinned dmabuf memory region without requiring an implementation of the move_notify callback. The new api makes use of the dynamic attachment api implemented in the RDMA subsystem, but calls dma_buf_pin() in order to make sure that the callback will not be called, as suggested by Christian.
As explained in the previous RFC, move_notify requires the RDMA device to support on-demand-paging (ODP) which is not common on most devices (only supported by mlx5).
While the dynamic requirement makes sense for certain GPUs, some devices (such as habanalabs) have device memory that is always "pinned" and do not need/use the move_notify operation.
Patch #1 changes the dmabuf documentation to make it clear that pinning does not necessarily mean the memory must be moved to system memory, it is up to the exporter to decide. Patch #2 adds the RDMA api that allows drivers to get pinned dmabuf memory regions. Patch #3 adds the EFA implementation of the dmabuf importer.
The motivation of this submission is to use habanalabs as the dmabuf exporter, and EFA as the importer to allow for peer2peer access through libibverbs.
[1] https://lore.kernel.org/linux-rdma/20210818074352.29950-1-galpress@amazon.co... [2] https://lore.kernel.org/linux-rdma/20211007104301.76693-1-galpress@amazon.co...
Thanks
Gal Pressman (3): dma-buf: Fix pin callback comment RDMA/umem: Allow pinned dmabuf umem usage RDMA/efa: Add support for dmabuf memory regions
drivers/infiniband/core/umem_dmabuf.c | 51 +++++++++++ drivers/infiniband/hw/efa/efa.h | 4 + drivers/infiniband/hw/efa/efa_main.c | 1 + drivers/infiniband/hw/efa/efa_verbs.c | 127 +++++++++++++++++++------- include/linux/dma-buf.h | 4 +- include/rdma/ib_umem.h | 12 +++ 6 files changed, 166 insertions(+), 33 deletions(-)
The pin callback does not necessarily have to move the memory to system memory, remove the sentence from the comment.
Signed-off-by: Gal Pressman galpress@amazon.com --- include/linux/dma-buf.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index efdc56b9d95f..225e09caeb98 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -86,8 +86,8 @@ struct dma_buf_ops { * @pin: * * This is called by dma_buf_pin() and lets the exporter know that the - * DMA-buf can't be moved any more. The exporter should pin the buffer - * into system memory to make sure it is generally accessible by other + * DMA-buf can't be moved any more. Ideally, the exporter should + * pin the buffer so that it is generally accessible by all * devices. * * This is called with the &dmabuf.resv object locked and is mutual
Am 12.10.21 um 14:09 schrieb Gal Pressman:
The pin callback does not necessarily have to move the memory to system memory, remove the sentence from the comment.
Signed-off-by: Gal Pressman galpress@amazon.com
Reviewed-by: Christian König christian.koenig@amd.com
include/linux/dma-buf.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h index efdc56b9d95f..225e09caeb98 100644 --- a/include/linux/dma-buf.h +++ b/include/linux/dma-buf.h @@ -86,8 +86,8 @@ struct dma_buf_ops { * @pin: * * This is called by dma_buf_pin() and lets the exporter know that the
* DMA-buf can't be moved any more. The exporter should pin the buffer
* into system memory to make sure it is generally accessible by other
* DMA-buf can't be moved any more. Ideally, the exporter should
* pin the buffer so that it is generally accessible by all
- devices.
- This is called with the &dmabuf.resv object locked and is mutual
Introduce ib_umem_dmabuf_get_pinned() which allows the driver to get a dmabuf umem which is pinned and does not require move_notify callback implementation. The returned umem is pinned and DMA mapped like standard cpu umems, and is released through ib_umem_release() (incl. unpinning and unmapping).
Signed-off-by: Gal Pressman galpress@amazon.com --- drivers/infiniband/core/umem_dmabuf.c | 51 +++++++++++++++++++++++++++ include/rdma/ib_umem.h | 12 +++++++ 2 files changed, 63 insertions(+)
diff --git a/drivers/infiniband/core/umem_dmabuf.c b/drivers/infiniband/core/umem_dmabuf.c index c6e875619fac..4fd6abc4dd17 100644 --- a/drivers/infiniband/core/umem_dmabuf.c +++ b/drivers/infiniband/core/umem_dmabuf.c @@ -164,12 +164,63 @@ struct ib_umem_dmabuf *ib_umem_dmabuf_get(struct ib_device *device, } EXPORT_SYMBOL(ib_umem_dmabuf_get);
+static void +ib_umem_dmabuf_unsupported_move_notify(struct dma_buf_attachment *attach) +{ + struct ib_umem_dmabuf *umem_dmabuf = attach->importer_priv; + + ibdev_warn_ratelimited(umem_dmabuf->umem.ibdev, + "Invalidate callback should not be called when memory is pinned\n"); +} + +static struct dma_buf_attach_ops ib_umem_dmabuf_attach_pinned_ops = { + .allow_peer2peer = true, + .move_notify = ib_umem_dmabuf_unsupported_move_notify, +}; + +struct ib_umem_dmabuf *ib_umem_dmabuf_get_pinned(struct ib_device *device, + unsigned long offset, + size_t size, int fd, + int access) +{ + struct ib_umem_dmabuf *umem_dmabuf; + int err; + + umem_dmabuf = ib_umem_dmabuf_get(device, offset, size, fd, access, + &ib_umem_dmabuf_attach_pinned_ops); + if (IS_ERR(umem_dmabuf)) + return umem_dmabuf; + + dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL); + err = dma_buf_pin(umem_dmabuf->attach); + if (err) + goto err_release; + umem_dmabuf->pinned = 1; + + err = ib_umem_dmabuf_map_pages(umem_dmabuf); + if (err) + goto err_unpin; + dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv); + + return umem_dmabuf; + +err_unpin: + dma_buf_unpin(umem_dmabuf->attach); +err_release: + dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv); + ib_umem_release(&umem_dmabuf->umem); + return ERR_PTR(err); +} +EXPORT_SYMBOL(ib_umem_dmabuf_get_pinned); + void ib_umem_dmabuf_release(struct ib_umem_dmabuf *umem_dmabuf) { struct dma_buf *dmabuf = umem_dmabuf->attach->dmabuf;
dma_resv_lock(dmabuf->resv, NULL); ib_umem_dmabuf_unmap_pages(umem_dmabuf); + if (umem_dmabuf->pinned) + dma_buf_unpin(umem_dmabuf->attach); dma_resv_unlock(dmabuf->resv);
dma_buf_detach(dmabuf, umem_dmabuf->attach); diff --git a/include/rdma/ib_umem.h b/include/rdma/ib_umem.h index 676c57f5ca80..db9b73e765c3 100644 --- a/include/rdma/ib_umem.h +++ b/include/rdma/ib_umem.h @@ -40,6 +40,7 @@ struct ib_umem_dmabuf { unsigned long first_sg_offset; unsigned long last_sg_trim; void *private; + u8 pinned : 1; };
static inline struct ib_umem_dmabuf *to_ib_umem_dmabuf(struct ib_umem *umem) @@ -140,6 +141,10 @@ struct ib_umem_dmabuf *ib_umem_dmabuf_get(struct ib_device *device, unsigned long offset, size_t size, int fd, int access, const struct dma_buf_attach_ops *ops); +struct ib_umem_dmabuf *ib_umem_dmabuf_get_pinned(struct ib_device *device, + unsigned long offset, + size_t size, int fd, + int access); int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf); void ib_umem_dmabuf_unmap_pages(struct ib_umem_dmabuf *umem_dmabuf); void ib_umem_dmabuf_release(struct ib_umem_dmabuf *umem_dmabuf); @@ -180,6 +185,13 @@ struct ib_umem_dmabuf *ib_umem_dmabuf_get(struct ib_device *device, { return ERR_PTR(-EOPNOTSUPP); } +struct ib_umem_dmabuf *ib_umem_dmabuf_get_pinned(struct ib_device *device, + unsigned long offset, + size_t size, int fd, + int access) +{ + return ERR_PTR(-EOPNOTSUPP); +} static inline int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf) { return -EOPNOTSUPP;
Implement a dmabuf importer for the EFA driver. As ODP is not supported, the pinned dmabuf are used to prevent the move_notify callback from being called.
Signed-off-by: Gal Pressman galpress@amazon.com --- drivers/infiniband/hw/efa/efa.h | 4 + drivers/infiniband/hw/efa/efa_main.c | 1 + drivers/infiniband/hw/efa/efa_verbs.c | 127 +++++++++++++++++++------- 3 files changed, 101 insertions(+), 31 deletions(-)
diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h index 2b8ca099b381..407d7c4baa16 100644 --- a/drivers/infiniband/hw/efa/efa.h +++ b/drivers/infiniband/hw/efa/efa.h @@ -141,6 +141,10 @@ int efa_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr, struct ib_mr *efa_reg_mr(struct ib_pd *ibpd, u64 start, u64 length, u64 virt_addr, int access_flags, struct ib_udata *udata); +struct ib_mr *efa_reg_user_mr_dmabuf(struct ib_pd *ibpd, u64 start, + u64 length, u64 virt_addr, + int fd, int access_flags, + struct ib_udata *udata); int efa_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata); int efa_get_port_immutable(struct ib_device *ibdev, u32 port_num, struct ib_port_immutable *immutable); diff --git a/drivers/infiniband/hw/efa/efa_main.c b/drivers/infiniband/hw/efa/efa_main.c index 203e6ddcacbc..72cd7d952a07 100644 --- a/drivers/infiniband/hw/efa/efa_main.c +++ b/drivers/infiniband/hw/efa/efa_main.c @@ -267,6 +267,7 @@ static const struct ib_device_ops efa_dev_ops = { .query_port = efa_query_port, .query_qp = efa_query_qp, .reg_user_mr = efa_reg_mr, + .reg_user_mr_dmabuf = efa_reg_user_mr_dmabuf,
INIT_RDMA_OBJ_SIZE(ib_ah, efa_ah, ibah), INIT_RDMA_OBJ_SIZE(ib_cq, efa_cq, ibcq), diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c index be6d3ff0f1be..659f5aab3d9e 100644 --- a/drivers/infiniband/hw/efa/efa_verbs.c +++ b/drivers/infiniband/hw/efa/efa_verbs.c @@ -3,6 +3,8 @@ * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved. */
+#include <linux/dma-buf.h> +#include <linux/dma-resv.h> #include <linux/vmalloc.h> #include <linux/log2.h>
@@ -1491,26 +1493,18 @@ static int efa_create_pbl(struct efa_dev *dev, return 0; }
-struct ib_mr *efa_reg_mr(struct ib_pd *ibpd, u64 start, u64 length, - u64 virt_addr, int access_flags, - struct ib_udata *udata) +static struct efa_mr *efa_alloc_mr(struct ib_pd *ibpd, int access_flags, + struct ib_udata *udata) { struct efa_dev *dev = to_edev(ibpd->device); - struct efa_com_reg_mr_params params = {}; - struct efa_com_reg_mr_result result = {}; - struct pbl_context pbl; int supp_access_flags; - unsigned int pg_sz; struct efa_mr *mr; - int inline_size; - int err;
if (udata && udata->inlen && !ib_is_udata_cleared(udata, 0, sizeof(udata->inlen))) { ibdev_dbg(&dev->ibdev, "Incompatible ABI params, udata not cleared\n"); - err = -EINVAL; - goto err_out; + return ERR_PTR(-EINVAL); }
supp_access_flags = @@ -1522,23 +1516,26 @@ struct ib_mr *efa_reg_mr(struct ib_pd *ibpd, u64 start, u64 length, ibdev_dbg(&dev->ibdev, "Unsupported access flags[%#x], supported[%#x]\n", access_flags, supp_access_flags); - err = -EOPNOTSUPP; - goto err_out; + return ERR_PTR(-EOPNOTSUPP); }
mr = kzalloc(sizeof(*mr), GFP_KERNEL); - if (!mr) { - err = -ENOMEM; - goto err_out; - } + if (!mr) + return ERR_PTR(-ENOMEM);
- mr->umem = ib_umem_get(ibpd->device, start, length, access_flags); - if (IS_ERR(mr->umem)) { - err = PTR_ERR(mr->umem); - ibdev_dbg(&dev->ibdev, - "Failed to pin and map user space memory[%d]\n", err); - goto err_free; - } + return mr; +} + +static int efa_register_mr(struct ib_pd *ibpd, struct efa_mr *mr, u64 start, + u64 length, u64 virt_addr, int access_flags) +{ + struct efa_dev *dev = to_edev(ibpd->device); + struct efa_com_reg_mr_params params = {}; + struct efa_com_reg_mr_result result = {}; + struct pbl_context pbl; + unsigned int pg_sz; + int inline_size; + int err;
params.pd = to_epd(ibpd)->pdn; params.iova = virt_addr; @@ -1549,10 +1546,9 @@ struct ib_mr *efa_reg_mr(struct ib_pd *ibpd, u64 start, u64 length, dev->dev_attr.page_size_cap, virt_addr); if (!pg_sz) { - err = -EOPNOTSUPP; ibdev_dbg(&dev->ibdev, "Failed to find a suitable page size in page_size_cap %#llx\n", dev->dev_attr.page_size_cap); - goto err_unmap; + return -EOPNOTSUPP; }
params.page_shift = order_base_2(pg_sz); @@ -1566,21 +1562,21 @@ struct ib_mr *efa_reg_mr(struct ib_pd *ibpd, u64 start, u64 length, if (params.page_num <= inline_size) { err = efa_create_inline_pbl(dev, mr, ¶ms); if (err) - goto err_unmap; + return err;
err = efa_com_register_mr(&dev->edev, ¶ms, &result); if (err) - goto err_unmap; + return err; } else { err = efa_create_pbl(dev, &pbl, mr, ¶ms); if (err) - goto err_unmap; + return err;
err = efa_com_register_mr(&dev->edev, ¶ms, &result); pbl_destroy(dev, &pbl);
if (err) - goto err_unmap; + return err; }
mr->ibmr.lkey = result.l_key; @@ -1588,9 +1584,78 @@ struct ib_mr *efa_reg_mr(struct ib_pd *ibpd, u64 start, u64 length, mr->ibmr.length = length; ibdev_dbg(&dev->ibdev, "Registered mr[%d]\n", mr->ibmr.lkey);
+ return 0; +} + +struct ib_mr *efa_reg_user_mr_dmabuf(struct ib_pd *ibpd, u64 start, + u64 length, u64 virt_addr, + int fd, int access_flags, + struct ib_udata *udata) +{ + struct efa_dev *dev = to_edev(ibpd->device); + struct ib_umem_dmabuf *umem_dmabuf; + struct efa_mr *mr; + int err; + + mr = efa_alloc_mr(ibpd, access_flags, udata); + if (IS_ERR(mr)) { + err = PTR_ERR(mr); + goto err_out; + } + + umem_dmabuf = ib_umem_dmabuf_get_pinned(ibpd->device, start, length, fd, + access_flags); + if (IS_ERR(umem_dmabuf)) { + ibdev_dbg(&dev->ibdev, "Failed to get dmabuf umem[%d]\n", err); + err = PTR_ERR(umem_dmabuf); + goto err_free; + } + + mr->umem = &umem_dmabuf->umem; + err = efa_register_mr(ibpd, mr, start, length, virt_addr, access_flags); + if (err) + goto err_release; + return &mr->ibmr;
-err_unmap: +err_release: + ib_umem_release(mr->umem); +err_free: + kfree(mr); +err_out: + atomic64_inc(&dev->stats.reg_mr_err); + return ERR_PTR(err); +} + +struct ib_mr *efa_reg_mr(struct ib_pd *ibpd, u64 start, u64 length, + u64 virt_addr, int access_flags, + struct ib_udata *udata) +{ + struct efa_dev *dev = to_edev(ibpd->device); + struct efa_mr *mr; + int err; + + mr = efa_alloc_mr(ibpd, access_flags, udata); + if (IS_ERR(mr)) { + err = PTR_ERR(mr); + goto err_out; + } + + mr->umem = ib_umem_get(ibpd->device, start, length, access_flags); + if (IS_ERR(mr->umem)) { + err = PTR_ERR(mr->umem); + ibdev_dbg(&dev->ibdev, + "Failed to pin and map user space memory[%d]\n", err); + goto err_free; + } + + err = efa_register_mr(ibpd, mr, start, length, virt_addr, access_flags); + if (err) + goto err_release; + + return &mr->ibmr; + +err_release: ib_umem_release(mr->umem); err_free: kfree(mr);
On 12/10/2021 15:09, Gal Pressman wrote:
Hey all,
This is a followup to my previous RFCs [1][2], which now adds a new api to the RDMA subsystem that allows drivers to get a pinned dmabuf memory region without requiring an implementation of the move_notify callback. The new api makes use of the dynamic attachment api implemented in the RDMA subsystem, but calls dma_buf_pin() in order to make sure that the callback will not be called, as suggested by Christian.
As explained in the previous RFC, move_notify requires the RDMA device to support on-demand-paging (ODP) which is not common on most devices (only supported by mlx5).
While the dynamic requirement makes sense for certain GPUs, some devices (such as habanalabs) have device memory that is always "pinned" and do not need/use the move_notify operation.
Patch #1 changes the dmabuf documentation to make it clear that pinning does not necessarily mean the memory must be moved to system memory, it is up to the exporter to decide. Patch #2 adds the RDMA api that allows drivers to get pinned dmabuf memory regions. Patch #3 adds the EFA implementation of the dmabuf importer.
The motivation of this submission is to use habanalabs as the dmabuf exporter, and EFA as the importer to allow for peer2peer access through libibverbs.
[1] https://lore.kernel.org/linux-rdma/20210818074352.29950-1-galpress@amazon.co... [2] https://lore.kernel.org/linux-rdma/20211007104301.76693-1-galpress@amazon.co...
Thanks
Hey Jason, did you get a chance to take a look?
On Tue, Oct 12, 2021 at 03:09:00PM +0300, Gal Pressman wrote:
Hey all,
This is a followup to my previous RFCs [1][2], which now adds a new api to the RDMA subsystem that allows drivers to get a pinned dmabuf memory region without requiring an implementation of the move_notify callback. The new api makes use of the dynamic attachment api implemented in the RDMA subsystem, but calls dma_buf_pin() in order to make sure that the callback will not be called, as suggested by Christian.
As explained in the previous RFC, move_notify requires the RDMA device to support on-demand-paging (ODP) which is not common on most devices (only supported by mlx5).
While the dynamic requirement makes sense for certain GPUs, some devices (such as habanalabs) have device memory that is always "pinned" and do not need/use the move_notify operation.
Patch #1 changes the dmabuf documentation to make it clear that pinning does not necessarily mean the memory must be moved to system memory, it is up to the exporter to decide. Patch #2 adds the RDMA api that allows drivers to get pinned dmabuf memory regions. Patch #3 adds the EFA implementation of the dmabuf importer.
The motivation of this submission is to use habanalabs as the dmabuf exporter, and EFA as the importer to allow for peer2peer access through libibverbs.
[1] https://lore.kernel.org/linux-rdma/20210818074352.29950-1-galpress@amazon.co... [2] https://lore.kernel.org/linux-rdma/20211007104301.76693-1-galpress@amazon.co...
Thanks
Gal Pressman (3): dma-buf: Fix pin callback comment RDMA/umem: Allow pinned dmabuf umem usage RDMA/efa: Add support for dmabuf memory regions
Applied to for-next, thanks
Jason
dri-devel@lists.freedesktop.org