On Wed, Jul 01, 2020 at 02:07:44PM +0200, Daniel Vetter wrote:
Either my mailer ate half the thread or it's still stuck somewhere, so jumping in the middle a bit.
On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
-----Original Message-----
From: Jason Gunthorpe <jgg@ziepe.ca>
Sent: Tuesday, June 30, 2020 10:35 AM
To: Xiong, Jianxin <jianxin.xiong@intel.com>
Cc: linux-rdma@vger.kernel.org; Doug Ledford <dledford@redhat.com>; Sumit Semwal <sumit.semwal@linaro.org>; Leon Romanovsky <leon@kernel.org>; Vetter, Daniel <daniel.vetter@intel.com>; Christian Koenig <christian.koenig@amd.com>
Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier and ZONE_DEVICE to support shared virtual address space and page migration between system memory and device memory. HMM doesn't support pinning device memory because pages located on the device must be able to migrate to system memory when accessed by the CPU. Peer-to-peer access is possible if the peer can handle page faults. For RDMA, that means the NIC must support on-demand paging.

Peer-to-peer access is currently not possible with hmm_range_fault().
Currently hmm_range_fault() always sets the cpu access flag and device private pages are migrated to the system RAM in the fault handler. However, it's possible to have a modified code flow to keep the device private page info for use with peer to peer access.
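(For reference, a minimal sketch of how a driver would drive hmm_range_fault() on a 5.8-era kernel while keeping its own DEVICE_PRIVATE pages unmigrated via dev_private_owner; mm, start, end, NPFNS, my_drv_owner and umem->notifier are placeholders, not names from the patch set.)

	unsigned long pfns[NPFNS];
	struct hmm_range range = {
		.notifier		= &umem->notifier,	/* mmu_interval_notifier */
		.start			= start,
		.end			= end,
		.hmm_pfns		= pfns,
		.default_flags		= HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
		/* pages owned by this driver stay DEVICE_PRIVATE and their
		 * pfns are reported instead of being migrated to system RAM */
		.dev_private_owner	= my_drv_owner,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(range.notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto again;
	/* before using pfns[], recheck mmu_interval_read_retry(range.notifier,
	 * range.notifier_seq) under the driver lock and retry if it changed */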
Sort of, but only within the same device; RDMA or anything else generic can't reach inside a DEVICE_PRIVATE page and extract anything useful.
But pfn is supposed to be all that is needed.
So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
That's right, the patch alone is just half of the story. The functionality depends on the availability of a dma-buf exporter that can pin the device memory.
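(To make that dependency concrete: with an exporter that pins on map, the importer side is just the long-standing non-dynamic attach/map calls, roughly as below. This is a sketch with error handling trimmed, not the actual patch code; ib_dev and fd are placeholders.)

	#include <linux/dma-buf.h>

	struct dma_buf *dmabuf;
	struct dma_buf_attachment *attach;
	struct sg_table *sgt;

	dmabuf = dma_buf_get(fd);		/* fd passed in from userspace */
	attach = dma_buf_attach(dmabuf, ib_dev->dma_device);
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	/* the exporter is expected to have pinned the backing storage here;
	 * build the MR from sgt ... */

	dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
	dma_buf_detach(dmabuf, attach);
	dma_buf_put(dmabuf);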
Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the other parts need to be settled out first.
At the RDMA side, we mainly want to check if the changes are acceptable. For example, the part about adding 'fd' to the device ops and the ioctl interface. All the previous comments are very helpful for us to refine the patch so that we can be ready when GPU side support becomes available.
The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to support non-dynamic mapping?
We are working in that direction.
migrate to system RAM. This is due to the lack of knowledge about whether the importer can perform peer-to-peer access and the lack of resource limit control measures for the GPU. For the first part, the latest dma-buf driver has a peer-to-peer flag for the importer, but the flag is currently tied to dynamic mapping support, which requires on-demand paging support from the NIC to work.

ODP for DMA buf?
Right.
Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the locking might be a bit tricky though)
The main issue is that not all NICs support ODP.
You don't need on-demand paging support from the NIC for dynamic mapping to work.
All you need is the ability to stop and wait for ongoing accesses to end, and to make sure that new ones grab a new mapping.
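(In code terms, a rough sketch against the 5.8 dynamic dma-buf importer API; the invalidation helper, dev and umem are placeholders for whatever the RDMA driver would do to fence off the MR, not names from the patch set.)

	static void rdma_dmabuf_move_notify(struct dma_buf_attachment *attach)
	{
		/* called with the dma-buf's reservation lock held: stop/fence
		 * off ongoing DMA to the old mapping, then schedule work to
		 * grab a new mapping */
		rdma_umem_dmabuf_invalidate(attach->importer_priv);	/* placeholder */
	}

	static const struct dma_buf_attach_ops rdma_dmabuf_attach_ops = {
		.allow_peer2peer = true,	/* importer can do P2P DMA */
		.move_notify	 = rdma_dmabuf_move_notify,
	};

	attach = dma_buf_dynamic_attach(dmabuf, dev, &rdma_dmabuf_attach_ops, umem);

	/* (re)map under the reservation lock whenever a new mapping is needed */
	dma_resv_lock(dmabuf->resv, NULL);
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	dma_resv_unlock(dmabuf->resv);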
So, not having much of a clue about rdma myself, this sounds rather interesting. Sure it would result in immediately re-acquiring the pages, but that's also really all we need to be able to move buffers around on the gpu side. And with dma_resv_lock there's no livelock risk if the NIC immediately starts a kthread/work_struct which reacquires all the dma-bufs and everything else it needs. Plus also with the full ww_mutex deadlock backoff dance there are no locking issues with having to acquire an entire pile of dma_resv_locks; that's natively supported (gpus very much need to be able to lock arbitrary sets of buffers).
And I think if that would allow us to avoid the entire "avoid random drivers pinning dma-buf into vram" discussions, much better and quicker to land something like that.
I guess the big question is going to be how to fit this into rdma, since the ww_mutex deadlock backoff dance needs to be done at a fairly high level. For gpu drivers it's always done at the top level ioctl entry point.
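(For readers not familiar with that dance, a condensed sketch of what it looks like when open-coded over an array of dma-bufs, essentially what drm_gem_lock_reservations() does in drm_gem.c; bufs and count are placeholders.)

	struct ww_acquire_ctx ctx;
	int i, j, ret, contended = -1;

	ww_acquire_init(&ctx, &reservation_ww_class);
retry:
	if (contended != -1)
		/* block on the lock we lost last round, outside the loop */
		dma_resv_lock_slow(bufs[contended]->resv, &ctx);

	for (i = 0; i < count; i++) {
		if (i == contended)
			continue;
		ret = dma_resv_lock(bufs[i]->resv, &ctx);
		if (ret == -EDEADLK) {
			/* back off: drop everything held so far, then retry */
			for (j = 0; j < i; j++)
				dma_resv_unlock(bufs[j]->resv);
			if (contended > i)
				dma_resv_unlock(bufs[contended]->resv);
			contended = i;
			goto retry;
		}
	}
	ww_acquire_done(&ctx);
	/* ... do the work, then dma_resv_unlock() each buffer and ww_acquire_fini(&ctx) */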
Also, just to alleviate fears: I think all that dynamic dma-buf stuff for rdma should be doable this way _without_ having to interact with dma_fence. Avoiding that I think is the biggest request Jason has in this area :-)
Furthermore, it is officially ok to allocate memory while holding a dma_resv_lock. What is not ok (and might cause issues if you somehow mix up things in strange ways) is taking a userspace fault, because gpu drivers must be able to take the dma_resv_lock in their fault handlers. That might pose a problem.
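(A minimal illustration of that rule; obj, entry, args and uptr are made-up names, the point is only the ordering.)

	/* Fine: GFP_KERNEL allocations under the reservation lock are allowed. */
	dma_resv_lock(obj->resv, NULL);
	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
	/* ... */
	dma_resv_unlock(obj->resv);

	/* Not fine: a user-space access can fault, and a GPU driver's fault
	 * handler may need dma_resv_lock itself -> potential deadlock.
	 * Copy the data in before taking the lock instead. */
	if (copy_from_user(&args, uptr, sizeof(args)))
		return -EFAULT;
	dma_resv_lock(obj->resv, NULL);
	/* ... */
	dma_resv_unlock(obj->resv);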
Also, all these rules are now enforced by lockdep, might_fault() and similar checks. -Daniel
Apart from that this is a rather interesting work.
Regards, Christian.
There are a few possible ways to address these issues, such as decoupling the peer-to-peer flag from dynamic mapping, allowing more leeway for individual drivers to make the pinning decision, and adding GPU resource limit control via cgroup. We would like to get comments on this patch series with the assumption that device memory pinning via dma-buf is supported by some GPU drivers, and at the same time we welcome open discussion on how to address the aforementioned issues as well as GPU-NIC peer-to-peer access solutions in general.

These seem like DMA buf problems, not RDMA problems. Why are you asking these questions with an RDMA patch set? The usual DMA buf people are not even Cc'd here.
The intention is to have people from both RDMA and DMA buffer side to comment. Sumit Semwal is the DMA buffer maintainer according to the MAINTAINERS file. I agree more people could be invited to the discussion. Just added Christian Koenig to the cc-list.
MAINTAINERS also says to cc an entire pile of mailing lists, where the usual suspects (including Christian and me) hang around. Is that the reason I got only like half the thread here?
For next time around, really include everyone relevant here please. -Daniel
Would be good to have added the drm lists too
Thanks, cc'd dri-devel here, and will also do the same for the previous part of the thread.
If the umem_description you mentioned is for information used to create the umem (e.g. a structure for all the parameters), then this would work better.
It would make some more sense, and avoid all these weird EOPNOTSUPPs.
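(Purely as an illustration of the "structure for all the parameters" idea; every name below is hypothetical and not taken from the patch set.)

	/* hypothetical parameter bundle for ib_umem creation */
	struct ib_umem_description {
		u64	addr;		/* user virtual address */
		u64	length;		/* length of the region in bytes */
		int	access;		/* IB_ACCESS_* flags */
		int	dmabuf_fd;	/* dma-buf fd, or -1 for normal user memory */
		u64	dmabuf_offset;	/* offset into the dma-buf */
	};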
Good, thanks for the suggestion.
Jason
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch