-----Original Message----- From: Jason Gunthorpe jgg@ziepe.ca Sent: Tuesday, June 30, 2020 10:35 AM To: Xiong, Jianxin jianxin.xiong@intel.com Cc: linux-rdma@vger.kernel.org; Doug Ledford dledford@redhat.com; Sumit Semwal sumit.semwal@linaro.org; Leon Romanovsky leon@kernel.org; Vetter, Daniel daniel.vetter@intel.com; Christian Koenig christian.koenig@amd.com Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier and ZONE_DEVICE to support shared virtual address space and page migration between system memory and device memory. HMM doesn't support pinning device memory because pages located on device must be able to migrate to system memory when accessed by CPU. Peer-to-peer access is possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.
peer-peer access is currently not possible with hmm_range_fault().
Currently hmm_range_fault() always sets the cpu access flag and device private pages are migrated to the system RAM in the fault handler. However, it's possible to have a modified code flow to keep the device private page info for use with peer to peer access.
Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
But pfn is supposed to be all that is needed.
So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
That's right, the patch alone is just half of the story. The functionality depends on availability of dma-buf exporter that can pin the device memory.
Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the other parts need to be settled out first.
At the RDMA side, we mainly want to check if the changes are acceptable. For example, the part about adding 'fd' to the device ops and the ioctl interface. All the previous comments are very helpful for us to refine the patch so that we can be ready when GPU side support becomes available.
The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to support non-dynamic mapping?
We are working in that direction.
migrate to system RAM. This is due to the lack of knowledge about whether the importer can perform peer-to-peer access and the lack of resource limit control measure for GPU. For the first part, the latest dma-buf driver has a peer-to-peer flag for the importer, but the flag is currently tied to dynamic mapping support, which requires on-demand paging support from the NIC to work.
ODP for DMA buf?
Right.
Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the locking might be a bit tricky though)
The main issue is that not all NICs support ODP.
There are a few possible ways to address these issues, such as decoupling peer-to-peer flag from dynamic mapping, allowing more leeway for individual drivers to make the pinning decision and adding GPU resource limit control via cgroup. We would like to get comments on this patch series with the assumption that device memory pinning via dma-buf is supported by some GPU drivers, and at the same time welcome open discussions on how to address the aforementioned issues as well as GPU-NIC peer-to-peer access solutions in general.
These seem like DMA buf problems, not RDMA problems, why are you asking these questions with a RDMA patch set? The usual DMA buf people are not even Cc'd here.
The intention is to have people from both the RDMA and DMA buffer sides comment. Sumit Semwal is the DMA buffer maintainer according to the MAINTAINERS file. I agree more people could be invited to the discussion. Just added Christian Koenig to the cc-list.
Would be good to have added the drm lists too
Thanks, cc'd dri-devel here, and will also do the same for the previous part of the thread.
If the umem_description you mentioned is for information used to create the umem (e.g. a structure for all the parameters), then this would work better.
It would make some more sense, and avoid all these weird EOPNOTSUPPS.
Good, thanks for the suggestion.
Jason
On Tue, Jun 30, 2020 at 06:46:17PM +0000, Xiong, Jianxin wrote:
From: Jason Gunthorpe jgg@ziepe.ca Sent: Tuesday, June 30, 2020 10:35 AM To: Xiong, Jianxin jianxin.xiong@intel.com Cc: linux-rdma@vger.kernel.org; Doug Ledford dledford@redhat.com; Sumit Semwal sumit.semwal@linaro.org; Leon Romanovsky leon@kernel.org; Vetter, Daniel daniel.vetter@intel.com; Christian Koenig christian.koenig@amd.com Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier and ZONE_DEVICE to support shared virtual address space and page migration between system memory and device memory. HMM doesn't support pinning device memory because pages located on device must be able to migrate to system memory when accessed by CPU. Peer-to-peer access is possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.
peer-peer access is currently not possible with hmm_range_fault().
Currently hmm_range_fault() always sets the cpu access flag and device private pages are migrated to the system RAM in the fault handler. However, it's possible to have a modified code flow to keep the device private page info for use with peer to peer access.
Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
But pfn is supposed to be all that is needed.
Needed for what? The PFN of the DEVICE_PRIVATE pages is useless for anything.
Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the other parts need to be settled out first.
At the RDMA side, we mainly want to check if the changes are acceptable. For example, the part about adding 'fd' to the device ops and the ioctl interface. All the previous comments are very helpful for us to refine the patch so that we can be ready when GPU side support becomes available.
Well, I'm not totally happy with the way the umem and the fd is handled so roughly and incompletely..
Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the locking might be a bit tricky though)
The main issue is that not all NICs support ODP.
Sure, but there is lots of infrastructure work here to be done on dma buf, having a correct consumer in the form of ODP might be helpful to advance it.
Jason
-----Original Message----- From: Jason Gunthorpe jgg@ziepe.ca Sent: Tuesday, June 30, 2020 12:17 PM To: Xiong, Jianxin jianxin.xiong@intel.com Cc: linux-rdma@vger.kernel.org; Doug Ledford dledford@redhat.com; Sumit Semwal sumit.semwal@linaro.org; Leon Romanovsky leon@kernel.org; Vetter, Daniel daniel.vetter@intel.com; Christian Koenig christian.koenig@amd.com; dri-devel@lists.freedesktop.org Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier and ZONE_DEVICE to support shared virtual address space and page migration between system memory and device memory. HMM doesn't support pinning device memory because pages located on device must be able to migrate to system memory when accessed by CPU. Peer-to-peer access is possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.
peer-peer access is currently not possible with hmm_range_fault().
Currently hmm_range_fault() always sets the cpu access flag and device private pages are migrated to the system RAM in the fault handler. However, it's possible to have a modified code flow to keep the device private page info for use with peer to peer access.
Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
But pfn is supposed to be all that is needed.
Needed for what? The PFN of the DEVICE_PRIVATE pages is useless for anything.
Hmm. I thought the pfn corresponds to the address in the BAR range. I could be wrong here.
Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the other parts need to be settled out first.
At the RDMA side, we mainly want to check if the changes are acceptable. For example, the part about adding 'fd' to the device ops and the ioctl interface. All the previous comments are very helpful for us to refine the patch so that we can be ready when GPU side support becomes available.
Well, I'm not totally happy with the way the umem and the fd is handled so roughly and incompletely..
Yes, this feedback is very helpful. Will work on improving the code.
Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the locking might be a bit tricky though)
The main issue is that not all NICs support ODP.
Sure, but there is lots of infrastructure work here to be done on dma buf, having a correct consumer in the form of ODP might be helpful to advance it.
Good point. Thanks.
Jason
On Tue, Jun 30, 2020 at 08:08:46PM +0000, Xiong, Jianxin wrote:
From: Jason Gunthorpe jgg@ziepe.ca Sent: Tuesday, June 30, 2020 12:17 PM To: Xiong, Jianxin jianxin.xiong@intel.com Cc: linux-rdma@vger.kernel.org; Doug Ledford dledford@redhat.com; Sumit Semwal sumit.semwal@linaro.org; Leon Romanovsky leon@kernel.org; Vetter, Daniel daniel.vetter@intel.com; Christian Koenig christian.koenig@amd.com; dri-devel@lists.freedesktop.org Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier and ZONE_DEVICE to support shared virtual address space and page migration between system memory and device memory. HMM doesn't support pinning device memory because pages located on device must be able to migrate to system memory when accessed by CPU. Peer-to-peer access is possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.
peer-peer access is currently not possible with hmm_range_fault().
Currently hmm_range_fault() always sets the cpu access flag and device private pages are migrated to the system RAM in the fault handler. However, it's possible to have a modified code flow to keep the device private page info for use with peer to peer access.
Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
But pfn is supposed to be all that is needed.
Needed for what? The PFN of the DEVICE_PRIVATE pages is useless for anything.
Hmm. I thought the pfn corresponds to the address in the BAR range. I could be wrong here.
No, DEVICE_PRIVATE is a dummy pfn to empty address space.
Jason
Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
-----Original Message----- From: Jason Gunthorpe jgg@ziepe.ca Sent: Tuesday, June 30, 2020 10:35 AM To: Xiong, Jianxin jianxin.xiong@intel.com Cc: linux-rdma@vger.kernel.org; Doug Ledford dledford@redhat.com; Sumit Semwal sumit.semwal@linaro.org; Leon Romanovsky leon@kernel.org; Vetter, Daniel daniel.vetter@intel.com; Christian Koenig christian.koenig@amd.com Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier and ZONE_DEVICE to support shared virtual address space and page migration between system memory and device memory. HMM doesn't support pinning device memory because pages located on device must be able to migrate to system memory when accessed by CPU. Peer-to-peer access is possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.
peer-peer access is currently not possible with hmm_range_fault().
Currently hmm_range_fault() always sets the cpu access flag and device private pages are migrated to the system RAM in the fault handler. However, it's possible to have a modified code flow to keep the device private page info for use with peer to peer access.
Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
But pfn is supposed to be all that is needed.
So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
That's right, the patch alone is just half of the story. The functionality depends on availability of dma-buf exporter that can pin the device memory.
Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the other parts need to be settled out first.
At the RDMA side, we mainly want to check if the changes are acceptable. For example, the part about adding 'fd' to the device ops and the ioctl interface. All the previous comments are very helpful for us to refine the patch so that we can be ready when GPU side support becomes available.
The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to support non-dynamic mapping?
We are working in that direction.
migrate to system RAM. This is due to the lack of knowledge about whether the importer can perform peer-to-peer access and the lack of resource limit control measure for GPU. For the first part, the latest dma-buf driver has a peer-to-peer flag for the importer, but the flag is currently tied to dynamic mapping support, which requires on-demand paging support from the NIC to work.
ODP for DMA buf?
Right.
Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the locking might be a bit tricky though)
The main issue is that not all NICs support ODP.
You don't need on-demand paging support from the NIC for dynamic mapping to work.
All you need is the ability to stop and wait for ongoing accesses to end and make sure that new ones grab a new mapping.
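As a rough illustration of what such a dynamic importer could look like on the RDMA side (not code from this patch set; struct rdma_dmabuf_mr, rdma_mr_block_hw_access() and the remap work are made-up placeholders, only dma_buf_attach_ops / move_notify / dma_buf_unmap_attachment are the real dma-buf interfaces):

#include <linux/dma-buf.h>
#include <linux/workqueue.h>

/* Hypothetical per-MR state for a dma-buf backed MR. */
struct rdma_dmabuf_mr {
    struct dma_buf_attachment *attach;
    struct sg_table *sgt;
    bool invalidated;
    struct work_struct remap_work;   /* re-acquires the mapping later */
};

/* Device specific, assumed to exist for this sketch. */
void rdma_mr_block_hw_access(struct rdma_dmabuf_mr *mr);

/* Called by the exporter, with the reservation lock held, before it moves
 * the buffer. */
static void rdma_dmabuf_move_notify(struct dma_buf_attachment *attach)
{
    struct rdma_dmabuf_mr *mr = attach->importer_priv;

    /* stop HW access to the old pages (device specific) */
    rdma_mr_block_hw_access(mr);

    /* drop the stale mapping while the reservation lock is still held */
    dma_buf_unmap_attachment(attach, mr->sgt, DMA_BIDIRECTIONAL);
    mr->sgt = NULL;
    mr->invalidated = true;

    /* re-acquire the mapping later, from process context */
    schedule_work(&mr->remap_work);
}

static const struct dma_buf_attach_ops rdma_dmabuf_attach_ops = {
    .move_notify = rdma_dmabuf_move_notify,
};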
Apart from that this is a rather interesting work.
Regards, Christian.
There are a few possible ways to address these issues, such as decoupling peer-to-peer flag from dynamic mapping, allowing more leeway for individual drivers to make the pinning decision and adding GPU resource limit control via cgroup. We would like to get comments on this patch series with the assumption that device memory pinning via dma-buf is supported by some GPU drivers, and at the same time welcome open discussions on how to address the aforementioned issues as well as GPU-NIC peer-to-peer access solutions in general.
These seem like DMA buf problems, not RDMA problems, why are you asking these questions with a RDMA patch set? The usual DMA buf people are not even Cc'd here.
The intention is to have people from both the RDMA and DMA buffer sides comment. Sumit Semwal is the DMA buffer maintainer according to the MAINTAINERS file. I agree more people could be invited to the discussion. Just added Christian Koenig to the cc-list.
Would be good to have added the drm lists too
Thanks, cc'd dri-devel here, and will also do the same for the previous part of the thread.
If the umem_description you mentioned is for information used to create the umem (e.g. a structure for all the parameters), then this would work better.
It would make some more sense, and avoid all these weird EOPNOTSUPPS.
Good, thanks for the suggestion.
Jason
Either my mailer ate half the thread or it's still stuck somewhere, so jumping in the middle a bit.
On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
-----Original Message----- From: Jason Gunthorpe jgg@ziepe.ca Sent: Tuesday, June 30, 2020 10:35 AM To: Xiong, Jianxin jianxin.xiong@intel.com Cc: linux-rdma@vger.kernel.org; Doug Ledford dledford@redhat.com; Sumit Semwal sumit.semwal@linaro.org; Leon Romanovsky leon@kernel.org; Vetter, Daniel daniel.vetter@intel.com; Christian Koenig christian.koenig@amd.com Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier and ZONE_DEVICE to support shared virtual address space and page migration between system memory and device memory. HMM doesn't support pinning device memory because pages located on device must be able to migrate to system memory when accessed by CPU. Peer-to-peer access is possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.
peer-peer access is currently not possible with hmm_range_fault().
Currently hmm_range_fault() always sets the cpu access flag and device private pages are migrated to the system RAM in the fault handler. However, it's possible to have a modified code flow to keep the device private page info for use with peer to peer access.
Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
But pfn is supposed to be all that is needed.
So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
That's right, the patch alone is just half of the story. The functionality depends on availability of dma-buf exporter that can pin the device memory.
Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the other parts need to be settled out first.
At the RDMA side, we mainly want to check if the changes are acceptable. For example, the part about adding 'fd' to the device ops and the ioctl interface. All the previous comments are very helpful for us to refine the patch so that we can be ready when GPU side support becomes available.
The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to support non-dynamic mapping?
We are working in that direction.
migrate to system RAM. This is due to the lack of knowledge about whether the importer can perform peer-to-peer access and the lack of resource limit control measure for GPU. For the first part, the latest dma-buf driver has a peer-to-peer flag for the importer, but the flag is currently tied to dynamic mapping support, which requires on-demand paging support from the NIC to work.
ODP for DMA buf?
Right.
Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the locking might be a bit tricky though)
The main issue is that not all NICs support ODP.
You don't need on-demand paging support from the NIC for dynamic mapping to work.
All you need is the ability to stop and wait for ongoing accesses to end and make sure that new ones grab a new mapping.
So having no clue about rdma myself much, this sounds rather interesting. Sure it would result in immediately re-acquiring the pages, but that's also really all we need to be able to move buffers around on the gpu side. And with dma_resv_lock there's no livelock risk if the NIC immediately starts a kthread/work_struct which reacquires all the dma-buf and everything else it needs. Plus also with the full ww_mutex deadlock backoff dance there's no locking issues with having to acquire an entire pile of dma_resv_lock, that's natively supported (gpus very much need to be able to lock arbitrary set of buffers).
And I think if that would allow us to avoid the entire "avoid random drivers pinning dma-buf into vram" discussions, much better and quicker to land something like that.
I guess the big question is going to be how to fit this into rdma, since the ww_mutex deadlock backoff dance needs to be done at a fairly high level. For gpu drivers it's always done at the top level ioctl entry point.
Apart from that this is a rather interesting work.
Regards, Christian.
There are a few possible ways to address these issues, such as decoupling peer-to-peer flag from dynamic mapping, allowing more leeway for individual drivers to make the pinning decision and adding GPU resource limit control via cgroup. We would like to get comments on this patch series with the assumption that device memory pinning via dma-buf is supported by some GPU drivers, and at the same time welcome open discussions on how to address the aforementioned issues as well as GPU-NIC peer-to-peer access solutions in general.
These seem like DMA buf problems, not RDMA problems, why are you asking these questions with a RDMA patch set? The usual DMA buf people are not even Cc'd here.
The intention is to have people from both the RDMA and DMA buffer sides comment. Sumit Semwal is the DMA buffer maintainer according to the MAINTAINERS file. I agree more people could be invited to the discussion. Just added Christian Koenig to the cc-list.
MAINTAINERS also says to cc an entire pile of mailing lists, where the usual suspects (including Christian and me) hang around. Is that the reason I got only like half the thread here?
For next time around, really include everyone relevant here please. -Daniel
Would be good to have added the drm lists too
Thanks, cc'd dri-devel here, and will also do the same for the previous part of the thread.
If the umem_description you mentioned is for information used to create the umem (e.g. a structure for all the parameters), then this would work better.
It would make some more sense, and avoid all these weird EOPNOTSUPPS.
Good, thanks for the suggestion.
Jason
On Wed, Jul 01, 2020 at 02:07:44PM +0200, Daniel Vetter wrote:
Either my mailer ate half the thread or it's still stuck somewhere, so jumping in the middle a bit.
On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
-----Original Message----- From: Jason Gunthorpe jgg@ziepe.ca Sent: Tuesday, June 30, 2020 10:35 AM To: Xiong, Jianxin jianxin.xiong@intel.com Cc: linux-rdma@vger.kernel.org; Doug Ledford dledford@redhat.com; Sumit Semwal sumit.semwal@linaro.org; Leon Romanovsky leon@kernel.org; Vetter, Daniel daniel.vetter@intel.com; Christian Koenig christian.koenig@amd.com Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier and ZONE_DEVICE to support shared virtual address space and page migration between system memory and device memory. HMM doesn't support pinning device memory because pages located on device must be able to migrate to system memory when accessed by CPU. Peer-to-peer access is possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.

peer-peer access is currently not possible with hmm_range_fault().
Currently hmm_range_fault() always sets the cpu access flag and device private pages are migrated to the system RAM in the fault handler. However, it's possible to have a modified code flow to keep the device private page info for use with peer to peer access.
Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
But pfn is supposed to be all that is needed.
So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
That's right, the patch alone is just half of the story. The functionality depends on availability of dma-buf exporter that can pin the device memory.
Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the other parts need to be settled out first.
At the RDMA side, we mainly want to check if the changes are acceptable. For example, the part about adding 'fd' to the device ops and the ioctl interface. All the previous comments are very helpful for us to refine the patch so that we can be ready when GPU side support becomes available.
The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to support non-dynamic mapping?
We are working in that direction.
migrate to system RAM. This is due to the lack of knowledge about whether the importer can perform peer-to-peer access and the lack of resource limit control measure for GPU. For the first part, the latest dma-buf driver has a peer-to-peer flag for the importer, but the flag is currently tied to dynamic mapping support, which requires on-demand paging support from the NIC to work.

ODP for DMA buf?
Right.
Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the locking might be a bit tricky though)
The main issue is that not all NICs support ODP.
You don't need on-demand paging support from the NIC for dynamic mapping to work.
All you need is the ability to stop and wait for ongoing accesses to end and make sure that new ones grab a new mapping.
So having no clue about rdma myself much, this sounds rather interesting. Sure it would result in immediately re-acquiring the pages, but that's also really all we need to be able to move buffers around on the gpu side. And with dma_resv_lock there's no livelock risk if the NIC immediately starts a kthread/work_struct which reacquires all the dma-buf and everything else it needs. Plus also with the full ww_mutex deadlock backoff dance there's no locking issues with having to acquire an entire pile of dma_resv_lock, that's natively supported (gpus very much need to be able to lock arbitrary set of buffers).
And I think if that would allow us to avoid the entire "avoid random drivers pinning dma-buf into vram" discussions, much better and quicker to land something like that.
I guess the big question is going to be how to fit this into rdma, since the ww_mutex deadlock backoff dance needs to be done at a fairly high level. For gpu drivers it's always done at the top level ioctl entry point.
Also, just to alleviate fears: I think all that dynamic dma-buf stuff for rdma should be doable this way _without_ having to interact with dma_fence. Avoiding that I think is the biggest request Jason has in this area :-)
Furthermore, it is officially ok to allocate memory while holding a dma_resv_lock. What is not ok (and might cause issues if you somehow mix up things in strange ways) is taking a userspace fault, because gpu drivers must be able to take the dma_resv_lock in their fault handlers. That might pose a problem.
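A minimal sketch of that rule, with a hypothetical importer_update() helper (only dma_resv_lock()/dma_resv_unlock() and the lockdep/might_fault() machinery are real): copy from userspace before taking the reservation, allocate freely while holding it.

#include <linux/dma-resv.h>
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/string.h>

/* Hypothetical importer path updating per-buffer state from an ioctl. */
static int importer_update(struct dma_resv *resv, void __user *uptr, size_t len)
{
    void *args;

    /* Copying from userspace may fault, so do it before taking the lock. */
    args = memdup_user(uptr, len);
    if (IS_ERR(args))
        return PTR_ERR(args);

    dma_resv_lock(resv, NULL);

    /* GFP_KERNEL allocations are fine while the reservation is held ... */
    /* ... but a copy_from_user()/copy_to_user() here could fault, and the
     * GPU fault handlers also need dma_resv_lock, so lockdep and
     * might_fault() will complain. */

    /* ... update mappings / add fences ... */

    dma_resv_unlock(resv);
    kfree(args);
    return 0;
}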
Also, all these rules are now enforced by lockdep, might_fault() and similar checks. -Daniel
Apart from that this is a rather interesting work.
Regards, Christian.
There are a few possible ways to address these issues, such as decoupling peer-to-peer flag from dynamic mapping, allowing more leeway for individual drivers to make the pinning decision and adding GPU resource limit control via cgroup. We would like to get comments on this patch series with the assumption that device memory pinning via dma-buf is supported by some GPU drivers, and at the same time welcome open discussions on how to address the aforementioned issues as well as GPU-NIC peer-to-peer access solutions in general.

These seem like DMA buf problems, not RDMA problems, why are you asking these questions with a RDMA patch set? The usual DMA buf people are not even Cc'd here.
The intention is to have people from both the RDMA and DMA buffer sides comment. Sumit Semwal is the DMA buffer maintainer according to the MAINTAINERS file. I agree more people could be invited to the discussion. Just added Christian Koenig to the cc-list.
MAINTAINERS also says to cc an entire pile of mailing lists, where the usual suspects (including Christian and me) hang around. Is that the reason I got only like half the thread here?
For next time around, really include everyone relevant here please. -Daniel
Would be good to have added the drm lists too
Thanks, cc'd dri-devel here, and will also do the same for the previous part of the thread.
If the umem_description you mentioned is for information used to create the umem (e.g. a structure for all the parameters), then this would work better.
It would make some more sense, and avoid all these weird EOPNOTSUPPS.
Good, thanks for the suggestion.
Jason
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
From: Jason Gunthorpe jgg@ziepe.ca Sent: Tuesday, June 30, 2020 10:35 AM To: Xiong, Jianxin jianxin.xiong@intel.com Cc: linux-rdma@vger.kernel.org; Doug Ledford dledford@redhat.com; Sumit Semwal sumit.semwal@linaro.org; Leon Romanovsky leon@kernel.org; Vetter, Daniel daniel.vetter@intel.com; Christian Koenig christian.koenig@amd.com Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier and ZONE_DEVICE to support shared virtual address space and page migration between system memory and device memory. HMM doesn't support pinning device memory because pages located on device must be able to migrate to system memory when accessed by CPU. Peer-to-peer access is possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.
peer-peer access is currently not possible with hmm_range_fault().
Currently hmm_range_fault() always sets the cpu access flag and device private pages are migrated to the system RAM in the fault handler. However, it's possible to have a modified code flow to keep the device private page info for use with peer to peer access.
Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
But pfn is supposed to be all that is needed.
So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
That's right, the patch alone is just half of the story. The functionality depends on availability of dma-buf exporter that can pin the device memory.
Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the other parts need to be settled out first.
At the RDMA side, we mainly want to check if the changes are acceptable. For example, the part about adding 'fd' to the device ops and the ioctl interface. All the previous comments are very helpful for us to refine the patch so that we can be ready when GPU side support becomes available.
The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to support non-dynamic mapping?
We are working in that direction.
migrate to system RAM. This is due to the lack of knowledge about whether the importer can perform peer-to-peer access and the lack of resource limit control measure for GPU. For the first part, the latest dma-buf driver has a peer-to-peer flag for the importer, but the flag is currently tied to dynamic mapping support, which requires on-demand paging support from the NIC to work.
ODP for DMA buf?
Right.
Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the locking might be a bit tricky though)
The main issue is that not all NICs support ODP.
You don't need on-demand paging support from the NIC for dynamic mapping to work.
All you need is the ability to stop and wait for ongoing accesses to end and make sure that new ones grab a new mapping.
Swap and flush isn't a general HW ability either..
I'm unclear how this could be useful, it is guaranteed to corrupt in-progress writes?
Did you mean pause, swap and resume? That's ODP.
Jason
Am 01.07.20 um 14:39 schrieb Jason Gunthorpe:
On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
From: Jason Gunthorpe jgg@ziepe.ca Sent: Tuesday, June 30, 2020 10:35 AM To: Xiong, Jianxin jianxin.xiong@intel.com Cc: linux-rdma@vger.kernel.org; Doug Ledford dledford@redhat.com; Sumit Semwal sumit.semwal@linaro.org; Leon Romanovsky leon@kernel.org; Vetter, Daniel daniel.vetter@intel.com; Christian Koenig christian.koenig@amd.com Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier and ZONE_DEVICE to support shared virtual address space and page migration between system memory and device memory. HMM doesn't support pinning device memory because pages located on device must be able to migrate to system memory when accessed by CPU. Peer-to-peer access is possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.

peer-peer access is currently not possible with hmm_range_fault().
Currently hmm_range_fault() always sets the cpu access flag and device private pages are migrated to the system RAM in the fault handler. However, it's possible to have a modified code flow to keep the device private page info for use with peer to peer access.
Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
But pfn is supposed to be all that is needed.
So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?
That's right, the patch alone is just half of the story. The functionality depends on availability of dma-buf exporter that can pin the device memory.
Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the other parts need to be settled out first.
At the RDMA side, we mainly want to check if the changes are acceptable. For example, the part about adding 'fd' to the device ops and the ioctl interface. All the previous comments are very helpful for us to refine the patch so that we can be ready when GPU side support becomes available.
The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to support non-dynamic mapping?
We are working in that direction.
migrate to system RAM. This is due to the lack of knowledge about whether the importer can perform peer-to-peer access and the lack of resource limit control measure for GPU. For the first part, the latest dma-buf driver has a peer-to-peer flag for the importer, but the flag is currently tied to dynamic mapping support, which requires on-demand paging support from the NIC to work.

ODP for DMA buf?
Right.
Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the locking might be a bit tricky though)
The main issue is that not all NICs support ODP.
You don't need on-demand paging support from the NIC for dynamic mapping to work.
All you need is the ability to stop and wait for ongoing accesses to end and make sure that new ones grab a new mapping.
Swap and flush isn't a general HW ability either..
I'm unclear how this could be useful, it is guaranteed to corrupt in-progress writes?
Did you mean pause, swap and resume? That's ODP.
Yes, something like this. And good to know, never heard of ODP.
On the GPU side we can pipeline things, e.g. you can program the hardware that page tables are changed at a certain point in time.
So what we do when we get a notification that a buffer will move around is to mark this buffer in our structures as invalid and return a fence so that the exporter is able to wait for ongoing stuff to finish.
The actual move then happens only after the ongoing operations on the GPU are finished and on the next operation we grab the new location of the buffer and re-program the page tables to it.
This way all the CPU does is really just planning asynchronous page table changes which are executed on the GPU later on.
You can of course do it synchronized as well, but this would hurt our performance pretty badly.
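A very rough sketch of that GPU-importer pattern (struct gpu_bo and its fields are placeholders, not a real driver structure; the point is only that move_notify itself stays cheap and the fences of in-flight jobs already live in the buffer's dma_resv):

#include <linux/dma-buf.h>

/* Placeholder importer-side bookkeeping. */
struct gpu_bo {
    bool mapping_valid;
};

/* The reservation lock is held by the exporter when this is called. */
static void gpu_importer_move_notify(struct dma_buf_attachment *attach)
{
    struct gpu_bo *bo = attach->importer_priv;

    /* Only mark the mapping invalid here; in-flight jobs keep their fences
     * in attach->dmabuf->resv, so the exporter waits for them before the
     * actual move.  The next submission sees !mapping_valid, calls
     * dma_buf_map_attachment() again and queues an asynchronous page table
     * update in front of its own work. */
    bo->mapping_valid = false;
}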
Regards, Christian.
Jason
On Wed, Jul 1, 2020 at 2:56 PM Christian König christian.koenig@amd.com wrote:
Am 01.07.20 um 14:39 schrieb Jason Gunthorpe:
On Wed, Jul 01, 2020 at 11:03:06AM +0200, Christian König wrote:
Am 30.06.20 um 20:46 schrieb Xiong, Jianxin:
From: Jason Gunthorpe jgg@ziepe.ca Sent: Tuesday, June 30, 2020 10:35 AM To: Xiong, Jianxin jianxin.xiong@intel.com Cc: linux-rdma@vger.kernel.org; Doug Ledford dledford@redhat.com; Sumit Semwal sumit.semwal@linaro.org; Leon Romanovsky leon@kernel.org; Vetter, Daniel daniel.vetter@intel.com; Christian Koenig christian.koenig@amd.com Subject: Re: [RFC PATCH v2 0/3] RDMA: add dma-buf support
On Tue, Jun 30, 2020 at 05:21:33PM +0000, Xiong, Jianxin wrote:
Heterogeneous Memory Management (HMM) utilizes mmu_interval_notifier and ZONE_DEVICE to support shared virtual address space and page migration between system memory and device memory. HMM doesn't support pinning device memory because pages located on device must be able to migrate to system memory when accessed by CPU. Peer-to-peer access is possible if the peer can handle page fault. For RDMA, that means the NIC must support on-demand paging.

peer-peer access is currently not possible with hmm_range_fault().

Currently hmm_range_fault() always sets the cpu access flag and device private pages are migrated to the system RAM in the fault handler. However, it's possible to have a modified code flow to keep the device private page info for use with peer to peer access.
Sort of, but only within the same device, RDMA or anything else generic can't reach inside a DEVICE_PRIVATE and extract anything useful.
But pfn is supposed to be all that is needed.
So.. this patch doesn't really do anything new? We could just make a MR against the DMA buf mmap and get to the same place?

That's right, the patch alone is just half of the story. The functionality depends on availability of dma-buf exporter that can pin the device memory.
Well, what do you want to happen here? The RDMA parts are reasonable, but I don't want to add new functionality without a purpose - the other parts need to be settled out first.
At the RDMA side, we mainly want to check if the changes are acceptable. For example, the part about adding 'fd' to the device ops and the ioctl interface. All the previous comments are very helpful for us to refine the patch so that we can be ready when GPU side support becomes available.
The need for the dynamic mapping support for even the current DMA Buf hacky P2P users is really too bad. Can you get any GPU driver to support non-dynamic mapping?
We are working in that direction.
migrate to system RAM. This is due to the lack of knowledge about whether the importer can perform peer-to-peer access and the lack of resource limit control measure for GPU. For the first part, the latest dma-buf driver has a peer-to-peer flag for the importer, but the flag is currently tied to dynamic mapping support, which requires on-demand paging support from the NIC to work.

ODP for DMA buf?

Right.
Hum. This is not actually so hard to do. The whole dma buf proposal would make a lot more sense if the 'dma buf MR' had to be the dynamic kind and the driver had to provide the faulting. It would not be so hard to change mlx5 to be able to work like this, perhaps. (the locking might be a bit tricky though)
The main issue is that not all NICs support ODP.
You don't need on-demand paging support from the NIC for dynamic mapping to work.
All you need is the ability to stop and wait for ongoing accesses to end and make sure that new ones grab a new mapping.
Swap and flush isn't a general HW ability either..
I'm unclear how this could be useful, it is guaranteed to corrupt in-progress writes?
Did you mean pause, swap and resume? That's ODP.
Yes, something like this. And good to know, never heard of ODP.
Hm I thought ODP was full hw page faults at an individual page level, and this stop&resume is for the entire nic. Under the hood both apply back-pressure on the network if a transmission can't be received, but I thought the ODP one is a lot more fine-grained. For this dma_buf stop-the-world approach we'd also need to guarantee _all_ buffers are present again (without hw page faults on the nic side of things), which has some additional locking implications.
On the GPU side we can pipeline things, e.g. you can program the hardware that page tables are changed at a certain point in time.
So what we do when we get a notification that a buffer will move around is to mark this buffer in our structures as invalid and return a fence so that the exporter is able to wait for ongoing stuff to finish.
The actual move then happens only after the ongoing operations on the GPU are finished and on the next operation we grab the new location of the buffer and re-program the page tables to it.
This way all the CPU does is really just planning asynchronous page table changes which are executed on the GPU later on.
You can of course do it synchronized as well, but this would hurt our performance pretty badly.
So since Jason really doesn't like dma_fence much I think for rdma synchronous it is. And it shouldn't really matter, since waiting for a small transaction to complete at rdma wire speed isn't really that long an operation. Should be much faster than waiting for the gpu to complete quite sizeable amounts of rendering first. Plus with real hw page faults it's really just pte write plus tlb flush, that should definitely be possible from the ->move_notify hook. But even the global stop-the-world approach I expect is going to be much faster than a preempt request on a gpu. -Daniel
Regards, Christian.
Jason
On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
All you need is the ability to stop and wait for ongoing accesses to end and make sure that new ones grab a new mapping.
Swap and flush isn't a general HW ability either..
I'm unclear how this could be useful, it is guaranteed to corrupt in-progress writes?
Did you mean pause, swap and resume? That's ODP.
Yes, something like this. And good to know, never heard of ODP.
Hm I thought ODP was full hw page faults at an individual page level,
Yes
and this stop&resume is for the entire nic. Under the hood both apply back-pressure on the network if a transmission can't be received, but
NIC's don't do stop and resume, blocking the Rx pipe is very problematic and performance destroying.
The strategy for something like ODP is more complex, and so far no NIC has deployed it at any granularity larger than per-page.
So since Jason really doesn't like dma_fence much I think for rdma synchronous it is. And it shouldn't really matter, since waiting for a small transaction to complete at rdma wire speed isn't really that long an operation.
Even if DMA fence were to somehow be involved, how would it look?
Jason
On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
All you need is the ability to stop and wait for ongoing accesses to end and make sure that new ones grab a new mapping.
Swap and flush isn't a general HW ability either..
I'm unclear how this could be useful, it is guaranteed to corrupt in-progress writes?
Did you mean pause, swap and resume? That's ODP.
Yes, something like this. And good to know, never heard of ODP.
Hm I thought ODP was full hw page faults at an individual page level,
Yes
and this stop&resume is for the entire nic. Under the hood both apply back-pressure on the network if a transmission can't be received, but
NIC's don't do stop and resume, blocking the Rx pipe is very problematic and performance destroying.
The strategy for something like ODP is more complex, and so far no NIC has deployed it at any granularity larger than per-page.
So since Jason really doesn't like dma_fence much I think for rdma synchronous it is. And it shouldn't really matter, since waiting for a small transaction to complete at rdma wire speed isn't really that long an operation.
Even if DMA fence were to somehow be involved, how would it look?
Well above you're saying it would be performance destroying, but let's pretend that's not a problem :-) Also, I have no clue about rdma, so this is really just the flow we have on the gpu side.
0. rdma driver maintains a list of all dma-buf that it has mapped somewhere and is currently using for transactions
1. rdma driver gets a dma_buf->notify_move callback on one of these buffers. To handle that it:
   1. stops hw access somehow at the rx
   2. flushes caches and whatever else is needed
   3. moves the unmapped buffer on a special list or marks it in some different way as unavailable
   4. launch the kthread/work_struct to fix everything back up
2. dma-buf export (gpu driver) can now issue the commands to move the buffer around
3. rdma driver worker gets busy to restart rx:
   1. lock all dma-buf that are currently in use (dma_resv_lock). thanks to ww_mutex deadlock avoidance this is possible
   2. for any buffers which have been marked as unavailable in 1.3 grab a new mapping (which might now be in system memory, or again peer2peer but different address)
   3. restart hw and rx
   4. unlock all dma-buf locks (dma_resv_unlock)
There is a minor problem because step 2 only queues up the entire buffer moves behind a pile of dma_fence, and atm we haven't made it absolutely clear who's responsible for waiting for those to complete. For gpu drivers it's the importer since gpu drivers don't have big qualms about dma_fences, so 3.2. would perhaps also include a dma_fence_wait to make sure the buffer move has actually completed.
Above flow is more or less exactly what happens for gpu workloads where we can preempt running computations. Instead of stopping rx we preempt the compute job and remove it from the scheduler queues, and instead of restarting rx we just put the compute job back onto the scheduler queue as eligible for gpu time. Otherwise it's exactly the same stuff.
Of course if you only have a single compute job and too many such interruptions, then performance is going to tank. Don't do that, instead make sure you have enough vram or system memory or whatever :-)
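A sketch of step 3 for a single invalidated buffer, reusing the made-up rdma_dmabuf_mr bookkeeping from the earlier sketch in this thread (rdma_mr_resume_hw() is hypothetical; the dma-buf/dma-resv calls are real, and the fence wait covers the "dma_fence_wait in 3.2" point above; the multi-buffer ww_mutex variant comes up later in the thread):

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/err.h>
#include <linux/sched.h>

static int rdma_dmabuf_remap_one(struct rdma_dmabuf_mr *mr)
{
    struct dma_buf *dmabuf = mr->attach->dmabuf;
    struct sg_table *sgt;
    long ret;

    dma_resv_lock(dmabuf->resv, NULL);

    /* 3.2: grab the new mapping (system memory, or peer2peer at a new address) */
    sgt = dma_buf_map_attachment(mr->attach, DMA_BIDIRECTIONAL);
    if (IS_ERR(sgt)) {
        dma_resv_unlock(dmabuf->resv);
        return PTR_ERR(sgt);
    }

    /* The exporter may only have queued the move behind fences, so wait for
     * them before letting the NIC touch the new location. */
    ret = dma_resv_wait_timeout_rcu(dmabuf->resv, true, false,
                                    MAX_SCHEDULE_TIMEOUT);
    if (ret < 0) {
        dma_buf_unmap_attachment(mr->attach, sgt, DMA_BIDIRECTIONAL);
        dma_resv_unlock(dmabuf->resv);
        return ret;
    }

    mr->sgt = sgt;
    mr->invalidated = false;
    rdma_mr_resume_hw(mr);      /* 3.3: restart hw and rx (hypothetical helper) */

    dma_resv_unlock(dmabuf->resv);
    return 0;
}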
Cheers, Daniel
On Thu, Jul 02, 2020 at 03:10:00PM +0200, Daniel Vetter wrote:
On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
All you need is the ability to stop and wait for ongoing accesses to end and make sure that new ones grab a new mapping.
Swap and flush isn't a general HW ability either..
I'm unclear how this could be useful, it is guaranteed to corrupt in-progress writes?
Did you mean pause, swap and resume? That's ODP.
Yes, something like this. And good to know, never heard of ODP.
Hm I thought ODP was full hw page faults at an individual page level,
Yes
and this stop&resume is for the entire nic. Under the hood both apply back-pressure on the network if a transmission can't be received, but
NIC's don't do stop and resume, blocking the Rx pipe is very problematic and performance destroying.
The strategy for something like ODP is more complex, and so far no NIC has deployed it at any granularity larger than per-page.
So since Jason really doesn't like dma_fence much I think for rdma synchronous it is. And it shouldn't really matter, since waiting for a small transaction to complete at rdma wire speed isn't really that long an operation.
Even if DMA fence were to somehow be involved, how would it look?
Well above you're saying it would be performance destroying, but let's pretend that's not a problem :-) Also, I have no clue about rdma, so this is really just the flow we have on the gpu side.
I see, no, this is not workable, the command flow in RDMA is not at all like GPU - what you are proposing is a global 'stop the whole chip' of the Tx and Rx flows for an undetermined time. Not feasible
What we can do is use ODP techniques and pause only the MR attached to the DMA buf with the process you outline below. This is not so hard to implement.
- rdma driver worker gets busy to restart rx:
  1. lock all dma-buf that are currently in use (dma_resv_lock). thanks to ww_mutex deadlock avoidance this is possible
Why all? Why not just lock the one that was invalidated to restore the mappings? That is some artifact of the GPU approach?
And why is this done with work queues and locking instead of a callback saying the buffer is valid again?
Jason
Am 02.07.20 um 15:29 schrieb Jason Gunthorpe:
On Thu, Jul 02, 2020 at 03:10:00PM +0200, Daniel Vetter wrote:
On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
All you need is the ability to stop and wait for ongoing accesses to end and make sure that new ones grab a new mapping.

Swap and flush isn't a general HW ability either..
I'm unclear how this could be useful, it is guaranteed to corrupt in-progress writes?
Did you mean pause, swap and resume? That's ODP.
Yes, something like this. And good to know, never heard of ODP.
Hm I thought ODP was full hw page faults at an individual page level,
Yes
and this stop&resume is for the entire nic. Under the hood both apply back-pressure on the network if a transmission can't be received, but
NIC's don't do stop and resume, blocking the Rx pipe is very problematic and performance destroying.
The strategy for something like ODP is more complex, and so far no NIC has deployed it at any granularity larger than per-page.
So since Jason really doesn't like dma_fence much I think for rdma synchronous it is. And it shouldn't really matter, since waiting for a small transaction to complete at rdma wire speed isn't really that long an operation.
Even if DMA fence were to somehow be involved, how would it look?
Well above you're saying it would be performance destroying, but let's pretend that's not a problem :-) Also, I have no clue about rdma, so this is really just the flow we have on the gpu side.
I see, no, this is not workable, the command flow in RDMA is not at all like GPU - what you are proposing is a global 'stop the whole chip' of the Tx and Rx flows for an undetermined time. Not feasible
What we can do is use ODP techniques and pause only the MR attached to the DMA buf with the process you outline below. This is not so hard to implement.
Well it boils down to only two requirements:
1. You can stop accessing the memory or addresses exported by the DMA-buf.
2. Before the next access you need to acquire a new mapping.
How you do this is perfectly up to you. E.g. you can stop everything, just prevent access to this DMA-buf, or just pause the users of this DMA-buf....
- rdma driver worker gets busy to restart rx:
  1. lock all dma-buf that are currently in use (dma_resv_lock). thanks to ww_mutex deadlock avoidance this is possible
Why all? Why not just lock the one that was invalidated to restore the mappings? That is some artifact of the GPU approach?
No, but you must make sure that mapping one doesn't invalidate others you need.
Otherwise you can end up in a nice live lock :)
And why is this done with work queues and locking instead of a callback saying the buffer is valid again?
You can do this as well, but a work queue is usually easier to handle than a notification in an interrupt context of a foreign driver.
Regards, Christian.
Jason
On Thu, Jul 02, 2020 at 04:50:32PM +0200, Christian König wrote:
Am 02.07.20 um 15:29 schrieb Jason Gunthorpe:
On Thu, Jul 02, 2020 at 03:10:00PM +0200, Daniel Vetter wrote:
On Wed, Jul 01, 2020 at 02:15:24PM -0300, Jason Gunthorpe wrote:
On Wed, Jul 01, 2020 at 05:42:21PM +0200, Daniel Vetter wrote:
All you need is the ability to stop and wait for ongoing accesses to end and make sure that new ones grab a new mapping.

Swap and flush isn't a general HW ability either..

I'm unclear how this could be useful, it is guaranteed to corrupt in-progress writes?

Did you mean pause, swap and resume? That's ODP.

Yes, something like this. And good to know, never heard of ODP.
Hm I thought ODP was full hw page faults at an individual page level,
Yes
and this stop&resume is for the entire nic. Under the hood both apply back-pressure on the network if a transmission can't be received, but
NIC's don't do stop and resume, blocking the Rx pipe is very problematic and performance destroying.
The strategy for something like ODP is more complex, and so far no NIC has deployed it at any granularity larger than per-page.
So since Jason really doesn't like dma_fence much I think for rdma synchronous it is. And it shouldn't really matter, since waiting for a small transaction to complete at rdma wire speed isn't really that long an operation.
Even if DMA fence were to somehow be involved, how would it look?
Well above you're saying it would be performance destroying, but let's pretend that's not a problem :-) Also, I have no clue about rdma, so this is really just the flow we have on the gpu side.
I see, no, this is not workable, the command flow in RDMA is not at all like GPU - what you are proposing is a global 'stop the whole chip' of the Tx and Rx flows for an undetermined time. Not feasible
Yeah, I said I have no clue about rdma :-)
What we can do is use ODP techniques and pause only the MR attached to the DMA buf with the process you outline below. This is not so hard to implement.
Well it boils down to only two requirements:
You can stop accessing the memory or addresses exported by the DMA-buf.
Before the next access you need to acquire a new mapping.
How you do this is perfectly up to you. E.g. you can stop everything, just prevent access to this DMA-buf, or just pause the users of this DMA-buf....
Yeah in a gpu we also don't stop the entire world, only the context that needs the buffer. If there's other stuff to run, we do keep running that. Usually the reason for a buffer move is that we do actually have other stuff that needs to be run, and which needs more vram for itself, so we might have to throw out a few buffers from vram that can also be placed in system memory.
Note that a modern gpu has multiple engines and most have or are gaining hw scheduling of some sort, so there's a pile of concurrent gpu context execution going on at any moment. The days where we just push gpu work into a fifo are (mostly) long gone.
- rdma driver worker gets busy to restart rx:
  1. lock all dma-buf that are currently in use (dma_resv_lock). thanks to ww_mutex deadlock avoidance this is possible
Why all? Why not just lock the one that was invalidated to restore the mappings? That is some artifact of the GPU approach?
No, but you must make sure that mapping one doesn't invalidate others you need.
Otherwise you can end up in a nice live lock :)
Also if you don't have pagefaults, but have to track busy memory at a context level, you do need to grab all locks of all buffers you need, or you'd race. There's nothing stopping a concurrent ->move_notify on some other buffer you'll need otherwise, and if you try to be clever and roll your own locking, you'll anger lockdep: your own lock will have to be on both sides of ww_mutex or it won't work, and that deadlocks.
So ww_mutex multi-lock dance or bust.
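The dance Daniel is describing is essentially what drm_gem_lock_reservations() does in drm today. A hedged sketch of the same pattern applied directly to dma-bufs (the bufs/count arguments are hypothetical; the dma_resv/ww_acquire_ctx calls are the real API) might look like:

#include <linux/dma-buf.h>
#include <linux/dma-resv.h>
#include <linux/ww_mutex.h>

/* Lock the reservation objects of all dma-bufs the worker needs, with
 * ww_mutex deadlock avoidance (-EDEADLK means back off and retry). */
static int lock_all_dmabufs(struct dma_buf **bufs, unsigned int count,
			    struct ww_acquire_ctx *ctx)
{
	int contended = -1;
	unsigned int i;
	int ret;

	ww_acquire_init(ctx, &reservation_ww_class);
retry:
	if (contended != -1) {
		/* Sleep until the buffer we lost against is free, then
		 * take it first on the next pass. */
		ret = dma_resv_lock_slow_interruptible(bufs[contended]->resv,
						       ctx);
		if (ret)
			goto err_fini;
	}

	for (i = 0; i < count; i++) {
		if ((int)i == contended)
			continue;

		ret = dma_resv_lock_interruptible(bufs[i]->resv, ctx);
		if (ret) {
			unsigned int j;

			/* Drop everything taken so far. */
			for (j = 0; j < i; j++)
				dma_resv_unlock(bufs[j]->resv);
			if (contended != -1 && contended >= (int)i)
				dma_resv_unlock(bufs[contended]->resv);

			if (ret == -EDEADLK) {
				contended = i;
				goto retry;
			}
			goto err_fini;
		}
	}

	ww_acquire_done(ctx);
	return 0;

err_fini:
	ww_acquire_fini(ctx);
	return ret;
}

Unlocking is then just dma_resv_unlock() on every buffer followed by ww_acquire_fini() on the context.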
And why is this done with work queues and locking instead of a callback saying the buffer is valid again?
You can do this as well, but a work queue is usually easier to handle than a notification in an interrupt context of a foreign driver.
Yeah you can just install a dma_fence callback but
- that's hardirq context
- if you don't have per-page hw faults you need the multi-lock ww_mutex dance anyway to avoid races.
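For illustration only: if a dma_fence callback were used, about all it could safely do is kick a worker, precisely because it may run from the fence's irq context. A rough sketch, with remap_ctx and its work item being hypothetical importer-side pieces:

#include <linux/dma-fence.h>
#include <linux/kernel.h>
#include <linux/workqueue.h>

struct remap_ctx {
	struct dma_fence_cb cb;
	struct work_struct work;	/* does the sleeping re-mapping */
};

/* May run from hardirq context when the fence signals: no sleeping
 * locks (dma_resv_lock) allowed here, so just punt to a worker. */
static void remap_fence_cb(struct dma_fence *fence, struct dma_fence_cb *cb)
{
	struct remap_ctx *ctx = container_of(cb, struct remap_ctx, cb);

	schedule_work(&ctx->work);
}

/* Registration, from process context: */
static void arm_remap(struct remap_ctx *ctx, struct dma_fence *fence)
{
	if (dma_fence_add_callback(fence, &ctx->cb, remap_fence_cb))
		schedule_work(&ctx->work);	/* fence already signaled */
}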
On top of that the exporter has no idea whether the importer still really needs the mapping, or whether it was just kept around opportunistically. pte setup isn't free, so by default gpus keep everything mapped. At least for the classic gl/vk execution model, compute like rocm/amdkfd is a bit different here.
Cheers, Daniel
Regards, Christian.
Jason
On Thu, Jul 02, 2020 at 08:15:40PM +0200, Daniel Vetter wrote:
- rdma driver worker gets busy to restart rx:
1. lock all dma-buf that are currently in use (dma_resv_lock); thanks to ww_mutex deadlock avoidance this is possible
Why all? Why not just lock the one that was invalidated to restore the mappings? That is some artifact of the GPU approach?
No, but you must make sure that mapping one doesn't invalidate others you need.
Otherwise you can end up in a nice live lock :)
Also if you don't have pagefaults, but have to track busy memory at a context level, you do need to grab all locks of all buffers you need, or you'd race. There's nothing stopping a concurrent ->move_notify on some other buffer you'll need otherwise, and if you try to be clever and roll your own locking, you'll anger lockdep: your own lock will have to be on both sides of ww_mutex or it won't work, and that deadlocks.
So you are worried about atomically building some multi-buffer transaction? I don't think this applies to RDMA, which isn't going to be transactional here.
And why is this done with work queues and locking instead of a callback saying the buffer is valid again?
You can do this as well, but a work queue is usually easier to handle than a notification in an interrupt context of a foreign driver.
Yeah you can just install a dma_fence callback but
- that's hardirq context
- if you don't have per-page hw faults you need the multi-lock ww_mutex dance anyway to avoid races.
It is still best to avoid the per-page faults and preload the new mapping once it is ready.
Jason
On Fri, Jul 3, 2020 at 2:03 PM Jason Gunthorpe jgg@ziepe.ca wrote:
On Thu, Jul 02, 2020 at 08:15:40PM +0200, Daniel Vetter wrote:
- rdma driver worker gets busy to restart rx:
1. lock all dma-buf that are currently in use (dma_resv_lock); thanks to ww_mutex deadlock avoidance this is possible
Why all? Why not just lock the one that was invalidated to restore the mappings? That is some artifact of the GPU approach?
No, but you must make sure that mapping one doesn't invalidate others you need.
Otherwise you can end up in a nice live lock :)
Also if you don't have pagefaults, but have to track busy memory at a context level, you do need to grab all locks of all buffers you need, or you'd race. There's nothing stopping a concurrent ->move_notify on some other buffer you'll need otherwise, and if you try to be clever and roll your own locking, you'll anger lockdep: your own lock will have to be on both sides of ww_mutex or it won't work, and that deadlocks.
So you are worried about atomically building some multi-buffer transaction? I don't think this applies to RDMA, which isn't going to be transactional here.
So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might happen whenever. Kernel driver doesn't have much if any insight into when memory isn't needed anymore. I think in the rdma world that's called registering memory, but not sure.
- for hw with hw faults you can pull in the memory when it's needed, and if concurrently another cpu is taking away pages and invalidating rdma hw mappings, then that's no problem.
So far so good, zero need for atomic transactions or anything.
But the answer I gave here is for when you don't have per-page hw faulting on the rdma nic, but something that works at a much larger level. For a gpu it would be a compute context, no idea what the equivalent for rdma is. This could go up to and including the entire nic stalling all rx with link level back pressure, but not necessarily.
Once you go above a single page, or well, at least a single dma-buf object, you need some way to know when you have all buffers and memory mappings present again, because you can't restart with only partial memory.
Now if the rdma memory programming is more like traditional networking, where you have a single rx or tx buffer, then yeah you don't need fancy multi-buffer locking. Like I said I have no idea how many of the buffers you need to restart the rdma stuff for hw which doesn't have per-page hw faulting.
And why is this done with work queues and locking instead of a callback saying the buffer is valid again?
You can do this as well, but a work queue is usually easier to handle than a notification in an interrupt context of a foreign driver.
Yeah you can just install a dma_fence callback but
- that's hardirq context
- if you don't have per-page hw faults you need the multi-lock ww_mutex dance anyway to avoid races.
It is still best to avoid the per-page faults and preload the new mapping once it is ready.
Sure, but I think that's entirely orthogonal.
Also even if you don't need the multi-buffer dance (either because hw faults, or because the rdma rx/tx can be stopped at least at a per-buffer level through some other means) then you still need the dma_resv_lock to serialize with the exporter. So needs a sleeping context either way. Hence some worker is needed. -Daniel
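A hedged sketch of the worker Daniel is referring to, reusing the hypothetical struct my_mr from the attach sketch earlier in the thread; the only point is that the re-mapping happens in a sleeping context under dma_resv_lock(), not in the notification path:

static void my_mr_resume_hw_access(struct my_mr *mr);	/* reprogram NIC translation */

static void my_mr_remap_worker(struct work_struct *work)
{
	struct my_mr *mr = container_of(work, struct my_mr, remap_work);
	struct dma_buf_attachment *attach = mr->attach;
	struct sg_table *sgt;

	/* Sleeping context: serializes against the exporter's move. */
	dma_resv_lock(attach->dmabuf->resv, NULL);

	/* Optionally wait here for the exporter's move/copy fences in
	 * attach->dmabuf->resv before touching the new placement. */

	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (!IS_ERR(sgt)) {
		mr->sgt = sgt;
		my_mr_resume_hw_access(mr);
	}

	dma_resv_unlock(attach->dmabuf->resv);
}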
On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might
happen whenever. Kernel driver doesn't have much if any insight into when memory isn't needed anymore. I think in the rdma world that's called registering memory, but not sure.
Sure, but once registered the memory is able to be used at any moment with no visibility from the kernel.
Unlike GPU the transactions that trigger memory access do not go through the kernel - so there is no ability to interrupt a command flow and fiddle with mappings.
Jason
Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might
happen whenever. Kernel driver doesn't have much if any insight into when memory isn't needed anymore. I think in the rdma world that's called registering memory, but not sure.
Sure, but once registered the memory is able to be used at any moment with no visibility from the kernel.
Unlike GPU the transactions that trigger memory access do not go through the kernel - so there is no ability to interrupt a command flow and fiddle with mappings.
This is the same for GPUs with user space queues as well.
But we can still say for a process if that process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is accessible again.
In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
Christian.
Jason
-----Original Message----- From: Christian König christian.koenig@amd.com Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might
happen whenever. Kernel driver doesn't have much if any insight into when memory isn't needed anymore. I think in the rdma world that's called registering memory, but not sure.
Sure, but once registered the memory is able to be used at any moment with no visibility from the kernel.
Unlike GPU the transactions that trigger memory access do not go through the kernel - so there is no ability to interrupt a command flow and fiddle with mappings.
This is the same for GPUs with user space queues as well.
But we can still say for a process if that process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is accessible again.
In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or under any other condition).
So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.
Jianxin
Christian.
Jason
Am 07.07.20 um 23:58 schrieb Xiong, Jianxin:
-----Original Message----- From: Christian König christian.koenig@amd.com Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might
happen whenever. Kernel driver doesn't have much if any insight into when memory isn't needed anymore. I think in the rdma world that's called registering memory, but not sure.
Sure, but once registered the memory is able to be used at any moment with no visibility from the kernel.
Unlike GPU the transactions that trigger memory access do not go through the kernel - so there is no ability to interrupt a command flow and fiddle with mappings.
This is the same for GPUs with user space queues as well.
But we can still say for a process if that process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is accessible again.
In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or under any other condition).
So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.
And that's exactly the reason why I introduced explicit pin()/unpin() functions into the DMA-buf API: https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L81...
It's just that at least our device drivers currently prevent P2P with pinned DMA-bufs for two main reasons:
a) To prevent denial of service attacks because P2P BARs are a rather rare resource.
b) To prevent failures in configurations where P2P is not always possible between all devices which want to access a buffer.
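For completeness, a hedged sketch of what that pin() path could look like from a non-ODP importer's side, using dma_buf_pin()/dma_buf_unpin() under the reservation lock; whether the exporter permits this for P2P or first migrates the buffer to system memory is exactly the policy question behind a) and b). struct my_mr is the same hypothetical bookkeeping used in the earlier sketches:

static int my_mr_pin_dmabuf(struct my_mr *mr)
{
	struct dma_buf_attachment *attach = mr->attach;
	int ret;

	dma_resv_lock(attach->dmabuf->resv, NULL);

	ret = dma_buf_pin(attach);	/* exporter may migrate first or refuse */
	if (ret)
		goto out_unlock;

	/* One long-lived mapping, kept until the MR is destroyed. */
	mr->sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	if (IS_ERR(mr->sgt)) {
		ret = PTR_ERR(mr->sgt);
		mr->sgt = NULL;
		dma_buf_unpin(attach);
	}

out_unlock:
	dma_resv_unlock(attach->dmabuf->resv);
	return ret;
}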
Regards, Christian.
Jianxin
Christian.
Jason
On Wed, Jul 08, 2020 at 11:38:31AM +0200, Christian König wrote:
Am 07.07.20 um 23:58 schrieb Xiong, Jianxin:
-----Original Message----- From: Christian König christian.koenig@amd.com Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might
happen whenever. Kernel driver doesn't have much if any insight into when memory isn't needed anymore. I think in the rdma world that's called registering memory, but not sure.
Sure, but once registered the memory is able to be used at any moment with no visibility from the kernel.
Unlike GPU the transactions that trigger memory access do not go through the kernel - so there is no ability to interrupt a command flow and fiddle with mappings.
This is the same for GPUs with user space queues as well.
But we can still say for a process if that process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is accessible again.
In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or under any other condition).
So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.
And that's exactly the reason why I introduced explicit pin()/unpin() functions into the DMA-buf API: https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L81...
It's just that at least our device drivers currently prevent P2P with pinned DMA-bufs for two main reasons:
a) To prevent denial of service attacks because P2P BARs are a rather rare resource.
b) To prevent failures in configurations where P2P is not always possible between all devices which want to access a buffer.
So the above is more or less the question in the cover letter (which didn't make it to dri-devel). Can we somehow throw that limitation out, or is that simply not a good idea?
Simply moving buffers to system memory when they're pinned does away with a lot of headaches. For a specific custom-built system we can avoid that maybe, but I think upstream is kinda a different thing.
Cheers, Daniel
Regards, Christian.
Jianxin
Christian.
Jason
Am 08.07.20 um 11:49 schrieb Daniel Vetter:
On Wed, Jul 08, 2020 at 11:38:31AM +0200, Christian König wrote:
Am 07.07.20 um 23:58 schrieb Xiong, Jianxin:
-----Original Message----- From: Christian König christian.koenig@amd.com Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might
happen whenever. Kernel driver doesn't have much if any insight into when memory isn't needed anymore. I think in the rdma world that's called registering memory, but not sure.
Sure, but once registered the memory is able to be used at any moment with no visibility from the kernel.
Unlike GPU the transactions that trigger memory access do not go through the kernel - so there is no ability to interrupt a command flow and fiddle with mappings.
This is the same for GPUs with user space queues as well.
But we can still say for a process if that process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is accessible again.
In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or under any other condition).
So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.
And that's exactly the reason why I introduced explicit pin()/unpin() functions into the DMA-buf API: https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L81...
It's just that at least our device drivers currently prevent P2P with pinned DMA-bufs for two main reasons:
a) To prevent denial of service attacks because P2P BARs are a rather rare resource.
b) To prevent failures in configurations where P2P is not always possible between all devices which want to access a buffer.
So the above is more or less the question in the cover letter (which didn't make it to dri-devel). Can we somehow throw that limitation out, or is that simply not a good idea?
At least for the AMD graphics drivers that's most certainly not a good idea.
We do have a use case where buffers need to be in system memory because P2P doesn't work.
And by pinning them to VRAM you can create a really nice denial of service attack against the X system.
Simply moving buffers to system memory when they're pinned does away with a lot of headaches. For a specific custom-built system we can avoid that maybe, but I think upstream is kinda a different thing.
Yes, agree completely on that. Customers who are willing to take the risk can easily do this themselves.
But that is not something we should probably do for upstream.
Regards, Christian.
Cheers, Daniel
Regards, Christian.
Jianxin
Christian.
Jason
On Wed, Jul 8, 2020 at 10:20 AM Christian König christian.koenig@amd.com wrote:
Am 08.07.20 um 11:49 schrieb Daniel Vetter:
On Wed, Jul 08, 2020 at 11:38:31AM +0200, Christian König wrote:
Am 07.07.20 um 23:58 schrieb Xiong, Jianxin:
-----Original Message----- From: Christian König christian.koenig@amd.com Am 03.07.20 um 15:14 schrieb Jason Gunthorpe:
On Fri, Jul 03, 2020 at 02:52:03PM +0200, Daniel Vetter wrote:
So maybe I'm just totally confused about the rdma model. I thought:
- you bind a pile of memory for various transactions, that might happen whenever. Kernel driver doesn't have much if any insight into when memory isn't needed anymore. I think in the rdma world that's called registering memory, but not sure.
Sure, but once registered the memory is able to be used at any moment with no visibility from the kernel.
Unlike GPU the transactions that trigger memory access do not go through the kernel - so there is no ability to interrupt a command flow and fiddle with mappings.
This is the same for GPUs with user space queues as well.
But we can still say for a process if that process is using a DMA-buf which is moved out and so can't run any more unless the DMA-buf is accessible again.
In other words you somehow need to make sure that the hardware is not accessing a piece of memory any more when you want to move it.
While a process can be easily suspended, there is no way to tell the RDMA NIC not to process posted work requests that use specific memory regions (or under any other condition).
So far it appears to me that DMA-buf dynamic mapping for RDMA is only viable with ODP support. For NICs without ODP, a way to allow pinning the device memory is still needed.
And that's exactly the reason why I introduced explicit pin()/unpin() functions into the DMA-buf API: https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L81...
It's just that at least our device drivers currently prevent P2P with pinned DMA-bufs for two main reasons:
a) To prevent denial of service attacks because P2P BARs are a rather rare resource.
b) To prevent failures in configurations where P2P is not always possible between all devices which want to access a buffer.
So the above is more or less the question in the cover letter (which didn't make it to dri-devel). Can we somehow throw that limitation out, or is that simply not a good idea?
At least for the AMD graphics drivers that's most certainly not a good idea.
We do have a use case where buffers need to be in system memory because P2P doesn't work.
And by pinning them to VRAM you can create a really nice denial of service attack against the X system.
On the other hand, on modern platforms with large or resizable BARs, you may end up with systems with more vram than system ram.
Alex
Simply moving buffers to system memory when they're pinned does away with a lot of headaches. For a specific custom-built system we can avoid that maybe, but I think upstream is kinda a different thing.
Yes, agree completely on that. Customers who are willing to take the risk can easily do this themselves.
But that is not something we should probably do for upstream.
Regards, Christian.
Cheers, Daniel
Regards, Christian.
Jianxin
Christian.
Jason