From: Jérôme Glisse jglisse@redhat.com
This patchset adds support for peer to peer between devices in two ways. First, for device memory used through HMM in a process' regular address space (ie inside a regular vma that is not an mmap of a device file or special file). Second, for special vmas, ie mmaps of a device file; in this case some device drivers might want to allow other devices to directly access the memory used for those special vmas (note that the memory might not even be mapped to the CPU in this case).
There are many use cases for this; they mainly fall into 2 categories: [A]-Allow a device to directly map and control another device's command queue.
[B]-Allow a device to access another device's memory without disrupting the other device's computation.
Corresponding workloads:
[1]-A network device directly accesses and controls a block device's command queue so that it can do storage access without involving the CPU. This falls into [A].
[2]-An accelerator device does heavy computation while a network device monitors progress. Direct access to the accelerator's memory by the network device avoids the need to go through much slower system memory. This falls into [B].
[3]-An accelerator device does heavy computation while a network device streams out the result. This avoids the need to first bounce the result through system memory (it saves both system memory and bandwidth). This falls into [B].
[4]-Chaining device computation. For instance a camera device takes a picture and streams it to a color correction device that streams it to final memory. This falls into [A] and [B].
People have more ideas on how to use this than I can list here. The intention of this patchset is to provide the means to achieve those and much more.
I have done testing using nouveau and Mellanox mlx5 where the mlx5 device can directly access GPU memory [1]. I intend to use this inside nouveau and to help port AMD ROCm RDMA to use this [2]. I believe other people have expressed interest in using this with network devices and block devices.
From an implementation point of view this just adds 2 new callbacks to the vm_operations struct (for special device vma support) and 2 new callbacks to the HMM device memory structure (for HMM device memory support).
For now it needs the IOMMU off with ACS disabled, and both devices must be on the same PCIE sub-tree (it can not cross a root complex). However the intention here is different from some other peer to peer work in that we do want to support the IOMMU and are fine with going through the root complex in that case. In other words, the bandwidth advantage of avoiding the root complex is of less importance than the programming model for the feature. We actually expect that this will be used mostly with the IOMMU enabled and thus with having to go through the root bridge.
Another difference from other p2p solutions is that we require the importing device to abide by mmu notifier invalidation so that the exporting device can always invalidate a mapping at any point in time. For this reason we do not need a struct page for the device memory.
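As a rough illustration of that contract, an importer mirroring a peer-mapped range would tear its device mapping down from its mmu notifier callback, along these lines (struct importer_ctx and its fields are hypothetical, and the notifier signature shown is an assumption based on current kernels):

static int importer_invalidate_range_start(struct mmu_notifier *mn,
                                const struct mmu_notifier_range *range)
{
        /* importer_ctx is a hypothetical per-mapping context */
        struct importer_ctx *ctx = container_of(mn, struct importer_ctx, mn);
        struct vm_area_struct *vma = ctx->vma;

        /* Drop the device mapping for any overlap with the invalidated range */
        if (vma->vm_ops && vma->vm_ops->p2p_unmap)
                vma->vm_ops->p2p_unmap(vma, ctx->importer_dev,
                                       max(range->start, vma->vm_start),
                                       min(range->end, vma->vm_end),
                                       ctx->pa);
        return 0;
}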
Also, in all cases the policy and final decision on whether to map or not is solely under the control of the exporting device.
Finally, the device memory might not even be mapped to the CPU, and thus we have to go through the exporting device driver to get the physical address at which the memory is accessible.
The core changes are minimal (adding new callbacks to some structs). IOMMU support will need little change too. Most of the code is in drivers to implement export policy and BAR space management. A very rough playground with IOMMU support is in [3] (top 3 patches).
Cheers, Jérôme
[1] https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-p2p [2] https://github.com/RadeonOpenCompute/ROCnRDMA [3] https://cgit.freedesktop.org/~glisse/linux/log/?h=wip-hmm-p2p
Cc: Logan Gunthorpe logang@deltatee.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Rafael J. Wysocki rafael@kernel.org Cc: Bjorn Helgaas bhelgaas@google.com Cc: Christian Koenig christian.koenig@amd.com Cc: Felix Kuehling Felix.Kuehling@amd.com Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-pci@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Cc: Christoph Hellwig hch@lst.de Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Robin Murphy robin.murphy@arm.com Cc: Joerg Roedel jroedel@suse.de Cc: iommu@lists.linux-foundation.org
Jérôme Glisse (5):
  pci/p2p: add a function to test peer to peer capability
  drivers/base: add a function to test peer to peer capability
  mm/vma: add support for peer to peer to device vma
  mm/hmm: add support for peer to peer to HMM device memory
  mm/hmm: add support for peer to peer to special device vma
 drivers/base/core.c        |  20 ++++
 drivers/pci/p2pdma.c       |  27 +++++
 include/linux/device.h     |   1 +
 include/linux/hmm.h        |  53 +++++++++
 include/linux/mm.h         |  38 +++++++
 include/linux/pci-p2pdma.h |   6 +
 mm/hmm.c                   | 219 ++++++++++++++++++++++++++++++-------
 7 files changed, 325 insertions(+), 39 deletions(-)
From: Jérôme Glisse jglisse@redhat.com
device_test_p2p() returns true if two devices can peer to peer to each other. We add a generic function as different inter-connects can support peer to peer and we want to generically test this no matter what the inter-connect might be. However this version only supports PCIE for now.
Signed-off-by: Jérôme Glisse jglisse@redhat.com
Cc: Logan Gunthorpe logang@deltatee.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Rafael J. Wysocki rafael@kernel.org Cc: Bjorn Helgaas bhelgaas@google.com Cc: Christian Koenig christian.koenig@amd.com Cc: Felix Kuehling Felix.Kuehling@amd.com Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-kernel@vger.kernel.org Cc: linux-pci@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Cc: Christoph Hellwig hch@lst.de Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Robin Murphy robin.murphy@arm.com Cc: Joerg Roedel jroedel@suse.de Cc: iommu@lists.linux-foundation.org
---
 drivers/pci/p2pdma.c       | 27 +++++++++++++++++++++++++++
 include/linux/pci-p2pdma.h |  6 ++++++
 2 files changed, 33 insertions(+)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index c52298d76e64..620ac60babb5 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -797,3 +797,30 @@ ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
 	return sprintf(page, "%s\n", pci_name(p2p_dev));
 }
 EXPORT_SYMBOL_GPL(pci_p2pdma_enable_show);
+
+bool pci_test_p2p(struct device *devA, struct device *devB)
+{
+	struct pci_dev *pciA, *pciB;
+	bool ret;
+	int tmp;
+
+	/*
+	 * For now we only support PCIE peer to peer but other inter-connect
+	 * can be added.
+	 */
+	pciA = find_parent_pci_dev(devA);
+	pciB = find_parent_pci_dev(devB);
+	if (pciA == NULL || pciB == NULL) {
+		ret = false;
+		goto out;
+	}
+
+	tmp = upstream_bridge_distance(pciA, pciB, NULL);
+	ret = tmp < 0 ? false : true;
+
+out:
+	pci_dev_put(pciB);
+	pci_dev_put(pciA);
+	return false;
+}
+EXPORT_SYMBOL_GPL(pci_test_p2p);
diff --git a/include/linux/pci-p2pdma.h b/include/linux/pci-p2pdma.h
index bca9bc3e5be7..7671cc499a08 100644
--- a/include/linux/pci-p2pdma.h
+++ b/include/linux/pci-p2pdma.h
@@ -36,6 +36,7 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
 			    bool *use_p2pdma);
 ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
 			       bool use_p2pdma);
+bool pci_test_p2p(struct device *devA, struct device *devB);
 #else /* CONFIG_PCI_P2PDMA */
 static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
 		size_t size, u64 offset)
@@ -97,6 +98,11 @@ static inline ssize_t pci_p2pdma_enable_show(char *page,
 {
 	return sprintf(page, "none\n");
 }
+
+static inline bool pci_test_p2p(struct device *devA, struct device *devB)
+{
+	return false;
+}
 #endif /* CONFIG_PCI_P2PDMA */
On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
+bool pci_test_p2p(struct device *devA, struct device *devB)
+{
+	struct pci_dev *pciA, *pciB;
+	bool ret;
+	int tmp;
+
+	/*
+	 * For now we only support PCIE peer to peer but other inter-connect
+	 * can be added.
+	 */
+	pciA = find_parent_pci_dev(devA);
+	pciB = find_parent_pci_dev(devB);
+	if (pciA == NULL || pciB == NULL) {
+		ret = false;
+		goto out;
+	}
+
+	tmp = upstream_bridge_distance(pciA, pciB, NULL);
+	ret = tmp < 0 ? false : true;
+
+out:
+	pci_dev_put(pciB);
+	pci_dev_put(pciA);
+	return false;
+}
+EXPORT_SYMBOL_GPL(pci_test_p2p);
This function only ever returns false....
Logan
On Tue, Jan 29, 2019 at 11:24:09AM -0700, Logan Gunthorpe wrote:
This function only ever returns false....
I guess it was never actually tested :(
I feel really worried about passing random 'struct device' pointers into the PCI layer. Are we _sure_ it can handle this properly?
thanks,
greg k-h
On Tue, Jan 29, 2019 at 08:44:26PM +0100, Greg Kroah-Hartman wrote:
On Tue, Jan 29, 2019 at 11:24:09AM -0700, Logan Gunthorpe wrote:
This function only ever returns false....
I guess it was never actually tested :(
I feel really worried about passing random 'struct device' pointers into the PCI layer. Are we _sure_ it can handle this properly?
Oh yes, I fixed it on the test rig and forgot to patch my local git tree. My bad.
Cheers, Jérôme
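For reference, the fixed version presumably just returns the computed result; a minimal corrected sketch of the function above:

bool pci_test_p2p(struct device *devA, struct device *devB)
{
        struct pci_dev *pciA, *pciB;
        bool ret = false;

        pciA = find_parent_pci_dev(devA);
        pciB = find_parent_pci_dev(devB);
        if (pciA && pciB)
                ret = upstream_bridge_distance(pciA, pciB, NULL) >= 0;

        /* pci_dev_put() tolerates NULL, so this is safe on the error path */
        pci_dev_put(pciB);
        pci_dev_put(pciA);
        return ret;
}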
On 2019-01-29 12:44 p.m., Greg Kroah-Hartman wrote:
On Tue, Jan 29, 2019 at 11:24:09AM -0700, Logan Gunthorpe wrote:
This function only ever returns false....
I guess it was never actually tested :(
I feel really worried about passing random 'struct device' pointers into the PCI layer. Are we _sure_ it can handle this properly?
Yes, there are a couple of pci_p2pdma functions that take struct devices directly simply because it's way more convenient for the caller. That's what find_parent_pci_dev() takes care of (it returns NULL if the device is not a PCI device). Whether that's appropriate here is hard to say, seeing as we haven't seen any caller code.
Logan
On Tue, Jan 29, 2019 at 01:44:09PM -0700, Logan Gunthorpe wrote:
On 2019-01-29 12:44 p.m., Greg Kroah-Hartman wrote:
On Tue, Jan 29, 2019 at 11:24:09AM -0700, Logan Gunthorpe wrote:
This function only ever returns false....
I guess it was never actually tested :(
I feel really worried about passing random 'struct device' pointers into the PCI layer. Are we _sure_ it can handle this properly?
Yes, there are a couple of pci_p2pdma functions that take struct devices directly simply because it's way more convenient for the caller. That's what find_parent_pci_dev() takes care of (it returns NULL if the device is not a PCI device). Whether that's appropriate here is hard to say, seeing as we haven't seen any caller code.
Caller code as a reference (I already gave that link in another part of the thread, but just so that people don't have to follow all the branches).
https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-p2p&id=401a5676...
Cheers, Jérôme
On Tue, Jan 29, 2019 at 12:47 PM jglisse@redhat.com wrote:
From: Jérôme Glisse jglisse@redhat.com
device_test_p2p() returns true if two devices can peer to peer to each other. We add a generic function as different inter-connects can support peer to peer and we want to generically test this no matter what the inter-connect might be. However this version only supports PCIE for now.
What about something like these patches: https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=4fab9ff... https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=f90b12d... They are a bit more thorough.
Alex
On Tue, Jan 29, 2019 at 02:56:38PM -0500, Alex Deucher wrote:
On Tue, Jan 29, 2019 at 12:47 PM jglisse@redhat.com wrote:
From: Jérôme Glisse jglisse@redhat.com
device_test_p2p() returns true if two devices can peer to peer to each other. We add a generic function as different inter-connects can support peer to peer and we want to generically test this no matter what the inter-connect might be. However this version only supports PCIE for now.
What about something like these patches: https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=4fab9ff... https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=f90b12d... They are a bit more thorough.
Yes, it would be better; I forgot about those. I can include them next time I post. Thank you for reminding me about those :)
Cheers, Jérôme
On 2019-01-29 12:56 p.m., Alex Deucher wrote:
On Tue, Jan 29, 2019 at 12:47 PM jglisse@redhat.com wrote:
From: Jérôme Glisse jglisse@redhat.com
device_test_p2p() returns true if two devices can peer to peer to each other. We add a generic function as different inter-connects can support peer to peer and we want to generically test this no matter what the inter-connect might be. However this version only supports PCIE for now.
What about something like these patches: https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=4fab9ff... https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=f90b12d... They are a bit more thorough.
Those new functions seem to have a lot of overlap with the code that is already upstream in p2pdma.... Perhaps you should be improving the p2pdma functions if they aren't suitable for what you want already instead of creating new ones.
Logan
On Tue, Jan 29, 2019 at 3:25 PM Logan Gunthorpe logang@deltatee.com wrote:
On 2019-01-29 12:56 p.m., Alex Deucher wrote:
On Tue, Jan 29, 2019 at 12:47 PM jglisse@redhat.com wrote:
From: Jérôme Glisse jglisse@redhat.com
device_test_p2p() returns true if two devices can peer to peer to each other. We add a generic function as different inter-connects can support peer to peer and we want to generically test this no matter what the inter-connect might be. However this version only supports PCIE for now.
What about something like these patches: https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=4fab9ff... https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=f90b12d... They are a bit more thorough.
Those new functions seem to have a lot of overlap with the code that is already upstream in p2pdma.... Perhaps you should be improving the p2pdma functions if they aren't suitable for what you want already instead of creating new ones.
Could be. Those patches are pretty old. They probably need to be rebased on the latest upstream p2p stuff.
Alex
Am 29.01.19 um 21:24 schrieb Logan Gunthorpe:
On 2019-01-29 12:56 p.m., Alex Deucher wrote:
On Tue, Jan 29, 2019 at 12:47 PM jglisse@redhat.com wrote:
From: Jérôme Glisse jglisse@redhat.com
device_test_p2p() returns true if two devices can peer to peer to each other. We add a generic function as different inter-connects can support peer to peer and we want to generically test this no matter what the inter-connect might be. However this version only supports PCIE for now.
What about something like these patches: https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=4fab9ff... https://cgit.freedesktop.org/~deathsimple/linux/commit/?h=p2p&id=f90b12d... They are a bit more thorough.
Those new functions seem to have a lot of overlap with the code that is already upstream in p2pdma.... Perhaps you should be improving the p2pdma functions if they aren't suitable for what you want already instead of creating new ones.
Yeah, well that's what I was suggesting from the very beginning :)
But completely agree the existing functions should be improved instead of adding new ones, Christian.
Logan
From: Jérôme Glisse jglisse@redhat.com
device_test_p2p() returns true if two devices can peer to peer to each other. We add a generic function as different inter-connects can support peer to peer and we want to generically test this no matter what the inter-connect might be. However this version only supports PCIE for now.
Signed-off-by: Jérôme Glisse jglisse@redhat.com
Cc: Logan Gunthorpe logang@deltatee.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Rafael J. Wysocki rafael@kernel.org Cc: Bjorn Helgaas bhelgaas@google.com Cc: Christian Koenig christian.koenig@amd.com Cc: Felix Kuehling Felix.Kuehling@amd.com Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-kernel@vger.kernel.org Cc: linux-pci@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Cc: Christoph Hellwig hch@lst.de Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Robin Murphy robin.murphy@arm.com Cc: Joerg Roedel jroedel@suse.de Cc: iommu@lists.linux-foundation.org
---
 drivers/base/core.c    | 20 ++++++++++++++++++++
 include/linux/device.h |  1 +
 2 files changed, 21 insertions(+)
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 0073b09bb99f..56023b00e108 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -26,6 +26,7 @@
 #include <linux/netdevice.h>
 #include <linux/sched/signal.h>
 #include <linux/sysfs.h>
+#include <linux/pci-p2pdma.h>
 
 #include "base.h"
 #include "power/power.h"
@@ -3167,3 +3168,22 @@ void device_set_of_node_from_dev(struct device *dev, const struct device *dev2)
 	dev->of_node_reused = true;
 }
 EXPORT_SYMBOL_GPL(device_set_of_node_from_dev);
+
+/**
+ * device_test_p2p - test if two device can peer to peer to each other
+ * @devA: device A
+ * @devB: device B
+ * Returns: true if device can peer to peer to each other, false otherwise
+ */
+bool device_test_p2p(struct device *devA, struct device *devB)
+{
+	/*
+	 * For now we only support PCIE peer to peer but other inter-connect
+	 * can be added.
+	 */
+	if (pci_test_p2p(devA, devB))
+		return true;
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(device_test_p2p);
diff --git a/include/linux/device.h b/include/linux/device.h
index 6cb4640b6160..0d532d7f0779 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -1250,6 +1250,7 @@ extern int device_online(struct device *dev);
 extern void set_primary_fwnode(struct device *dev, struct fwnode_handle *fwnode);
 extern void set_secondary_fwnode(struct device *dev, struct fwnode_handle *fwnode);
 void device_set_of_node_from_dev(struct device *dev, const struct device *dev2);
+bool device_test_p2p(struct device *devA, struct device *devB);
 
 static inline int dev_num_vf(struct device *dev)
 {
On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
From: Jérôme Glisse jglisse@redhat.com
device_test_p2p() returns true if two devices can peer to peer to each other. We add a generic function as different inter-connects can support peer to peer and we want to generically test this no matter what the inter-connect might be. However this version only supports PCIE for now.
This doesn't appear to be used in any of the further patches; so it's very confusing.
I'm not sure a struct device wrapper is really necessary...
Logan
On Tue, Jan 29, 2019 at 11:26:01AM -0700, Logan Gunthorpe wrote:
On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
From: Jérôme Glisse jglisse@redhat.com
device_test_p2p() returns true if two devices can peer to peer to each other. We add a generic function as different inter-connects can support peer to peer and we want to generically test this no matter what the inter-connect might be. However this version only supports PCIE for now.
This doesn't appear to be used in any of the further patches; so it's very confusing.
I'm not sure a struct device wrapper is really necessary...
I wanted to allow other non-PCI devices to join in the fun, but yes, right now I have only been doing this on PCI devices.
Jérôme
On Tue, Jan 29, 2019 at 12:47:25PM -0500, jglisse@redhat.com wrote:
From: Jérôme Glisse jglisse@redhat.com
device_test_p2p() returns true if two devices can peer to peer to each other. We add a generic function as different inter-connects can support peer to peer and we want to generically test this no matter what the inter-connect might be. However this version only supports PCIE for now.
There is no definition of "peer to peer" in the driver/device model, so why should this be in the driver core at all?
Especially as you only do this for PCI, why not just keep it in the PCI layer, that way you _know_ you are dealing with the right pointer types and there is no need to mess around with the driver core at all.
thanks,
greg k-h
On Tue, Jan 29, 2019 at 08:46:05PM +0100, Greg Kroah-Hartman wrote:
On Tue, Jan 29, 2019 at 12:47:25PM -0500, jglisse@redhat.com wrote:
From: Jérôme Glisse jglisse@redhat.com
device_test_p2p() returns true if two devices can peer to peer to each other. We add a generic function as different inter-connects can support peer to peer and we want to generically test this no matter what the inter-connect might be. However this version only supports PCIE for now.
There is no definition of "peer to peer" in the driver/device model, so why should this be in the driver core at all?
Especially as you only do this for PCI, why not just keep it in the PCI layer, that way you _know_ you are dealing with the right pointer types and there is no need to mess around with the driver core at all.
Ok, I will drop the core device change. I wanted to allow other non-PCI devices to join later on (ie allow a PCI device to export to a non-PCI device), but if that ever happens then we can update the PCI exporter at the same time we introduce a non-PCI importer.
Cheers, Jérôme
From: Jérôme Glisse jglisse@redhat.com
Allow mmap of a device file to export device memory to peer to peer devices. This will allow, for instance, a network device to access GPU memory or to access a storage device queue directly.
The common case will be a vma created by a userspace device driver that is then shared with another userspace device driver, which calls into its kernel device driver to map that vma.
The vma does not need to have any valid CPU mapping, so that only peer to peer devices might access its content. Or it could have a valid CPU mapping too; in that case it should point to the same memory for consistency.
Note that peer to peer mapping is highly platform and device dependent and it might not work in all cases. However we do expect support for this to grow on more hardware platforms.
This patch only adds new callbacks to vm_operations_struct; the bulk of the code lives within the common bus driver (like pci) and the device drivers (both the exporting and the importing device).
The current design mandates that the importer must obey mmu_notifier and invalidate any peer to peer mapping anytime a notification of invalidation happens for a range that has been peer to peer mapped. This allows the exporting device to easily invalidate a mapping for any importing device.
Signed-off-by: Jérôme Glisse jglisse@redhat.com
Cc: Logan Gunthorpe logang@deltatee.com Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org Cc: Rafael J. Wysocki rafael@kernel.org Cc: Bjorn Helgaas bhelgaas@google.com Cc: Christian Koenig christian.koenig@amd.com Cc: Felix Kuehling Felix.Kuehling@amd.com Cc: Jason Gunthorpe jgg@mellanox.com Cc: linux-kernel@vger.kernel.org Cc: linux-pci@vger.kernel.org Cc: dri-devel@lists.freedesktop.org Cc: Christoph Hellwig hch@lst.de Cc: Marek Szyprowski m.szyprowski@samsung.com Cc: Robin Murphy robin.murphy@arm.com Cc: Joerg Roedel jroedel@suse.de Cc: iommu@lists.linux-foundation.org
---
 include/linux/mm.h | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 80bb6408fe73..1bd60a90e575 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -429,6 +429,44 @@ struct vm_operations_struct {
 			pgoff_t start_pgoff, pgoff_t end_pgoff);
 	unsigned long (*pagesize)(struct vm_area_struct * area);
 
+	/*
+	 * Optional for device driver that want to allow peer to peer (p2p)
+	 * mapping of their vma (which can be back by some device memory) to
+	 * another device.
+	 *
+	 * Note that the exporting device driver might not have map anything
+	 * inside the vma for the CPU but might still want to allow a peer
+	 * device to access the range of memory corresponding to a range in
+	 * that vma.
+	 *
+	 * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
+	 * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
+	 * SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
+	 * device to map once during setup and report any failure at that time
+	 * to the userspace. Further mapping of the same range might happen
+	 * after mmu notifier invalidation over the range. The exporting device
+	 * can use this to move things around (defrag BAR space for instance)
+	 * or do other similar task.
+	 *
+	 * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
+	 * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
+	 * POINT IN TIME WITH NO LOCK HELD.
+	 *
+	 * In below function, the device argument is the importing device,
+	 * the exporting device is the device to which the vma belongs.
+	 */
+	long (*p2p_map)(struct vm_area_struct *vma,
+			struct device *device,
+			unsigned long start,
+			unsigned long end,
+			dma_addr_t *pa,
+			bool write);
+	long (*p2p_unmap)(struct vm_area_struct *vma,
+			  struct device *device,
+			  unsigned long start,
+			  unsigned long end,
+			  dma_addr_t *pa);
+
 	/* notification that a previously read-only page is about to become
 	 * writable, if an error is returned it will cause a SIGBUS */
 	vm_fault_t (*page_mkwrite)(struct vm_fault *vmf);
On 2019-01-29 10:47 a.m., jglisse@redhat.com wrote:
+	/*
+	 * Optional for device driver that want to allow peer to peer (p2p)
+	 * mapping of their vma (which can be back by some device memory) to
+	 * another device.
+	 *
+	 * Note that the exporting device driver might not have map anything
+	 * inside the vma for the CPU but might still want to allow a peer
+	 * device to access the range of memory corresponding to a range in
+	 * that vma.
+	 *
+	 * FOR PREDICTABILITY IF DRIVER SUCCESSFULY MAP A RANGE ONCE FOR A
+	 * DEVICE THEN FURTHER MAPPING OF THE SAME IF THE VMA IS STILL VALID
+	 * SHOULD ALSO BE SUCCESSFUL. Following this rule allow the importing
+	 * device to map once during setup and report any failure at that time
+	 * to the userspace. Further mapping of the same range might happen
+	 * after mmu notifier invalidation over the range. The exporting device
+	 * can use this to move things around (defrag BAR space for instance)
+	 * or do other similar task.
+	 *
+	 * IMPORTER MUST OBEY mmu_notifier NOTIFICATION AND CALL p2p_unmap()
+	 * WHEN A NOTIFIER IS CALL FOR THE RANGE ! THIS CAN HAPPEN AT ANY
+	 * POINT IN TIME WITH NO LOCK HELD.
+	 *
+	 * In below function, the device argument is the importing device,
+	 * the exporting device is the device to which the vma belongs.
+	 */
+	long (*p2p_map)(struct vm_area_struct *vma,
+			struct device *device,
+			unsigned long start,
+			unsigned long end,
+			dma_addr_t *pa,
+			bool write);
+	long (*p2p_unmap)(struct vm_area_struct *vma,
+			  struct device *device,
+			  unsigned long start,
+			  unsigned long end,
+			  dma_addr_t *pa);
I don't understand why we need new p2p_[un]map function pointers for this. In subsequent patches, they never appear to be set anywhere and are only called by the HMM code. I'd have expected it to be called by some core VMA code and set by HMM as that's what vm_operations_struct is for.
But the code is all very confusing, hard to follow, and seems to be missing significant chunks. So I'm not really sure what is going on.
Logan
On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
I don't understand why we need new p2p_[un]map function pointers for this. In subsequent patches, they never appear to be set anywhere and are only called by the HMM code. I'd have expected it to be called by some core VMA code and set by HMM as that's what vm_operations_struct is for.
But the code is all very confusing, hard to follow, and seems to be missing significant chunks. So I'm not really sure what is going on.
It is set by the device driver when userspace does mmap(fd) where fd comes from open("/dev/somedevicefile"). So it is set by the device driver. HMM has nothing to do with this. It must be set by the device driver's mmap callback (the mmap callback of struct file_operations). For this patch you can completely ignore all the HMM patches. Maybe posting this as 2 separate patchsets would make it clearer.
For instance see [1] for how a non-HMM driver can export its memory by just setting those callbacks. Note that a proper implementation of this should also include some kind of driver policy on what to allow to map and what not to allow ... All this is driver specific anyway.
Cheers, Jérôme
[1] https://cgit.freedesktop.org/~glisse/linux/commit/?h=wip-p2p-showcase&id...
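For illustration, a minimal sketch of what the exporting driver's mmap file operation might look like when wiring up these callbacks (exporter_p2p_map(), exporter_p2p_unmap() and struct exporter_obj are hypothetical names, not part of the patchset):

static const struct vm_operations_struct exporter_vm_ops = {
        /* a CPU fault handler is optional; the vma may never be CPU mapped */
        .p2p_map        = exporter_p2p_map,
        .p2p_unmap      = exporter_p2p_unmap,
};

static int exporter_mmap(struct file *filp, struct vm_area_struct *vma)
{
        struct exporter_obj *obj = filp->private_data;

        vma->vm_private_data = obj;
        vma->vm_ops = &exporter_vm_ops;
        return 0;
}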
On 2019-01-29 12:11 p.m., Jerome Glisse wrote:
On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
I don't understand why we need new p2p_[un]map function pointers for this. In subsequent patches, they never appear to be set anywhere and are only called by the HMM code. I'd have expected it to be called by some core VMA code and set by HMM as that's what vm_operations_struct is for.
But the code is all very confusing, hard to follow, and seems to be missing significant chunks. So I'm not really sure what is going on.
It is set by the device driver when userspace does mmap(fd) where fd comes from open("/dev/somedevicefile"). So it is set by the device driver. HMM has nothing to do with this. It must be set by the device driver's mmap callback (the mmap callback of struct file_operations). For this patch you can completely ignore all the HMM patches. Maybe posting this as 2 separate patchsets would make it clearer.
For instance see [1] for how a non-HMM driver can export its memory by just setting those callbacks. Note that a proper implementation of this should also include some kind of driver policy on what to allow to map and what not to allow ... All this is driver specific anyway.
I'd suggest [1] should be a part of the patchset so we can actually see a user of the stuff you're adding.
But it still doesn't explain everything as without the HMM code nothing calls the new vm_ops. And there's still no callers for the p2p_test functions you added. And I still don't understand why we need the new vm_ops or who calls them and when. Why can't drivers use the existing 'fault' vm_op and call a new helper function to map p2p when appropriate or a different helper function to map a large range in its mmap operation? Just like regular mmap code...
Logan
On Tue, Jan 29, 2019 at 12:24:04PM -0700, Logan Gunthorpe wrote:
On 2019-01-29 12:11 p.m., Jerome Glisse wrote:
On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
I don't understand why we need new p2p_[un]map function pointers for this. In subsequent patches, they never appear to be set anywhere and are only called by the HMM code. I'd have expected it to be called by some core VMA code and set by HMM as that's what vm_operations_struct is for.
But the code is all very confusing, hard to follow, and seems to be missing significant chunks. So I'm not really sure what is going on.
It is set by the device driver when userspace does mmap(fd) where fd comes from open("/dev/somedevicefile"). So it is set by the device driver. HMM has nothing to do with this. It must be set by the device driver's mmap callback (the mmap callback of struct file_operations). For this patch you can completely ignore all the HMM patches. Maybe posting this as 2 separate patchsets would make it clearer.
For instance see [1] for how a non-HMM driver can export its memory by just setting those callbacks. Note that a proper implementation of this should also include some kind of driver policy on what to allow to map and what not to allow ... All this is driver specific anyway.
I'd suggest [1] should be a part of the patchset so we can actually see a user of the stuff you're adding.
I did not want to clutter the patchset with device driver specific usage of this, as the API can be reasoned about in an abstract way.
But it still doesn't explain everything as without the HMM code nothing calls the new vm_ops. And there's still no callers for the p2p_test functions you added. And I still don't understand why we need the new vm_ops or who calls them and when. Why can't drivers use the existing 'fault' vm_op and call a new helper function to map p2p when appropriate or a different helper function to map a large range in its mmap operation? Just like regular mmap code...
The HMM code is just one user; if you have a driver that uses HMM mirror then your driver gets support for this for free. If you do not want to use HMM then you can directly call this in your driver.
The flow is: a device driver wants to set up a mapping for a range of virtual addresses [va_start, va_end]:
1 - Look up the vma for the range.
2 - If the vma is a regular vma (not an mmap of a file) then use GUP. If the vma is an mmap of a file and it has a p2p_map callback then call p2p_map().
3 - Use either the result of GUP or of p2p_map() to program the device.
A rough sketch of this importer-side flow is shown below.
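A minimal sketch of that importer-side flow, with locking, error handling and the GUP/dma_map step elided (importer_map_range() is a hypothetical helper, not part of the patchset):

static long importer_map_range(struct mm_struct *mm, struct device *importer,
                               unsigned long start, unsigned long end,
                               dma_addr_t *pa, bool write)
{
        struct vm_area_struct *vma = find_vma(mm, start);

        if (!vma || start < vma->vm_start || end > vma->vm_end)
                return -EFAULT;

        /* Special vma exported by a device file: use the new callback */
        if (vma->vm_ops && vma->vm_ops->p2p_map)
                return vma->vm_ops->p2p_map(vma, importer, start, end,
                                            pa, write);

        /* Regular vma: fall back to GUP (or HMM mirroring) + dma_map (not shown) */
        return -EOPNOTSUPP;
}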
The p2p test function is used by the device driver implementing the callback; for instance see:
https://cgit.freedesktop.org/~glisse/linux/commit/?h=hmm-p2p&id=401a5676...
The vm_fault callback is not suited because here we are mapping to another device, so this will need special handling; someone must end up with both device struct pointers, and someone must be allowed to make the decision on what to allow and what not to allow.
Moreover, an exporting driver like a GPU driver might have a complex policy in place whereby it will only allow export of some memory to a peer device but not other memory.
In the end it means that it is easier and simpler to add new callbacks specifically for that, so the intention is clear to both the caller and the callee. The exporting device can then do the proper check using the core helper (ie checking that the devices can actually peer to each other) and, if that works, it can then decide whether or not it wants to allow this other device to access its memory or whether it prefers to use main memory for this.
To add this to the fault callback we would need to define a bunch of new flags, set up a fake page table so that we can populate ptes, and then have the importing device re-interpret everything specially because it comes from another device. It would look ugly and it would need modifications to a bunch of core mm code.
Note that this callback solution also allows an exporting device to allow peer access while the CPU can not access the memory, ie the pte entries for the range are pte_none.
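As a concrete (purely illustrative) shape for such an exporter-side callback, assuming a hypothetical struct exporter_obj with an allow_p2p flag set via driver ioctls and a hypothetical helper that fills in BAR bus addresses:

static long exporter_p2p_map(struct vm_area_struct *vma, struct device *importer,
                             unsigned long start, unsigned long end,
                             dma_addr_t *pa, bool write)
{
        struct exporter_obj *obj = vma->vm_private_data;

        /* Topology check: can the importer reach our BAR at all? */
        if (!pci_test_p2p(obj->dev, importer))
                return -EINVAL;

        /* Driver policy: only objects userspace explicitly tagged for export */
        if (!obj->allow_p2p || (write && !obj->allow_p2p_write))
                return -EPERM;

        /* Fill pa[] with bus addresses inside the BAR for [start, end) */
        return exporter_fill_bar_addresses(obj, start, end, pa, write);
}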
Cheers, Jérôme
On 2019-01-29 12:44 p.m., Jerome Glisse wrote:
I'd suggest [1] should be a part of the patchset so we can actually see a user of the stuff you're adding.
I did not want to clutter the patchset with device driver specific usage of this, as the API can be reasoned about in an abstract way.
It's hard to reason about an interface when you can't see what all the layers want to do with it. Most maintainers (I'd hope) would certainly never merge code that has no callers, and for much the same reason, I'd rather not review patches that don't have real use case examples.
Logan
On Tue, Jan 29, 2019 at 02:11:23PM -0500, Jerome Glisse wrote:
On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
I don't understand why we need new p2p_[un]map function pointers for this. In subsequent patches, they never appear to be set anywhere and are only called by the HMM code. I'd have expected it to be called by some core VMA code and set by HMM as that's what vm_operations_struct is for.
But the code is all very confusing, hard to follow, and seems to be missing significant chunks. So I'm not really sure what is going on.
It is set by the device driver when userspace does mmap(fd) where fd comes from open("/dev/somedevicefile"). So it is set by the device driver. HMM has nothing to do with this. It must be set by the device driver's mmap callback (the mmap callback of struct file_operations). For this patch you can completely ignore all the HMM patches. Maybe posting this as 2 separate patchsets would make it clearer.
For instance see [1] for how a non-HMM driver can export its memory by just setting those callbacks. Note that a proper implementation of this should also include some kind of driver policy on what to allow to map and what not to allow ... All this is driver specific anyway.
I'm imagining that the RDMA drivers would use this interface on their per-process 'doorbell' BAR pages - we also wish to have P2P DMA to this memory. Also the entire VFIO PCI BAR mmap would be good to cover with this too.
Jerome, I think it would be nice to have a helper scheme - I think the simple case would be simple remapping of PCI BAR memory, so if we could have, say something like:
static const struct vm_operations_struct my_ops {
	.p2p_map = p2p_ioremap_map_op,
	.p2p_unmap = p2p_ioremap_unmap_op,
}

struct ioremap_data {
	[..]
}

fops_mmap() {
	vma->private_data = &driver_priv->ioremap_data;
	return p2p_ioremap_device_memory(vma, exporting_device, [..]);
}
Which closely matches at least what the RDMA drivers do. Where p2p_ioremap_device_memory populates p2p_map and p2p_unmap pointers with sensible functions, etc.
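A possible shape for those canned callbacks, keeping Jason's p2p_ioremap_* names and assuming, purely for the sketch, that the BAR layout behind the vma is static, that struct ioremap_data carries the exporting device and the BAR bus address, and that p2p_map returns the number of pages filled:

static long p2p_ioremap_map_op(struct vm_area_struct *vma, struct device *importer,
                               unsigned long start, unsigned long end,
                               dma_addr_t *pa, bool write)
{
        struct ioremap_data *data = vma->vm_private_data;
        unsigned long i, npages = (end - start) >> PAGE_SHIFT;
        unsigned long off = start - vma->vm_start;

        if (!pci_test_p2p(data->exporting_dev, importer))
                return -EINVAL;

        /* Static BAR layout: the bus address is just an offset into the BAR */
        for (i = 0; i < npages; i++)
                pa[i] = data->bar_bus_addr + off + (i << PAGE_SHIFT);
        return npages;
}

static long p2p_ioremap_unmap_op(struct vm_area_struct *vma, struct device *importer,
                                 unsigned long start, unsigned long end,
                                 dma_addr_t *pa)
{
        /* Nothing to tear down for a static BAR mapping */
        return 0;
}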
It looks like vfio would be able to use this as well (though I am unsure why vfio uses remap_pfn_range instead of io_remap_pfn_range for BAR memory..)
Do any drivers need more control than this?
Jason
On Tue, Jan 29, 2019 at 07:32:57PM +0000, Jason Gunthorpe wrote:
On Tue, Jan 29, 2019 at 02:11:23PM -0500, Jerome Glisse wrote:
On Tue, Jan 29, 2019 at 11:36:29AM -0700, Logan Gunthorpe wrote:
I don't understand why we need new p2p_[un]map function pointers for this. In subsequent patches, they never appear to be set anywhere and are only called by the HMM code. I'd have expected it to be called by some core VMA code and set by HMM as that's what vm_operations_struct is for.
But the code is all very confusing, hard to follow, and seems to be missing significant chunks. So I'm not really sure what is going on.
It is set by the device driver when userspace does mmap(fd) where fd comes from open("/dev/somedevicefile"). So it is set by the device driver. HMM has nothing to do with this. It must be set by the device driver's mmap callback (the mmap callback of struct file_operations). For this patch you can completely ignore all the HMM patches. Maybe posting this as 2 separate patchsets would make it clearer.
For instance see [1] for how a non-HMM driver can export its memory by just setting those callbacks. Note that a proper implementation of this should also include some kind of driver policy on what to allow to map and what not to allow ... All this is driver specific anyway.
I'm imagining that the RDMA drivers would use this interface on their per-process 'doorbell' BAR pages - we also wish to have P2P DMA to this memory. Also the entire VFIO PCI BAR mmap would be good to cover with this too.
Correct, you would set those callbacks on the mmap of your doorbell.
Jerome, I think it would be nice to have a helper scheme - I think the simple case would be simple remapping of PCI BAR memory, so if we could have, say something like:
static const struct vm_operations_struct my_ops {
	.p2p_map = p2p_ioremap_map_op,
	.p2p_unmap = p2p_ioremap_unmap_op,
}

struct ioremap_data {
	[..]
}

fops_mmap() {
	vma->private_data = &driver_priv->ioremap_data;
	return p2p_ioremap_device_memory(vma, exporting_device, [..]);
}
Which closely matches at least what the RDMA drivers do. Where p2p_ioremap_device_memory populates p2p_map and p2p_unmap pointers with sensible functions, etc.
It looks like vfio would be able to use this as well (though I am unsure why vfio uses remap_pfn_range instead of io_remap_pfn_range for BAR memory..)
Yes, a simple helper that implements a sane default is definitely a good idea. As I was working with GPUs it was not something that immediately popped to mind (see below). But I can certainly do a sane set of default helpers that simple device drivers can use right away without too much thinking on their part. I will add this for the next posting.
Do any drivers need more control than this?
GPU drivers do want more control :) GPU drivers are moving things around all the time and they have more memory than BAR space (on newer platforms AMD GPUs do resize the BAR, but it is not the rule for all GPUs). So GPU drivers do actually manage their BAR address space and they map and unmap things there. They can not allow someone to just pin stuff there randomly, or this would disrupt their regular work flow. Hence they need control, and they might implement a threshold: for instance, if they have more than N pages of BAR space mapped for peer to peer then they can decide to fall back to main memory for any new peer mapping.
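Purely as an illustration of such a threshold (every name below is hypothetical), the exporting GPU driver could keep a running budget of BAR pages handed out to peers and fall back to main memory once it is exhausted:

static bool gpu_p2p_reserve_bar(struct gpu_device *gdev, unsigned long npages)
{
        bool ok;

        spin_lock(&gdev->p2p_lock);
        ok = gdev->p2p_bar_pages + npages <= gdev->p2p_bar_limit;
        if (ok)
                gdev->p2p_bar_pages += npages;
        spin_unlock(&gdev->p2p_lock);

        /* On failure the driver would migrate the object to system memory instead */
        return ok;
}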
Cheers, Jérôme
On Tue, Jan 29, 2019 at 02:50:55PM -0500, Jerome Glisse wrote:
GPU drivers do want more control :) GPU drivers are moving things around all the time and they have more memory than BAR space (on newer platforms AMD GPUs do resize the BAR, but it is not the rule for all GPUs). So GPU drivers do actually manage their BAR address space and they map and unmap things there. They can not allow someone to just pin stuff there randomly, or this would disrupt their regular work flow. Hence they need control, and they might implement a threshold: for instance, if they have more than N pages of BAR space mapped for peer to peer then they can decide to fall back to main memory for any new peer mapping.
But this API doesn't seem to offer any control - I thought that control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
I would think that the importing driver can assume the BAR page is kept alive until it calls unmap (presumably triggered by notifiers)?
ie the exporting driver sees the BAR page as pinned until unmap.
Jason
On Tue, Jan 29, 2019 at 08:24:36PM +0000, Jason Gunthorpe wrote:
On Tue, Jan 29, 2019 at 02:50:55PM -0500, Jerome Glisse wrote:
GPU drivers do want more control :) GPU drivers are moving things around all the time and they have more memory than BAR space (on newer platforms AMD GPUs do resize the BAR, but it is not the rule for all GPUs). So GPU drivers do actually manage their BAR address space and they map and unmap things there. They can not allow someone to just pin stuff there randomly, or this would disrupt their regular work flow. Hence they need control, and they might implement a threshold: for instance, if they have more than N pages of BAR space mapped for peer to peer then they can decide to fall back to main memory for any new peer mapping.
But this API doesn't seem to offer any control - I thought that control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
The control is within the driver implementation of those callbacks. So the driver implementation can refuse to map by returning an error on p2p_map, or it can decide to use main memory by migrating its object to main memory and populating the dma address array with dma_map_page() of the main memory pages. A driver like a GPU driver can have policy on top of that: for instance, it will only allow a p2p map to succeed for objects that have been tagged by userspace in some way, ie the userspace application is in control of what can be mapped to a peer device. This is needed for GPU drivers, as we do want userspace involvement on which objects are allowed to have p2p access, and also so that we can report to userspace when we are running out of BAR addresses for this to work as intended (ie not falling back to main memory) so that the application can take appropriate actions (like deciding what to prioritize).
For moving things around after a successful p2p_map, yes, the exporting device has to call for instance zap_vma_ptes() or something similar. This will trigger a notifier call and the importing device will invalidate its mapping. Once it is invalidated, the exporting device can point a new call of p2p_map (for the same range) to new memory (obviously the exporting device has to synchronize any concurrent calls to p2p_map with the invalidation).
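Sketched out, that exporter-side revocation could look roughly like this (struct gpu_object and its fields are hypothetical; zap_vma_ptes() is the existing kernel helper mentioned above):

static void exporter_evict_object(struct gpu_object *obj)
{
        struct vm_area_struct *vma = obj->vma;

        /*
         * Zapping the ptes fires the mmu notifiers, which makes every
         * importer call p2p_unmap() for the range.
         */
        zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);

        /*
         * The next p2p_map() for this range can now be pointed at a new
         * BAR location or at system memory (serialized against concurrent
         * p2p_map calls by the driver).
         */
}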
I would think that the importing driver can assume the BAR page is kept alive until it calls unmap (presumably triggered by notifiers)?
ie the exporting driver sees the BAR page as pinned until unmap.
The intention with this patchset is that it is not pinned, ie the importer device _must_ abide by all mmu notifier invalidations and they can happen at any time. The importing device can however re-p2p_map the same range after an invalidation.
I would like to restrict this to importers that can invalidate for now, because I believe all the first devices to use this can support the invalidation.
Also, when using HMM private device memory we _can not_ pin a virtual address to device memory, as otherwise CPU access would have to SIGBUS or SEGFAULT and we do not want that. So this was also a motivation to keep things consistent for the importer in both cases.
Cheers, Jérôme
On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:
But this API doesn't seem to offer any control - I thought that control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
The control is within the driver implementation of those callbacks.
Seems like what you mean by control is 'the exporter gets to choose the physical address at the instant of map' - which seems reasonable for GPU.
will only allow p2p map to succeed for objects that have been tagged by userspace in some way, ie the userspace application is in control of what can be mapped to a peer device.
I would have thought this means the VMA for the object is created without the map/unmap ops? Or are GPU objects and VMAs unrelated?
For moving things around after a successful p2p_map, yes, the exporting device has to call for instance zap_vma_ptes() or something similar.
Okay, great, RDMA needs this flow for hotplug - we zap the VMAs when unplugging the PCI device and we can delay the PCI unplug completion until all the p2p_unmaps are called...
But in this case a future p2p_map will have to fail as the BAR no longer exists. How to handle this?
I would think that the importing driver can assume the BAR page is kept alive until it calls unmap (presumably triggered by notifiers)?
ie the exporting driver sees the BAR page as pinned until unmap.
The intention with this patchset is that it is not a pin, ie the importer device _must_ abide by all mmu notifier invalidations and they can happen at any time. The importing device can however re-p2p_map the same range after an invalidation.
I would like to restrict this to importers that can invalidate for now, because I believe all the first devices to use this can support invalidation.
This seems reasonable (and sort of says importers not getting this from HMM need careful checking), was this in the comment above the ops?
Jason
On Tue, Jan 29, 2019 at 11:02:25PM +0000, Jason Gunthorpe wrote:
On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:
But this API doesn't seem to offer any control - I thought that control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
The control is within the driver implementation of those callbacks.
Seems like what you mean by control is 'the exporter gets to choose the physical address at the instant of map' - which seems reasonable for GPU.
will only allow p2p map to succeed for objects that have been tagged by userspace in some way, ie the userspace application is in control of what can be mapped to a peer device.
I would have thought this means the VMA for the object is created without the map/unmap ops? Or are GPU objects and VMAs unrelated?
GPU objects and VMAs are unrelated in all the open source GPU drivers I am somewhat familiar with (AMD, Intel, NVidia). You can create a GPU object and never map it (and thus never have it associated with a vma) and in fact this is very common. For graphics, usually only a handful of the hundreds of GPU objects your application has are mapped.
The control for peer to peer can also be a mutable property of the object, ie userspace does an ioctl on the GPU driver which creates an object; some time after the object is created, userspace does other ioctls to allow exporting the object to another specific device. Again this results in ioctls to the device driver; those ioctls set flags and update the GPU object kernel structure with all the info.
In the meantime you have no control over when another driver might call the vma p2p callbacks. So you must have registered the vma with vm_operations that include p2p_map and p2p_unmap. Those driver functions will check the object kernel structure each time they get called and act accordingly. A rough sketch of what that could look like follows below.
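Roughly, and only as a sketch (the p2p_map/p2p_unmap vm_operations fields are what this patchset proposes; the ioctl, the flag and the helpers are made-up driver-side names):

static const struct vm_operations_struct my_gem_vm_ops = {
	.fault     = my_gem_fault,
	.p2p_map   = my_p2p_map,	/* checks obj->flags & MY_OBJ_ALLOW_P2P */
	.p2p_unmap = my_p2p_unmap,
};

static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct my_obj *obj = my_obj_lookup(filp, vma->vm_pgoff);

	vma->vm_private_data = obj;
	vma->vm_ops = &my_gem_vm_ops;
	return 0;
}

/* Later, userspace flips the export permission on the object through a
 * driver specific ioctl: */
static long my_ioctl_allow_p2p(struct my_obj *obj, bool allow)
{
	if (allow)
		obj->flags |= MY_OBJ_ALLOW_P2P;
	else
		obj->flags &= ~MY_OBJ_ALLOW_P2P;
	return 0;
}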
For moving things around after a successful p2p_map, yes, the exporting device has to call for instance zap_vma_ptes() or something similar.
Okay, great, RDMA needs this flow for hotplug - we zap the VMAs when unplugging the PCI device and we can delay the PCI unplug completion until all the p2p_unmaps are called...
But in this case a future p2p_map will have to fail as the BAR no longer exists. How to handle this?
So the comment above the callback (I should write more thorough guidelines and documentation) states that the exporter should/(must?) be predictable, ie if an importer device calls p2p_map() once on a vma and it succeeds, then if the same device calls p2p_map() again on the same vma, and if the vma is still valid (ie no unmap and it does not correspond to a different object ...), then p2p_map() should/(must?) succeed.
The idea is that the importer would do a first call to p2p_map() when it sets up its own object, and report failure to userspace if that fails. If it does succeed then we should never have an issue the next time we call p2p_map() (after the mapping has been invalidated by an mmu notifier for instance). So it will succeed just like the first call (again assuming the vma is still valid).
The idea is that we can only ask exporters to be predictable and still allow them to fail if things are really going bad. A schematic importer-side flow is sketched below.
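This is purely schematic: the op prototypes are placeholders since the exact signatures are defined by the patches, all the importer helpers are invented, and the notifier plumbing that delivers the invalidation is elided.

static int importer_setup(struct importer_obj *iobj)
{
	/* First map at object setup time; failure is reported to userspace. */
	int ret = iobj->vma->vm_ops->p2p_map(iobj->vma, iobj->dev,
					     iobj->start, iobj->end,
					     iobj->dma_addrs);
	if (ret)
		return ret;	/* exporter refused or ran out of resources */
	return importer_program_hw(iobj);
}

static void importer_invalidate(struct importer_obj *iobj)
{
	/* Called from the importer's mmu notifier / HMM mirror callback. */
	importer_quiesce_hw(iobj);
	iobj->vma->vm_ops->p2p_unmap(iobj->vma, iobj->dev,
				     iobj->start, iobj->end);
	/* A later re-call of p2p_map() on the same, still valid, vma range
	 * is expected to succeed per the predictability guideline above. */
}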
I would think that the importing driver can assume the BAR page is kept alive until it calls unmap (presumably triggered by notifiers)?
ie the exporting driver sees the BAR page as pinned until unmap.
The intention with this patchset is that it is not a pin, ie the importer device _must_ abide by all mmu notifier invalidations and they can happen at any time. The importing device can however re-p2p_map the same range after an invalidation.
I would like to restrict this to importers that can invalidate for now, because I believe all the first devices to use this can support invalidation.
This seems reasonable (and sort of says importers not getting this from HMM need careful checking), was this in the comment above the ops?
I think I put it in the comment above the ops but in any case I should write something in the documentation with examples and thorough guidelines. Note that there won't be any mmu notifier for an mmap of a device file unless the device driver asks for it, or there is a syscall like munmap or mremap or mprotect, well any syscall that works on the vma.
So assuming the application is not doing something stupid, nor the driver, then the result of p2p_map can stay valid until the importer is done and calls p2p_unmap of its own free will. This is what I expect here. But for GPUs I would still like to allow the GPU driver to evict (and thus invalidate the importer mapping) to main memory, or to defragment its BAR address space, if the GPU driver feels a pressing need to do so.
If we ever want to support a full pin then we might have to add a flag so that the GPU driver can refuse an importer that wants things pinned forever.
Cheers, Jérôme
On Tue, Jan 29, 2019 at 07:08:06PM -0500, Jerome Glisse wrote:
On Tue, Jan 29, 2019 at 11:02:25PM +0000, Jason Gunthorpe wrote:
On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:
But this API doesn't seem to offer any control - I thought that control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
The control is within the driver implementation of those callbacks.
Seems like what you mean by control is 'the exporter gets to choose the physical address at the instant of map' - which seems reasonable for GPU.
will only allow p2p map to succeed for objects that have been tagged by userspace in some way, ie the userspace application is in control of what can be mapped to a peer device.
I would have thought this means the VMA for the object is created without the map/unmap ops? Or are GPU objects and VMAs unrelated?
GPU objects and VMAs are unrelated in all the open source GPU drivers I am somewhat familiar with (AMD, Intel, NVidia). You can create a GPU object and never map it (and thus never have it associated with a vma) and in fact this is very common. For graphics, usually only a handful of the hundreds of GPU objects your application has are mapped.
I mean the other way: does every VMA with a p2p_map/unmap point to exactly one GPU object?
ie I'm surprised you say that p2p_map needs to have policy, I would have thought the policy is applied when the VMA is created (ie objects that are not for p2p do not have p2p_map set), and even for GPU p2p_map should really only have to do with window allocation and pure 'can I even do p2p' type functionality.
The idea is that we can only ask exporters to be predictable and still allow them to fail if things are really going bad.
I think hot unplug / PCI error recovery is one of the 'really going bad' cases..
I think I put it in the comment above the ops but in any case I should write something in the documentation with examples and thorough guidelines. Note that there won't be any mmu notifier for an mmap of a device file unless the device driver asks for it, or there is a syscall like munmap or mremap or mprotect, well any syscall that works on the vma.
This is something we might need to explore, does calling zap_vma_ptes() invoke enough notifiers that an MMU notifier or HMM mirror consumer will release any p2p maps on that VMA?
If we ever want to support a full pin then we might have to add a flag so that the GPU driver can refuse an importer that wants things pinned forever.
This would become interesting for VFIO and RDMA at least - I don't think VFIO has anything like SVA so it would want to import a p2p_map and indicate that it will not respond to MMU notifiers.
GPU can refuse, but maybe RDMA would allow it...
Jason
On Wed, Jan 30, 2019 at 04:30:27AM +0000, Jason Gunthorpe wrote:
On Tue, Jan 29, 2019 at 07:08:06PM -0500, Jerome Glisse wrote:
On Tue, Jan 29, 2019 at 11:02:25PM +0000, Jason Gunthorpe wrote:
On Tue, Jan 29, 2019 at 03:44:00PM -0500, Jerome Glisse wrote:
But this API doesn't seem to offer any control - I thought that control was all coming from the mm/hmm notifiers triggering p2p_unmaps?
The control is within the driver implementation of those callbacks.
Seems like what you mean by control is 'the exporter gets to choose the physical address at the instant of map' - which seems reasonable for GPU.
will only allow p2p map to succeed for objects that have been tagged by userspace in some way, ie the userspace application is in control of what can be mapped to a peer device.
I would have thought this means the VMA for the object is created without the map/unmap ops? Or are GPU objects and VMAs unrelated?
GPU objects and VMAs are unrelated in all the open source GPU drivers I am somewhat familiar with (AMD, Intel, NVidia). You can create a GPU object and never map it (and thus never have it associated with a vma) and in fact this is very common. For graphics, usually only a handful of the hundreds of GPU objects your application has are mapped.
I mean the other way: does every VMA with a p2p_map/unmap point to exactly one GPU object?
ie I'm surprised you say that p2p_map needs to have policy, I would have thought the policy is applied when the VMA is created (ie objects that are not for p2p do not have p2p_map set), and even for GPU p2p_map should really only have to do with window allocation and pure 'can I even do p2p' type functionality.
All the userspace APIs to enable p2p happen after object creation and in some cases they are mutable, ie you can decide to no longer share the object (a userspace application decision). The BAR address space is a resource from the GPU driver point of view and thus from the userspace point of view. As such, decisions that affect how it is used and what objects can use it can change over the application lifetime.
This is why I would like to allow the kernel driver to apply any such access policy, decided by the application on its objects (on top of which the kernel GPU driver can apply its own policy for GPU resource sharing by forcing some objects to main memory).
The idea is that we can only ask exporters to be predictable and still allow them to fail if things are really going bad.
I think hot unplug / PCI error recovery is one of the 'really going bad' cases..
The GPU can hang and all data becomes _undefined_, it can also be suspended to save power (think laptop with discrete GPU for instance). GPU threads can be killed ... So there are a few cases I can think of where you either want to kill the p2p mapping and make sure the importer is aware and might have a chance to report back through its own userspace API, or at the very least fall back to dummy pages. In some of the above cases, for instance suspend, you just want to move things around to allow shutting down device memory.
I think I put it in the comment above the ops but in any case I should write something in the documentation with examples and thorough guidelines. Note that there won't be any mmu notifier for an mmap of a device file unless the device driver asks for it, or there is a syscall like munmap or mremap or mprotect, well any syscall that works on the vma.
This is something we might need to explore, does calling zap_vma_ptes() invoke enough notifiers that an MMU notifier or HMM mirror consumer will release any p2p maps on that VMA?
Yes it does.
If we ever want to support a full pin then we might have to add a flag so that the GPU driver can refuse an importer that wants things pinned forever.
This would become interesting for VFIO and RDMA at least - I don't think VFIO has anything like SVA so it would want to import a p2p_map and indicate that it will not respond to MMU notifiers.
GPU can refuse, but maybe RDMA would allow it...
Ok, I will add a flag field in the next post. GPUs could allow pinning but they would most likely use main memory for any such object, hence it is no longer really p2p but at least both devices look at the same data.
Cheers, Jérôme
On 2019-01-29 12:32 p.m., Jason Gunthorpe wrote:
Jerome, I think it would be nice to have a helper scheme - I think the simple case would be simple remapping of PCI BAR memory, so if we could have, say something like:
static const struct vm_operations_struct my_ops = {
	.p2p_map = p2p_ioremap_map_op,
	.p2p_unmap = p2p_ioremap_unmap_op,
};

struct ioremap_data { [..] };

fops_mmap() {
	vma->vm_private_data = &driver_priv->ioremap_data;
	return p2p_ioremap_device_memory(vma, exporting_device, [..]);
}
This is roughly what I was expecting, except I don't see exactly what the p2p_map and p2p_unmap callbacks are for. The importing driver should see p2pdma/hmm struct pages and use the appropriate function to map them. It shouldn't be the responsibility of the exporting driver to implement the mapping. And I don't think we should have 'special' vma's for this (though we may need something to ensure we don't get mapping requests mixed with different types of pages...).
I also figured there'd be a fault version of p2p_ioremap_device_memory() for when you are mapping P2P memory and you want to assign the pages lazily. Though, this can come later when someone wants to implement that.
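For the eager (non-fault) variant, such a helper could boil down to something like the sketch below. The helper itself and the way BAR space is reserved are hypothetical; io_remap_pfn_range() is the existing primitive assumed here, and the pgprot choice is just one option:

/* Hypothetical helper: map a vma-sized chunk of BAR space starting at
 * bar_phys into the whole vma up front. The exporting_device argument
 * is kept only to match the prototype sketched above.
 */
static int p2p_ioremap_device_memory(struct vm_area_struct *vma,
				     struct device *exporting_device,
				     phys_addr_t bar_phys)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	vma->vm_flags |= VM_IO | VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP;
	return io_remap_pfn_range(vma, vma->vm_start,
				  bar_phys >> PAGE_SHIFT, size,
				  pgprot_noncached(vma->vm_page_prot));
}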
Logan
On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
On 2019-01-29 12:32 p.m., Jason Gunthorpe wrote:
Jerome, I think it would be nice to have a helper scheme - I think the simple case would be simple remapping of PCI BAR memory, so if we could have, say something like:
static const struct vm_operations_struct my_ops = {
	.p2p_map = p2p_ioremap_map_op,
	.p2p_unmap = p2p_ioremap_unmap_op,
};

struct ioremap_data { [..] };

fops_mmap() {
	vma->vm_private_data = &driver_priv->ioremap_data;
	return p2p_ioremap_device_memory(vma, exporting_device, [..]);
}
This is roughly what I was expecting, except I don't see exactly what the p2p_map and p2p_unmap callbacks are for. The importing driver should see p2pdma/hmm struct pages and use the appropriate function to map them. It shouldn't be the responsibility of the exporting driver to implement the mapping. And I don't think we should have 'special' vma's for this (though we may need something to ensure we don't get mapping requests mixed with different types of pages...).
The GPU driver must be in control and must be called into. There are 2 cases in this patchset and I should have instead posted 2 separate patchsets as it seems that this is confusing things.
For the HMM page, the physical address of the page, ie the pfn, does not correspond to anything, ie there is nothing behind it. So the importing device has no idea how to get a valid physical address from an HMM page; only the device driver exporting its memory as HMM device memory knows that.
For the special vma, ie mmap of a device file: GPU drivers do manage their BAR, ie the GPU has a page table that maps BAR pages to GPU memory and the driver _constantly_ updates this page table; this is reflected by invalidating the CPU mapping. In fact most of the time the CPU mappings of GPU objects are invalid; they are valid only for a small fraction of their lifetime. So you _must_ have some call to inform the exporting device driver that another device would like to map one of its vmas. The exporting device can then try to avoid as much churn as possible for the importing device. But this has consequences, and the exporting device driver must be allowed to apply policy and make a decision on whether or not it authorizes the other device to peer map its memory. For GPUs the userspace application has to call specific APIs that translate into specific ioctls which themselves set flags on the object (in the kernel struct tracking the userspace object). The only way to allow program predictability is if the application can ask and know whether it can peer export an object (ie whether there is enough BAR space left).
Moreover, I would like to be able to use this API between GPUs that are inter-connected with each other, and for those the CPU page tables are just invalid and the physical addresses to use are only meaningful to the exporting and importing devices. So again here the core kernel has no idea of what the physical address would be.
So in both cases, at the very least for GPUs, we do want total control to be given to the exporter.
I also figured there'd be a fault version of p2p_ioremap_device_memory() for when you are mapping P2P memory and you want to assign the pages lazily. Though, this can come later when someone wants to implement that.
For GPUs the BAR address space is managed page by page and thus you do not want to map a range of BAR addresses; you want to allow mapping multiple pages of BAR addresses that are not adjacent to each other nor ordered in any way. But providing a helper for simpler devices does make sense.
Cheers, Jérôme
On 2019-01-29 1:57 p.m., Jerome Glisse wrote:
The GPU driver must be in control and must be called into. There are 2 cases in this patchset and I should have instead posted 2 separate patchsets as it seems that this is confusing things.
For the HMM page, the physical address of the page, ie the pfn, does not correspond to anything, ie there is nothing behind it. So the importing device has no idea how to get a valid physical address from an HMM page; only the device driver exporting its memory as HMM device memory knows that.
For the special vma, ie mmap of a device file: GPU drivers do manage their BAR, ie the GPU has a page table that maps BAR pages to GPU memory and the driver _constantly_ updates this page table; this is reflected by invalidating the CPU mapping. In fact most of the time the CPU mappings of GPU objects are invalid; they are valid only for a small fraction of their lifetime. So you _must_ have some call to inform the exporting device driver that another device would like to map one of its vmas. The exporting device can then try to avoid as much churn as possible for the importing device. But this has consequences, and the exporting device driver must be allowed to apply policy and make a decision on whether or not it authorizes the other device to peer map its memory. For GPUs the userspace application has to call specific APIs that translate into specific ioctls which themselves set flags on the object (in the kernel struct tracking the userspace object). The only way to allow program predictability is if the application can ask and know whether it can peer export an object (ie whether there is enough BAR space left).
This all seems like it's an HMM problem and not related to mapping BARs/"potential BARs" to userspace. If some code wants to DMA map HMM pages, it calls an HMM function to map them. If HMM needs to consult with the driver on aspects of how that's mapped, then that's between HMM and the driver and not something I really care about. But making the entire mapping stuff tied to userspace VMAs does not make sense to me. What if somebody wants to map some HMM pages in the same way but from kernel space and they therefore don't have a VMA?
I also figured there'd be a fault version of p2p_ioremap_device_memory() for when you are mapping P2P memory and you want to assign the pages lazily. Though, this can come later when someone wants to implement that.
For GPUs the BAR address space is managed page by page and thus you do not want to map a range of BAR addresses; you want to allow mapping multiple pages of BAR addresses that are not adjacent to each other nor ordered in any way. But providing a helper for simpler devices does make sense.
Well, this has little to do with the backing device but rather with how the memory is mapped into userspace. With p2p_ioremap_device_memory() the entire range is mapped into the userspace VMA immediately during the call to mmap(). With p2p_fault_device_memory(), mmap() would not actually map anything and a page in the VMA would be mapped only when userspace accesses it (using fault()). It seems to me like GPUs would prefer the latter, but if HMM takes care of the mapping from userspace potential pages to actual GPU pages through the BAR then that may not be true.
Logan
On Tue, Jan 29, 2019 at 02:30:49PM -0700, Logan Gunthorpe wrote:
On 2019-01-29 1:57 p.m., Jerome Glisse wrote:
The GPU driver must be in control and must be called into. There are 2 cases in this patchset and I should have instead posted 2 separate patchsets as it seems that this is confusing things.
For the HMM page, the physical address of the page, ie the pfn, does not correspond to anything, ie there is nothing behind it. So the importing device has no idea how to get a valid physical address from an HMM page; only the device driver exporting its memory as HMM device memory knows that.
For the special vma, ie mmap of a device file: GPU drivers do manage their BAR, ie the GPU has a page table that maps BAR pages to GPU memory and the driver _constantly_ updates this page table; this is reflected by invalidating the CPU mapping. In fact most of the time the CPU mappings of GPU objects are invalid; they are valid only for a small fraction of their lifetime. So you _must_ have some call to inform the exporting device driver that another device would like to map one of its vmas. The exporting device can then try to avoid as much churn as possible for the importing device. But this has consequences, and the exporting device driver must be allowed to apply policy and make a decision on whether or not it authorizes the other device to peer map its memory. For GPUs the userspace application has to call specific APIs that translate into specific ioctls which themselves set flags on the object (in the kernel struct tracking the userspace object). The only way to allow program predictability is if the application can ask and know whether it can peer export an object (ie whether there is enough BAR space left).
This all seems like it's an HMM problem and not related to mapping BARs/"potential BARs" to userspace. If some code wants to DMA map HMM pages, it calls an HMM function to map them. If HMM needs to consult with the driver on aspects of how that's mapped, then that's between HMM and the driver and not something I really care about. But making the entire mapping stuff tied to userspace VMAs does not make sense to me. What if somebody wants to map some HMM pages in the same way but from kernel space and they therefore don't have a VMA?
No, this is the non-HMM case I am talking about here. Fully ignore HMM in this frame. A GPU driver that does not support or use HMM in any way has all the properties and requirements I listed above. So all the points I was making are without HMM in the picture whatsoever. I should have posted this as separate patches to avoid this confusion.
Regarding your HMM question: you can not map HMM pages; any code path that would try to do so would trigger a migration back to regular memory and would use the regular memory for CPU access.
I also figured there'd be a fault version of p2p_ioremap_device_memory() for when you are mapping P2P memory and you want to assign the pages lazily. Though, this can come later when someone wants to implement that.
For GPUs the BAR address space is managed page by page and thus you do not want to map a range of BAR addresses; you want to allow mapping multiple pages of BAR addresses that are not adjacent to each other nor ordered in any way. But providing a helper for simpler devices does make sense.
Well, this has little to do with the backing device but rather with how the memory is mapped into userspace. With p2p_ioremap_device_memory() the entire range is mapped into the userspace VMA immediately during the call to mmap(). With p2p_fault_device_memory(), mmap() would not actually map anything and a page in the VMA would be mapped only when userspace accesses it (using fault()). It seems to me like GPUs would prefer the latter, but if HMM takes care of the mapping from userspace potential pages to actual GPU pages through the BAR then that may not be true.
Again HMM has nothing to do here, ignore HMM, it does not play any role and it is not involved in any way here. GPUs want to control which objects they allow other devices to access and which objects they do not. GPU drivers _constantly_ invalidate the CPU page table, and in fact the CPU page table does not have any valid pte for a vma that is an mmap of a GPU device file for most of the vma lifetime. Changing that would highly disrupt and break GPU drivers. They need to control that, they need to control what to do if another device tries to peer map some of their memory. Hence why they need to implement the callback and decide whether or not they allow the peer mapping or use device memory for it (they can decide to fall back to main memory).
If the exporter can not control this then it is useless to GPU drivers. I would rather not exclude GPU drivers from this.
Cheers, Jérôme
On 2019-01-29 2:50 p.m., Jerome Glisse wrote:
No, this is the non-HMM case I am talking about here. Fully ignore HMM in this frame. A GPU driver that does not support or use HMM in any way has all the properties and requirements I listed above. So all the points I was making are without HMM in the picture whatsoever. I should have posted this as separate patches to avoid this confusion.
Regarding your HMM question: you can not map HMM pages; any code path that would try to do so would trigger a migration back to regular memory and would use the regular memory for CPU access.
I thought this was the whole point of HMM... And eventually it would support being able to map the pages through the BAR in cooperation with the driver. If not, what's that whole layer for? Why not just have HMM handle this situation?
And what struct pages are actually going to be backing these VMAs if it's not using HMM?
Again HMM has nothing to do here, ignore HMM, it does not play any role and it is not involved in any way here. GPUs want to control which objects they allow other devices to access and which objects they do not. GPU drivers _constantly_ invalidate the CPU page table, and in fact the CPU page table does not have any valid pte for a vma that is an mmap of a GPU device file for most of the vma lifetime. Changing that would highly disrupt and break GPU drivers. They need to control that, they need to control what to do if another device tries to peer map some of their memory. Hence why they need to implement the callback and decide whether or not they allow the peer mapping or use device memory for it (they can decide to fall back to main memory).
But mapping is an operation of the memory/struct pages behind the VMA, not of the VMA itself, and I think that's evident from the code in that the only way the VMA layer is involved is the fact that you're abusing vm_ops by adding new ops there and calling them from other layers.
Logan
On Tue, Jan 29, 2019 at 03:58:45PM -0700, Logan Gunthorpe wrote:
On 2019-01-29 2:50 p.m., Jerome Glisse wrote:
No, this is the non-HMM case I am talking about here. Fully ignore HMM in this frame. A GPU driver that does not support or use HMM in any way has all the properties and requirements I listed above. So all the points I was making are without HMM in the picture whatsoever. I should have posted this as separate patches to avoid this confusion.
Regarding your HMM question: you can not map HMM pages; any code path that would try to do so would trigger a migration back to regular memory and would use the regular memory for CPU access.
I thought this was the whole point of HMM... And eventually it would support being able to map the pages through the BAR in cooperation with the driver. If not, what's that whole layer for? Why not just have HMM handle this situation?
The whole point is to allow using device memory for a range of virtual addresses of a process when it does make sense to use device memory for that range. There are multiple cases where it does make sense:
[1] - Only the device is accessing the range and there is no CPU access. For instance the program is executing/running a big function on the GPU and there is no concurrent CPU access; this is very common in all the existing GPGPU code. In fact AFAICT it is the most common pattern. So here you can use HMM private or public memory.
[2] - Both device and CPU access a common range of virtual addresses concurrently. In that case, if you are on a platform with a cache coherent inter-connect like OpenCAPI or CCIX, then you can use HMM public device memory and have both access the same memory. You can not use HMM private memory.
So far on x86 we only have PCIE and thus so far on x86 we only have private HMM device memory that is not accessible by the CPU in any way.
It does not make that memory useless, far from it. Having only the device work on the dataset while the CPU is either waiting or accessing something else is very common.
Then HMM is a toolbox, so here are some of the tools:
HMM mirror - helper to mirror a process address space onto a device, ie this is SVM (Shared Virtual Memory)/SVA (Shared Virtual Address) in software.
HMM private memory - allows registering device memory with the Linux kernel. The memory is not CPU accessible. The memory is fully managed by the device driver. What and when to migrate is under the control of the device driver.
HMM public memory - allows registering device memory with the Linux kernel. The memory must be CPU accessible, cache coherent, and abide by the platform memory model. The memory is fully managed by the device driver because otherwise it would disrupt the device driver operation (for instance a GPU can also be used for graphics).
Migration - helper to perform migration to and from device memory. It does not make any decision on its own; it just performs all the steps in the right order and calls back into the driver to get the migration going.
It is up to the device driver to implement heuristics and provide a userspace API to control memory migration to and from device memory. For device private memory, on CPU page fault the kernel will force a migration back to system memory so that the CPU can access the memory. What matters here is that the memory model of the platform stays intact and thus you can safely use CPU atomic operations or rely on your platform memory model for your program. Note that long term I would like to define a common API to expose to userspace to manage memory binding to specific device memory, so that we can mix and match multiple device memories within a single process and define policy too.
Also, CPU atomic instructions to a PCIE BAR give _undefined_ results and in fact on some AMD/Intel platforms they lead to weirdness/crash/freeze. So obviously we can not map a PCIE BAR to the CPU without breaking the memory model. Moreover, on PCIE you might not be able to resize the BAR to expose all the device memory. GPUs can have several gigabytes of memory and not all of them support PCIE BAR resize, and sometimes PCIE BAR resize does not work either because of BIOS/firmware issues or simply because you are running out of IO space.
So on x86 we are stuck with HMM private memory; I am hoping that some day in the future we will have CCIX or something similar. But for now we have to work with what we have.
And what struct pages are actually going to be backing these VMAs if it's not using HMM?
When you have some range of virtual addresses migrated to HMM private memory then the CPU ptes are special swap entries and they behave just as if the memory was swapped to disk. So CPU accesses to those will fault and trigger a migration back to main memory.
We still want to allow peer to peer to exist when using HMM memory for a range of virtual addresses (of a vma that is not an mmap of a device file) because the peer device does not rely on atomics or on the platform memory model. In those cases we assume that the importer is aware of the limitations and is asking for access in good faith, and thus we want to allow the exporting device to either allow the peer mapping (because it has enough BAR addresses to map) or fall back to main memory.
Again HMM has nothing to do here, ignore HMM, it does not play any role and it is not involved in any way here. GPUs want to control which objects they allow other devices to access and which objects they do not. GPU drivers _constantly_ invalidate the CPU page table, and in fact the CPU page table does not have any valid pte for a vma that is an mmap of a GPU device file for most of the vma lifetime. Changing that would highly disrupt and break GPU drivers. They need to control that, they need to control what to do if another device tries to peer map some of their memory. Hence why they need to implement the callback and decide whether or not they allow the peer mapping or use device memory for it (they can decide to fall back to main memory).
But mapping is an operation of the memory/struct pages behind the VMA, not of the VMA itself, and I think that's evident from the code in that the only way the VMA layer is involved is the fact that you're abusing vm_ops by adding new ops there and calling them from other layers.
For GPU drivers the vma ptes are populated on CPU page fault and they get cleared quickly after. A very usual pattern is:
- The CPU writes something to the object through the object mapping, ie through a vma. This triggers a page fault which calls the fault() callback from the vm_operations struct. This populates the page table for the vma.
- Userspace launches commands on the GPU; the first thing the kernel does is clear all CPU page table entries for objects listed in the commands, ie we do not expect any further CPU access nor do we want it.
A minimal sketch of this pattern follows below.
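The driver helpers here are made up; vmf_insert_pfn() and zap_vma_ptes() are the existing kernel primitives assumed:

static vm_fault_t my_gem_fault(struct vm_fault *vmf)
{
	struct my_obj *obj = vmf->vma->vm_private_data;
	unsigned long pfn = my_obj_cpu_pfn(obj, vmf->address - vmf->vma->vm_start);

	/* Populate the pte only when the CPU actually touches the object. */
	return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
}

static void my_submit_commands(struct my_obj *obj)
{
	/* Object is about to be used by the GPU: drop all CPU mappings. */
	if (obj->vma)
		zap_vma_ptes(obj->vma, obj->vma->vm_start,
			     obj->vma->vm_end - obj->vma->vm_start);
	/* ... queue the commands to the GPU ... */
}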
GPU drivers have always been geared toward minimizing CPU access to GPU memory. For objects that need to be accessed by both concurrently we use main memory and not device memory.
So in fact you will almost never have valid ptes for an mmap of a GPU object (done through the GPU device file). However it does not mean that we want to block peer to peer from happening. Today the use cases we know for peer to peer are with GPUDirect (NVidia) or ROCm RDMA (AMD), roughly the same thing. The most common use cases I am aware of are:
- RDMA is streaming input directly into GPU memory, avoiding the need for a bounce buffer in main memory (this saves both main memory and PCIE bandwidth by avoiding RDMA->main then main->GPU).
- RDMA is streaming out results (same idea as streaming in but in the other direction :)).
- RDMA is used to monitor computation progress on the GPU and it tries to do so with minimal disruption to the GPU. So RDMA would like to be able to peek into GPU memory to fetch some values and transmit them over the network.
I believe people would like to have more complex use cases, like for instance having the GPU be able to directly control some RDMA queue to request data from some other host on the network, or control some block device queue to read data from the block device directly. I believe those can be implemented with the API set forward in these patches.
So for the above use cases it is fine to not have valid CPU ptes and only have the peer to peer mapping. The CPU is not expected to be involved and we should not make it a requirement. Hence we should not expect to have valid ptes.
Also, another common case is that the GPU driver might leave ptes that point to main memory while the GPU is using device memory for the object corresponding to the vma those ptes are in. The expectation is that the CPU accesses are synchronized with the device accesses through the API used by the application. Note that here we are talking about the non-HMM, non-SVM case, ie special objects that are allocated through API-specific functions that result in driver ioctls and an mmap of the device file.
Hope this helps with understanding the big picture from the GPU driver point of view :)
Cheers, Jérôme
On 2019-01-29 4:47 p.m., Jerome Glisse wrote:
The whole point is to allow using device memory for a range of virtual addresses of a process when it does make sense to use device memory for that range. There are multiple cases where it does make sense:
[1] - Only the device is accessing the range and there is no CPU access. For instance the program is executing/running a big function on the GPU and there is no concurrent CPU access; this is very common in all the existing GPGPU code. In fact AFAICT it is the most common pattern. So here you can use HMM private or public memory.
[2] - Both device and CPU access a common range of virtual addresses concurrently. In that case, if you are on a platform with a cache coherent inter-connect like OpenCAPI or CCIX, then you can use HMM public device memory and have both access the same memory. You can not use HMM private memory.
So far on x86 we only have PCIE and thus so far on x86 we only have private HMM device memory that is not accessible by the CPU in any way.
I feel like you're just moving the rug out from under us... Before you said ignore HMM and I was asking about the use case that wasn't using HMM and how it works without HMM. In response, you just give me *way* too much information describing HMM. And still, as best as I can see, managing DMA mappings (which is different from the userspace mappings) for GPU P2P should be handled by HMM and the userspace mappings should *just* link VMAs to HMM pages using the standard infrastructure we already have.
And what struct pages are actually going to be backing these VMAs if it's not using HMM?
When you have some range of virtual addresses migrated to HMM private memory then the CPU ptes are special swap entries and they behave just as if the memory was swapped to disk. So CPU accesses to those will fault and trigger a migration back to main memory.
This isn't answering my question at all... I specifically asked what is backing the VMA when we are *not* using HMM.
Logan
On Tue, Jan 29, 2019 at 06:17:43PM -0700, Logan Gunthorpe wrote:
On 2019-01-29 4:47 p.m., Jerome Glisse wrote:
The whole point is to allow using device memory for a range of virtual addresses of a process when it does make sense to use device memory for that range. There are multiple cases where it does make sense:
[1] - Only the device is accessing the range and there is no CPU access. For instance the program is executing/running a big function on the GPU and there is no concurrent CPU access; this is very common in all the existing GPGPU code. In fact AFAICT it is the most common pattern. So here you can use HMM private or public memory.
[2] - Both device and CPU access a common range of virtual addresses concurrently. In that case, if you are on a platform with a cache coherent inter-connect like OpenCAPI or CCIX, then you can use HMM public device memory and have both access the same memory. You can not use HMM private memory.
So far on x86 we only have PCIE and thus so far on x86 we only have private HMM device memory that is not accessible by the CPU in any way.
I feel like you're just moving the rug out from under us... Before you said ignore HMM and I was asking about the use case that wasn't using HMM and how it works without HMM. In response, you just give me *way* too much information describing HMM. And still, as best as I can see, managing DMA mappings (which is different from the userspace mappings) for GPU P2P should be handled by HMM and the userspace mappings should *just* link VMAs to HMM pages using the standard infrastructure we already have.
For an HMM P2P mapping we need to call into the driver to know if the driver wants to fall back to main memory (running out of BAR addresses) or if it can allow a peer device to directly access its memory. We also need the call into the exporting device driver as only the exporting device driver can map the HMM page pfn to some physical BAR address (which would be allocated by the driver, for a GPU).
I wanted to make sure the HMM case was understood too; sorry if it caused confusion with the non-HMM case which I describe below.
And what struct pages are actually going to be backing these VMAs if it's not using HMM?
When you have some range of virtual addresses migrated to HMM private memory then the CPU ptes are special swap entries and they behave just as if the memory was swapped to disk. So CPU accesses to those will fault and trigger a migration back to main memory.
This isn't answering my question at all... I specifically asked what is backing the VMA when we are *not* using HMM.
So when you are not using HMM, ie an existing GPU object without HMM, then like I said you do not have any valid pte most of the time inside the CPU page table, ie the GPU driver only populates the ptes with valid entries when there are CPU page faults and it clears those as soon as the corresponding object is used by the GPU. In fact some drivers also unmap it aggressively from the BAR, making the memory totally inaccessible to anything but the GPU.
GPU drivers do not like CPU mappings, they are quite aggressive about clearing them. Then everything I said about having userspace decide which objects can be shared, and with whom, does apply here. So for GPUs you do want to give control to the GPU driver and you do not want to require valid CPU ptes for the vma, so that the exporting driver can return valid addresses to the importing peer device only.
Also the exporting device driver might decide to fall back to main memory (running out of BAR addresses for instance). So again here we want to go through the exporting device driver so that it can take the right action.
So the expected pattern (for a GPU driver) is:
- no valid ptes for the special vma (mmap of a device file)
- the importing device calls p2p_map() for the vma; if it succeeds the first time then we expect it will succeed for the same vma and range the next time we call it.
- the exporting driver can either return physical addresses of pages in its BAR space that point to the correct device memory, or fall back to main memory
Then at any point in time:
- if the GPU driver wants to move the object around (for whatever reason) it calls zap_vma_ptes(); the fact that there are no valid CPU ptes does not matter, it will call the mmu notifiers and thus any importing device driver will invalidate its mapping
- an importing device driver that lost the mapping due to the mmu notification can re-map by re-calling p2p_map() (it should check that the vma is still valid ...) and the guideline is for the exporting device driver to succeed and return valid addresses for the new memory used for the object
This allows device drivers like GPU drivers to keep control. The expected pattern is still for the p2p mappings to stay undisrupted for their whole lifetime. Invalidation should only be triggered if the GPU driver does need to move things around.
All the above is for the non-HMM case, ie mmap of a device file, so for any existing open source GPU device driver that does not support HMM.
Cheers, Jérôme
On Tue, Jan 29, 2019 at 06:17:43PM -0700, Logan Gunthorpe wrote:
This isn't answering my question at all... I specifically asked what is backing the VMA when we are *not* using HMM.
At least for RDMA what backs the VMA today is non-struct-page BAR memory filled in with io_remap_pfn.
And we want to expose this for P2P DMA. None of the HMM stuff applies here and the p2p_map/unmap are a nice simple approach that covers all the needs RDMA has, at least.
Every attempt to give BAR memory to struct page has run into major trouble, IMHO, so I like that this approach avoids that.
And if you don't have struct page then the only kernel object left to hang meta data off is the VMA itself.
It seems very similar to the existing P2P work between in-kernel consumers, just that VMA is now mediating a general user space driven discovery process instead of being hard wired into a driver.
Jason
On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
Every attempt to give BAR memory to struct page has run into major trouble, IMHO, so I like that this approach avoids that.
And if you don't have struct page then the only kernel object left to hang meta data off is the VMA itself.
It seems very similar to the existing P2P work between in-kernel consumers, just that VMA is now mediating a general user space driven discovery process instead of being hard wired into a driver.
But the kernel now has P2P bars backed by struct pages and it works well. And that's what we are doing in-kernel. We even have a hacky out-of-tree module which exposes these pages and it also works (but would need Jerome's solution for denying those pages in GUP, etc). So why do something completely different in userspace so they can't share any of the DMA map infrastructure?
Logan
On Wed, Jan 30, 2019 at 10:17:27AM -0700, Logan Gunthorpe wrote:
On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
Every attempt to give BAR memory to struct page has run into major trouble, IMHO, so I like that this approach avoids that.
And if you don't have struct page then the only kernel object left to hang meta data off is the VMA itself.
It seems very similar to the existing P2P work between in-kernel consumers, just that VMA is now mediating a general user space driven discovery process instead of being hard wired into a driver.
But the kernel now has P2P bars backed by struct pages and it works well.
I don't think it works that well..
We ended up with a 'sgl' that is not really a sgl, and doesn't work with many of the common SGL patterns. sg_copy_buffer doesn't work, dma_map doesn't work, sg_page doesn't work quite right, etc.
Only nvme and rdma got the special hacks to make them understand these p2p-sgls, and I'm still not convinced some of the RDMA drivers that want access to CPU addresses from the SGL (rxe, usnic, hfi, qib) don't break in this scenario.
Since the SGLs become broken, it pretty much means there is no path to make GUP work generically, we have to go through and make everything safe to use with p2p-sgls before allowing GUP. Which, frankly, sounds impossible with all the competing objections.
But GPU seems to have a problem unrelated to this - what Jerome wants is to have two faulting domains for VMA's - visible-to-cpu and visible-to-dma. The new op is essentially faulting the pages into the visible-to-dma category and leaving them invisible-to-cpu.
So that duality would still have to exists, and I think p2p_map/unmap is a much simpler implementation than trying to create some kind of special PTE in the VMA..
At least for RDMA, struct page or not doesn't really matter.
We can make struct pages for the BAR the same way NVMe does. GPU is probably the same, just with more memory at stake?
And maybe this should be the first implementation. The p2p_map VMA operation should return a SGL and the caller should do the existing pci_p2pdma_map_sg() flow..
Worry about optimizing away the struct page overhead later?
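Very roughly, such a flow on the importer side could look like the sketch below, assuming a p2p_map variant that fills an SGL with the exporter's p2pdma struct pages (that op is hypothetical; is_pci_p2pdma_page() and pci_p2pdma_map_sg() are the existing helpers assumed, and the exact return conventions may differ by kernel version):

static int importer_map_sgl(struct device *dma_dev,
			    struct vm_area_struct *vma,
			    struct scatterlist *sgl, int nents)
{
	/* Hypothetical op: exporter fills sgl with its BAR struct pages. */
	int ret = vma->vm_ops->p2p_map(vma, dma_dev, sgl, nents);

	if (ret)
		return ret;

	/* Assumes a homogeneous SGL of p2pdma pages, which is the current
	 * limitation being discussed. */
	if (is_pci_p2pdma_page(sg_page(sgl)))
		return pci_p2pdma_map_sg(dma_dev, sgl, nents,
					 DMA_BIDIRECTIONAL) ? 0 : -EIO;

	return dma_map_sg(dma_dev, sgl, nents, DMA_BIDIRECTIONAL) ? 0 : -EIO;
}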
Jason
On Wed, Jan 30, 2019 at 06:56:59PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 10:17:27AM -0700, Logan Gunthorpe wrote:
On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
Every attempt to give BAR memory to struct page has run into major trouble, IMHO, so I like that this approach avoids that.
And if you don't have struct page then the only kernel object left to hang meta data off is the VMA itself.
It seems very similar to the existing P2P work between in-kernel consumers, just that VMA is now mediating a general user space driven discovery process instead of being hard wired into a driver.
But the kernel now has P2P bars backed by struct pages and it works well.
I don't think it works that well..
We ended up with a 'sgl' that is not really a sgl, and doesn't work with many of the common SGL patterns. sg_copy_buffer doesn't work, dma_map doesn't work, sg_page doesn't work quite right, etc.
Only nvme and rdma got the special hacks to make them understand these p2p-sgls, and I'm still not convinced some of the RDMA drivers that want access to CPU addresses from the SGL (rxe, usnic, hfi, qib) don't break in this scenario.
Since the SGLs become broken, it pretty much means there is no path to make GUP work generically, we have to go through and make everything safe to use with p2p-sgls before allowing GUP. Which, frankly, sounds impossible with all the competing objections.
But GPU seems to have a problem unrelated to this - what Jerome wants is to have two faulting domains for VMA's - visible-to-cpu and visible-to-dma. The new op is essentially faulting the pages into the visible-to-dma category and leaving them invisible-to-cpu.
So that duality would still have to exists, and I think p2p_map/unmap is a much simpler implementation than trying to create some kind of special PTE in the VMA..
At least for RDMA, struct page or not doesn't really matter.
We can make struct pages for the BAR the same way NVMe does. GPU is probably the same, just with more memory at stake?
And maybe this should be the first implementation. The p2p_map VMA operation should return a SGL and the caller should do the existing pci_p2pdma_map_sg() flow..
For GPUs it would not work; a GPU might want to use main memory (because it is running out of BAR space), and it is a lot easier if the p2p_map callback calls the right dma map function (for page or io) rather than having to define some format that would pass down the information.
Worry about optimizing away the struct page overhead later?
Struct page does not fit well for GPUs as the BAR addresses can be reprogrammed to point to any page inside the device memory (think 256M BAR versus 16GB device memory). Forcing struct page on GPU drivers would require major surgery to the GPU driver inner workings and there is no benefit to be had from the struct page. So it is hard to justify this.
Cheers, Jérôme
On Wed, Jan 30, 2019 at 02:22:34PM -0500, Jerome Glisse wrote:
For GPUs it would not work; a GPU might want to use main memory (because it is running out of BAR space), and it is a lot easier if the p2p_map callback calls the right dma map function (for page or io) rather than having to define some format that would pass down the information.
This is already sort of built into the sgl, you are supposed to use is_pci_p2pdma_page() and pci_p2pdma_map_sg() and somehow it is supposed to work out - but I think this is also fairly incomplete.
ie the current APIs seem to assume the SGL is homogeneous :(
Worry about optimizing away the struct page overhead later?
Struct page does not fit well for GPUs as the BAR addresses can be reprogrammed to point to any page inside the device memory (think 256M BAR versus 16GB device memory).
The struct page only points to the BAR - it is not related to the actual GPU memory in any way. The struct page is just an alternative way to specify the physical address of the BAR page.
I think this boils down to one call to set up the entire BAR, like nvme does, and then using the struct page in the p2p_map SGL??
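For reference, that one-time setup is roughly what nvme does today with the existing p2pdma helpers (the BAR index and sizes below are placeholders):

static int my_probe_p2pmem(struct pci_dev *pdev)
{
	/* Publish (part of) a BAR as p2pdma memory so struct pages exist
	 * for it; size 0 is assumed to mean the whole BAR here. */
	int ret = pci_p2pdma_add_resource(pdev, 4 /* BAR */, 0, 0);

	if (ret)
		return ret;

	/* Chunks of that BAR can then be handed out as struct pages,
	 * e.g. pci_alloc_p2pmem(pdev, SZ_64K), and placed into the SGL
	 * returned by p2p_map. */
	return 0;
}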
Jason
On 2019-01-30 12:38 p.m., Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 02:22:34PM -0500, Jerome Glisse wrote:
For GPUs it would not work; a GPU might want to use main memory (because it is running out of BAR space), and it is a lot easier if the p2p_map callback calls the right dma map function (for page or io) rather than having to define some format that would pass down the information.
This is already sort of built into the sgl, you are supposed to use is_pci_p2pdma_page() and pci_p2pdma_map_sg() and somehow it is supposed to work out - but I think this is also fairly incomplete.
ie the current APIs seem to assume the SGL is homogeneous :(
We never changed SGLs. We still use them to pass p2pdma pages, only we need to be a bit careful where we send the entire SGL. I see no reason why we can't continue to be careful once they're in userspace if there's something in GUP to deny them.
It would be nice to have heterogeneous SGLs and it is something we should work toward but in practice they aren't really necessary at the moment.
Worry about optimizing away the struct page overhead later?
Struct page does not fit well for GPUs as the BAR addresses can be reprogrammed to point to any page inside the device memory (think 256M BAR versus 16GB device memory).
The struct page only points to the BAR - it is not related to the actual GPU memory in any way. The struct page is just an alternative way to specify the physical address of the BAR page.
That doesn't even necessarily need to be the case. For HMM, I understand, struct pages may not point to any accessible memory and the memory that backs them (or not) may change over their lifetime. So they don't have to be strictly tied to BAR addresses. p2pdma pages are strictly tied to BAR addresses though.
Logan
On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
We never changed SGLs. We still use them to pass p2pdma pages, only we need to be a bit careful where we send the entire SGL. I see no reason why we can't continue to be careful once they're in userspace if there's something in GUP to deny them.
It would be nice to have heterogeneous SGLs and it is something we should work toward but in practice they aren't really necessary at the moment.
RDMA generally cannot cope well with an API that requires homogeneous SGLs.. User space can construct complex MRs (particularly with the proposed SGL MR flow) and we must marshal that into a single SGL or the drivers fall apart.
Jerome explained that GPU is worse, a single VMA may have a random mix of CPU or device pages..
This is a pretty big blocker that would have to somehow be fixed.
That doesn't even necessarily need to be the case. For HMM, I understand, struct pages may not point to any accessible memory and the memory that backs them (or not) may change over their lifetime. So they don't have to be strictly tied to BAR addresses. p2pdma pages are strictly tied to BAR addresses though.
No idea, but at least for this case I don't think we need magic HMM pages to make simple VMA ops p2p_map/umap work..
Jason
On Wed, Jan 30, 2019 at 08:11:19PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
We never changed SGLs. We still use them to pass p2pdma pages, only we need to be a bit careful where we send the entire SGL. I see no reason why we can't continue to be careful once they're in userspace if there's something in GUP to deny them.
It would be nice to have heterogeneous SGLs and it is something we should work toward but in practice they aren't really necessary at the moment.
RDMA generally cannot cope well with an API that requires homogeneous SGLs.. User space can construct complex MRs (particularly with the proposed SGL MR flow) and we must marshal that into a single SGL or the drivers fall apart.
Jerome explained that GPU is worse, a single VMA may have a random mix of CPU or device pages..
This is a pretty big blocker that would have to somehow be fixed.
Note that HMM takes care of that for RDMA ODP with my ODP-to-HMM patch, so what you get for an ODP umem is just a list of dma addresses you can program your device with. The aim is to avoid the driver having to care about that. The access policy when the UMEM object is created by userspace through the verbs API should however ascertain that for an mmap of a device file it is only creating a UMEM that is fully covered by one and only one vma. GPU device drivers will have one vma per logical GPU object. I expect other kinds of devices to do the same so that they can match a vma to a unique object in their driver.
That doesn't even necessarily need to be the case. For HMM, I understand, struct pages may not point to any accessible memory and the memory that backs them (or not) may change over their lifetime. So they don't have to be strictly tied to BAR addresses. p2pdma pages are strictly tied to BAR addresses though.
No idea, but at least for this case I don't think we need magic HMM pages to make simple VMA ops p2p_map/umap work..
Yes, you do not need struct page for a simple driver; if we start creating struct pages for all PCIE BARs we are gonna waste a lot of memory and resources for no good reason. I doubt all of the PCIE BAR of a device enabling p2p will ever be mapped as p2p. So simple drivers do not need struct page, and GPU drivers that do not use HMM (all GPUs that are more than 2 years old) do not need struct page. Struct page is a burden here more than anything else. I have not seen one good thing the struct page gives you.
Cheers, Jérôme
On Wed, Jan 30, 2019 at 03:43:32PM -0500, Jerome Glisse wrote:
On Wed, Jan 30, 2019 at 08:11:19PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
We never changed SGLs. We still use them to pass p2pdma pages; we only need to be a bit careful where we send the entire SGL. I see no reason why we can't continue to be careful once they're in userspace if there's something in GUP to deny them.
It would be nice to have heterogeneous SGLs and it is something we should work toward but in practice they aren't really necessary at the moment.
RDMA generally cannot cope well with an API that requires homogeneous SGLs.. User space can construct complex MRs (particularly with the proposed SGL MR flow) and we must marshal that into a single SGL or the drivers fall apart.
Jerome explained that GPU is worse, a single VMA may have a random mix of CPU or device pages..
This is a pretty big blocker that would have to somehow be fixed.
Note that HMM takes care of that for RDMA ODP with my ODP-to-HMM patch: what you get for an ODP umem is just a list of dma addresses you can program your device with. The aim is to avoid the driver having to care about that. The access policy, when the UMEM object is created by userspace through the verbs API, should however ascertain that for an mmap of a device file it only creates a UMEM that is fully covered by one and only one vma. A GPU device driver will have one vma per logical GPU object. I expect other kinds of devices to do the same so that they can match a vma to a unique object in their driver.
A one VMA rule is not really workable.
With ODP, VMA boundaries can move around across the lifetime of the MR, and we have no obvious way to fail anything if userspace puts a VMA boundary in the middle of an existing ODP MR address range.
I think the HMM mirror API really needs to deal with this for the driver somehow.
Jason
On Wed, Jan 30, 2019 at 08:50:00PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 03:43:32PM -0500, Jerome Glisse wrote:
On Wed, Jan 30, 2019 at 08:11:19PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
We never changed SGLs. We still use them to pass p2pdma pages; we only need to be a bit careful where we send the entire SGL. I see no reason why we can't continue to be careful once they're in userspace if there's something in GUP to deny them.
It would be nice to have heterogeneous SGLs and it is something we should work toward but in practice they aren't really necessary at the moment.
RDMA generally cannot cope well with an API that requires homogeneous SGLs.. User space can construct complex MRs (particularly with the proposed SGL MR flow) and we must marshal that into a single SGL or the drivers fall apart.
Jerome explained that GPU is worse, a single VMA may have a random mix of CPU or device pages..
This is a pretty big blocker that would have to somehow be fixed.
Note that HMM takes care of that for RDMA ODP with my ODP-to-HMM patch: what you get for an ODP umem is just a list of dma addresses you can program your device with. The aim is to avoid the driver having to care about that. The access policy, when the UMEM object is created by userspace through the verbs API, should however ascertain that for an mmap of a device file it only creates a UMEM that is fully covered by one and only one vma. A GPU device driver will have one vma per logical GPU object. I expect other kinds of devices to do the same so that they can match a vma to a unique object in their driver.
A one VMA rule is not really workable.
With ODP, VMA boundaries can move around across the lifetime of the MR, and we have no obvious way to fail anything if userspace puts a VMA boundary in the middle of an existing ODP MR address range.
This is true only for vma that are not an mmap of a device file. This is what i was trying to get across. An mmap of a file is never merged, so it can only get split/butchered by munmap/mremap, but when that happens you also need to reflect the virtual address space change to the device, ie any access to a now-invalid range must trigger an error.
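On the importing side this is just the usual mirror pattern: whatever invalidation callback the driver already registers must tear down the device mapping for the affected range. A very rough sketch (every name here is a placeholder, not an existing API):

/* Placeholder invalidation hook: whatever notifier the importing driver
 * already uses (HMM mirror / mmu notifier) lands here when the CPU
 * mapping changes (munmap, mremap, ...).
 */
static void mydev_invalidate_range(struct mydev_mirror *mirror,
                                   unsigned long start, unsigned long end)
{
        /* Tear down the device-side mapping for [start, end) ... */
        mydev_unmap_range(mirror->dev, start, end);

        /* ... so that any further device access to that range faults
         * and is reported as an error instead of hitting stale pages.
         */
        mydev_mark_range_invalid(mirror->dev, start, end);
}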
I think the HMM mirror API really needs to deal with this for the driver somehow.
Yes the HMM does deal with this for you, you do not have to worry about it. Sorry if that was not clear. I just wanted to stress that vma that are mmap of a file do not behave like other vma hence when you create the UMEM you can check for those if you feel the need.
Cheers, Jérôme
On Wed, Jan 30, 2019 at 04:45:25PM -0500, Jerome Glisse wrote:
On Wed, Jan 30, 2019 at 08:50:00PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 03:43:32PM -0500, Jerome Glisse wrote:
On Wed, Jan 30, 2019 at 08:11:19PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
We never changed SGLs. We still use them to pass p2pdma pages; we only need to be a bit careful where we send the entire SGL. I see no reason why we can't continue to be careful once they're in userspace if there's something in GUP to deny them.
It would be nice to have heterogeneous SGLs and it is something we should work toward but in practice they aren't really necessary at the moment.
RDMA generally cannot cope well with an API that requires homogeneous SGLs.. User space can construct complex MRs (particularly with the proposed SGL MR flow) and we must marshal that into a single SGL or the drivers fall apart.
Jerome explained that GPU is worse, a single VMA may have a random mix of CPU or device pages..
This is a pretty big blocker that would have to somehow be fixed.
Note that HMM takes care of that for RDMA ODP with my ODP-to-HMM patch: what you get for an ODP umem is just a list of dma addresses you can program your device with. The aim is to avoid the driver having to care about that. The access policy, when the UMEM object is created by userspace through the verbs API, should however ascertain that for an mmap of a device file it only creates a UMEM that is fully covered by one and only one vma. A GPU device driver will have one vma per logical GPU object. I expect other kinds of devices to do the same so that they can match a vma to a unique object in their driver.
A one VMA rule is not really workable.
With ODP, VMA boundaries can move around across the lifetime of the MR, and we have no obvious way to fail anything if userspace puts a VMA boundary in the middle of an existing ODP MR address range.
This is true only for vma that are not an mmap of a device file. This is what i was trying to get across. An mmap of a file is never merged, so it can only get split/butchered by munmap/mremap, but when that happens you also need to reflect the virtual address space change to the device, ie any access to a now-invalid range must trigger an error.
Why is it invalid? The address range still has valid process memory?
What is the problem in the HMM mirror that it needs this restriction?
There is also the situation where we create an ODP MR that spans 0 -> U64_MAX in the process address space. In this case there are lots of different VMAs it covers and we expect it to fully track all changes to all VMAs.
So we have to spin up dedicated umem_odps that carefully span single VMAs, and somehow track changes to VMA ?
mlx5 odp does some of this already.. But yikes, this needs some pretty careful testing in all these situations.
I think the HMM mirror API really needs to deal with this for the driver somehow.
Yes the HMM does deal with this for you, you do not have to worry about it. Sorry if that was not clear. I just wanted to stress that vma that are mmap of a file do not behave like other vma hence when you create the UMEM you can check for those if you feel the need.
What properties do we get from HMM mirror? Will it tell us when to create more umems to cover VMA seams or will it just cause undesired no-mapped failures in some cases?
Jason
On Wed, Jan 30, 2019 at 09:56:07PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 04:45:25PM -0500, Jerome Glisse wrote:
On Wed, Jan 30, 2019 at 08:50:00PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 03:43:32PM -0500, Jerome Glisse wrote:
On Wed, Jan 30, 2019 at 08:11:19PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 01:00:02PM -0700, Logan Gunthorpe wrote:
We never changed SGLs. We still use them to pass p2pdma pages; we only need to be a bit careful where we send the entire SGL. I see no reason why we can't continue to be careful once they're in userspace if there's something in GUP to deny them.
It would be nice to have heterogeneous SGLs and it is something we should work toward but in practice they aren't really necessary at the moment.
RDMA generally cannot cope well with an API that requires homogeneous SGLs.. User space can construct complex MRs (particularly with the proposed SGL MR flow) and we must marshal that into a single SGL or the drivers fall apart.
Jerome explained that GPU is worse, a single VMA may have a random mix of CPU or device pages..
This is a pretty big blocker that would have to somehow be fixed.
Note that HMM takes care of that for RDMA ODP with my ODP-to-HMM patch: what you get for an ODP umem is just a list of dma addresses you can program your device with. The aim is to avoid the driver having to care about that. The access policy, when the UMEM object is created by userspace through the verbs API, should however ascertain that for an mmap of a device file it only creates a UMEM that is fully covered by one and only one vma. A GPU device driver will have one vma per logical GPU object. I expect other kinds of devices to do the same so that they can match a vma to a unique object in their driver.
A one VMA rule is not really workable.
With ODP, VMA boundaries can move around across the lifetime of the MR, and we have no obvious way to fail anything if userspace puts a VMA boundary in the middle of an existing ODP MR address range.
This is true only for vma that are not an mmap of a device file. This is what i was trying to get across. An mmap of a file is never merged, so it can only get split/butchered by munmap/mremap, but when that happens you also need to reflect the virtual address space change to the device, ie any access to a now-invalid range must trigger an error.
Why is it invalid? The address range still has valid process memory?
If you do munmap(A, size) then all addresses in the range [A, A+size] are invalid. This is what i am referring to here. Same for mremap.
What is the problem in the HMM mirror that it needs this restriction?
No restriction at all here. I think i just wasn't understood.
There is also the situation where we create an ODP MR that spans 0 -> U64_MAX in the process address space. In this case there are lots of different VMAs it covers and we expect it to fully track all changes to all VMAs.
Yes, and that works. However, any memory access above TASK_SIZE will return -EFAULT, as that is kernel address space; you can only access what is a valid process virtual address.
So we have to spin up dedicated umem_odps that carefully span single VMAs, and somehow track changes to VMA ?
No you do not.
mlx5 odp does some of this already.. But yikes, this needs some pretty careful testing in all these situations.
Sorry if i confused you even more than the first time. Everything works you have nothing to worry about :)
I think the HMM mirror API really needs to deal with this for the driver somehow.
Yes the HMM does deal with this for you, you do not have to worry about it. Sorry if that was not clear. I just wanted to stress that vma that are mmap of a file do not behave like other vma hence when you create the UMEM you can check for those if you feel the need.
What properties do we get from HMM mirror? Will it tell us when to create more umems to cover VMA seams or will it just cause undesired no-mapped failures in some cases?
You do not get anything from HMM mirror today; if you want to know, you have to look at the vma yourself. GPU drivers will definitely want to know when importing, so i might add a flag so that HMM can report this special condition and they do not have to look up the vma themselves.
Again, if you do not care then just ignore everything here; it is handled by HMM and you do not have to worry one bit. If it worked with GUP it will work with HMM, and with those p2p patches it will even work against vma that are an mmap of a file and that set the p2p_map function.
Cheers, Jérôme
On Wed, Jan 30, 2019 at 05:30:27PM -0500, Jerome Glisse wrote:
What is the problem in the HMM mirror that it needs this restriction?
No restriction at all here. I think i just wasn't understood.
Are you talking about the exporting side - where the thing creating the VMA can really only put one distinct object into it?
Jason
On Wed, Jan 30, 2019 at 10:33:04PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 05:30:27PM -0500, Jerome Glisse wrote:
What is the problem in the HMM mirror that it needs this restriction?
No restriction at all here. I think i just wasn't understood.
Are you talking about the exporting side - where the thing creating the VMA can really only put one distinct object into it?
The message i was trying to get across is that HMM mirror will always succeed for everything* except for special vma, ie mmap of a device file. For those it can only succeed if a p2p_map() call succeeds.
So any user of HMM mirror might want to know why the mirroring failed: was it because something exceptional is happening? Or was it because i was trying to map a special vma which can be forbidden?
Hence why i assume that you might want to know about such a p2p_map failure at the time you create the umem odp object, as it might be a failure you want to report differently and handle differently. If you do not care about differentiating OOM or an exceptional failure from a p2p_map failure then you have nothing to worry about; you will get the same error from HMM for both.
Cheers, Jérôme
* Everything except when there are exceptional conditions like OOM or poisoned memory.
On Wed, Jan 30, 2019 at 05:47:05PM -0500, Jerome Glisse wrote:
On Wed, Jan 30, 2019 at 10:33:04PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 05:30:27PM -0500, Jerome Glisse wrote:
What is the problem in the HMM mirror that it needs this restriction?
No restriction at all here. I think i just wasn't understood.
Are you are talking about from the exporting side - where the thing creating the VMA can really only put one distinct object into it?
The message i was trying to get across is that HMM mirror will always succeed for everything* except for special vma, ie mmap of a device file. For those it can only succeed if a p2p_map() call succeeds.
So any user of HMM mirror might want to know why the mirroring failed: was it because something exceptional is happening? Or was it because i was trying to map a special vma which can be forbidden?
Hence why i assume that you might want to know about such a p2p_map failure at the time you create the umem odp object, as it might be a failure you want to report differently and handle differently. If you do not care about differentiating OOM or an exceptional failure from a p2p_map failure then you have nothing to worry about; you will get the same error from HMM for both.
I think my hope here was that we could have some kind of 'trial' interface where very early users can call 'hmm_mirror_is_maybe_supported(dev, user_ptr, len)' and get a failure indication.
We probably wouldn't call this on the full address space though
Beyond that it is just inevitable there can be problems faulting if the memory map is messed with after MR is created.
And here again, I don't want to worry about any particular VMA boundaries..
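For what it's worth, such a 'trial' helper could be little more than a walk over the vmas in the range, along the lines of the sketch below (this function does not exist; the signature is simplified to take an mm instead of a device, and it only illustrates the idea):

/* Sketch only: does every vma in [addr, addr + len) look mirrorable?
 * Special vma (mmap of a device file) additionally need a p2p_map
 * operation from the exporting driver.  Holes between vmas are
 * ignored here for brevity.
 */
static bool hmm_mirror_is_maybe_supported(struct mm_struct *mm,
                                          unsigned long addr,
                                          unsigned long len)
{
        unsigned long end = addr + len;
        struct vm_area_struct *vma;

        for (vma = find_vma(mm, addr); vma && vma->vm_start < end;
             vma = vma->vm_next) {
                if (vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP)) {
                        if (!vma->vm_ops || !vma->vm_ops->p2p_map)
                                return false;   /* exporter does not allow p2p */
                }
        }
        return true;
}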
Jason
On Wed, Jan 30, 2019 at 10:51:55PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 05:47:05PM -0500, Jerome Glisse wrote:
On Wed, Jan 30, 2019 at 10:33:04PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 05:30:27PM -0500, Jerome Glisse wrote:
What is the problem in the HMM mirror that it needs this restriction?
No restriction at all here. I think i just wasn't understood.
Are you are talking about from the exporting side - where the thing creating the VMA can really only put one distinct object into it?
The message i was trying to get accross is that HMM mirror will always succeed for everything* except for special vma ie mmap of device file. For those it can only succeed if a p2p_map() call succeed.
So any user of HMM mirror might to know why the mirroring fail ie was it because something exceptional is happening ? Or is it because i was trying to map a special vma which can be forbiden.
Hence why i assume that you might want to know about such p2p_map failure at the time you create the umem odp object as it might be some failure you might want to report differently and handle differently. If you do not care about differentiating OOM or exceptional failure from p2p_map failure than you have nothing to worry about you will get the same error from HMM for both.
I think my hope here was that we could have some kind of 'trial' interface where very eary users can call 'hmm_mirror_is_maybe_supported(dev, user_ptr, len)' and get a failure indication.
We probably wouldn't call this on the full address space though
Yes, we can do a special wrapper around the general case that allows the caller to differentiate failures. So at creation you call the special flavor and get a proper distinction between errors. Afterward, during normal operation, any failure is treated the same way no matter what the reason is (munmap, mremap, mprotect, ...).
Beyond that it is just inevitable there can be problems faulting if the memory map is messed with after MR is created.
And here again, I don't want to worry about any particular VMA boundaries..
You do not have to worry about boundaries: HMM will return -EFAULT if there is no valid vma behind the address you are trying to map (or if the vma prot does not allow you to access it). You can then handle that failure just like you do now, which my ODP HMM patch preserves.
Cheers, Jérôme
On 2019-01-30 12:22 p.m., Jerome Glisse wrote:
On Wed, Jan 30, 2019 at 06:56:59PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 10:17:27AM -0700, Logan Gunthorpe wrote:
On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
Every attempt to give BAR memory to struct page has run into major trouble, IMHO, so I like that this approach avoids that.
And if you don't have struct page then the only kernel object left to hang meta data off is the VMA itself.
It seems very similar to the existing P2P work between in-kernel consumers, just that VMA is now mediating a general user space driven discovery process instead of being hard wired into a driver.
But the kernel now has P2P bars backed by struct pages and it works well.
I don't think it works that well..
We ended up with a 'sgl' that is not really a sgl and doesn't work with many of the common SGL patterns. sg_copy_buffer doesn't work, dma_map doesn't work, sg_page doesn't work quite right, etc.
Only nvme and rdma got the special hacks to make them understand these p2p-sgls, and I'm still not convinced some of the RDMA drivers that want access to CPU addresses from the SGL (rxe, usnic, hfi, qib) don't break in this scenario.
Since the SGLs become broken, it pretty much means there is no path to make GUP work generically, we have to go through and make everything safe to use with p2p-sgls before allowing GUP. Which, frankly, sounds impossible with all the competing objections.
But GPU seems to have a problem unrelated to this - what Jerome wants is to have two faulting domains for VMA's - visible-to-cpu and visible-to-dma. The new op is essentially faulting the pages into the visible-to-dma category and leaving them invisible-to-cpu.
So that duality would still have to exists, and I think p2p_map/unmap is a much simpler implementation than trying to create some kind of special PTE in the VMA..
At least for RDMA, struct page or not doesn't really matter.
We can make struct pages for the BAR the same way NVMe does. GPU is probably the same, just with more memory at stake?
And maybe this should be the first implementation. The p2p_map VMA operation should return a SGL and the caller should do the existing pci_p2pdma_map_sg() flow..
For GPU it would not work: the GPU might want to use main memory (because it is running out of BAR space), so it is a lot easier if the p2p_map callback calls the right dma map function (for page or io) rather than having to define some format that would pass the information down.
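Concretely, inside the exporter's p2p_map the per-page choice could look roughly like this (sketch only; the mydev_* helpers are invented):

/* Sketch: the exporter knows, per object page, whether it currently
 * lives in device memory (reachable through a BAR) or has spilled to
 * main memory, and picks the matching dma_map_* call.
 */
static dma_addr_t mydev_p2p_map_one(struct mydev_object *obj,
                                    struct device *importer,
                                    unsigned long index)
{
        if (mydev_page_in_vram(obj, index)) {
                /* BAR (MMIO) backing: no struct page, map the bus address. */
                return dma_map_resource(importer,
                                        mydev_page_bar_addr(obj, index),
                                        PAGE_SIZE, DMA_BIDIRECTIONAL, 0);
        }
        /* Main memory backing: regular struct page path. */
        return dma_map_page(importer, mydev_page_to_page(obj, index),
                            0, PAGE_SIZE, DMA_BIDIRECTIONAL);
}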
Worry about optimizing away the struct page overhead later?
Struct page does not fit well for GPU, as the BAR address can be reprogrammed to point to any page inside the device memory (think 256MB BAR versus 16GB device memory). Forcing struct page on a GPU driver would require major surgery to the GPU driver's inner workings, and there is no benefit to be had from the struct page. So it is hard to justify this.
I think we have to consider the struct pages to track the address space, not what backs it (essentially what HMM is doing). If we need to add operations for the driver to map the address space/struct pages back to physical memory then do that. Creating a whole new idea that's tied to userspace VMAs still seems wrong to me.
Logan
On Wed, Jan 30, 2019 at 12:52:44PM -0700, Logan Gunthorpe wrote:
On 2019-01-30 12:22 p.m., Jerome Glisse wrote:
On Wed, Jan 30, 2019 at 06:56:59PM +0000, Jason Gunthorpe wrote:
On Wed, Jan 30, 2019 at 10:17:27AM -0700, Logan Gunthorpe wrote:
On 2019-01-29 9:18 p.m., Jason Gunthorpe wrote:
Every attempt to give BAR memory to struct page has run into major trouble, IMHO, so I like that this approach avoids that.
And if you don't have struct page then the only kernel object left to hang meta data off is the VMA itself.
It seems very similar to the existing P2P work between in-kernel consumers, just that VMA is now mediating a general user space driven discovery process instead of being hard wired into a driver.
But the kernel now has P2P bars backed by struct pages and it works well.
I don't think it works that well..
We ended up with a 'sgl' that is not really a sgl and doesn't work with many of the common SGL patterns. sg_copy_buffer doesn't work, dma_map doesn't work, sg_page doesn't work quite right, etc.
Only nvme and rdma got the special hacks to make them understand these p2p-sgls, and I'm still not convinced some of the RDMA drivers that want access to CPU addresses from the SGL (rxe, usnic, hfi, qib) don't break in this scenario.
Since the SGLs become broken, it pretty much means there is no path to make GUP work generically, we have to go through and make everything safe to use with p2p-sgls before allowing GUP. Which, frankly, sounds impossible with all the competing objections.
But GPU seems to have a problem unrelated to this - what Jerome wants is to have two faulting domains for VMA's - visible-to-cpu and visible-to-dma. The new op is essentially faulting the pages into the visible-to-dma category and leaving them invisible-to-cpu.
So that duality would still have to exists, and I think p2p_map/unmap is a much simpler implementation than trying to create some kind of special PTE in the VMA..
At least for RDMA, struct page or not doesn't really matter.
We can make struct pages for the BAR the same way NVMe does. GPU is probably the same, just with more memory at stake?
And maybe this should be the first implementation. The p2p_map VMA operation should return a SGL and the caller should do the existing pci_p2pdma_map_sg() flow..
For GPU it would not work: the GPU might want to use main memory (because it is running out of BAR space), so it is a lot easier if the p2p_map callback calls the right dma map function (for page or io) rather than having to define some format that would pass the information down.
Worry about optimizing away the struct page overhead later?
Struct page does not fit well for GPU, as the BAR address can be reprogrammed to point to any page inside the device memory (think 256MB BAR versus 16GB device memory). Forcing struct page on a GPU driver would require major surgery to the GPU driver's inner workings, and there is no benefit to be had from the struct page. So it is hard to justify this.
I think we have to consider the struct pages to track the address space, not what backs it (essentially what HMM is doing). If we need to add operations for the driver to map the address space/struct pages back to physical memory then do that. Creating a whole new idea that's tied to userspace VMAs still seems wrong to me.
VMA is the object RDMA works on, and GPU drivers have been working with VMA too, where a VMA is tied to only one specific GPU object. So the most disruptive approach here is using struct page. It was never used and will not be used in many drivers. Updating those to struct page is too risky and too many changes. The vma callback is something you can remove at any time if you have something better that does not need major surgery to GPU drivers.
Cheers, Jérôme
On Tue, Jan 29, 2019 at 01:39:49PM -0700, Logan Gunthorpe wrote:
implement the mapping. And I don't think we should have 'special' vma's for this (though we may need something to ensure we don't get mapping requests mixed with different types of pages...).
I think Jerome explained the point here is to have a 'special vma' rather than a 'special struct page' as, really, we don't need a struct page at all to make this work.
If I recall your earlier attempts at adding struct page for BAR memory, it ran aground on issues related to O_DIRECT/sgls, etc, etc.
This does seem to avoid that pitfall entirely as we can never accidentally get into the SGL system with this kind of memory or VMA?
Jason
From: Jérôme Glisse jglisse@redhat.com
Signed-off-by: Jérôme Glisse jglisse@redhat.com
Cc: Logan Gunthorpe logang@deltatee.com
Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org
Cc: Rafael J. Wysocki rafael@kernel.org
Cc: Bjorn Helgaas bhelgaas@google.com
Cc: Christian Koenig christian.koenig@amd.com
Cc: Felix Kuehling Felix.Kuehling@amd.com
Cc: Jason Gunthorpe jgg@mellanox.com
Cc: linux-pci@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: Christoph Hellwig hch@lst.de
Cc: Marek Szyprowski m.szyprowski@samsung.com
Cc: Robin Murphy robin.murphy@arm.com
Cc: Joerg Roedel jroedel@suse.de
Cc: iommu@lists.linux-foundation.org
---
 include/linux/hmm.h | 47 +++++++++++++++++++++++++++++++++
 mm/hmm.c            | 63 +++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 105 insertions(+), 5 deletions(-)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h index 4a1454e3efba..7a3ac182cc48 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -710,6 +710,53 @@ struct hmm_devmem_ops { const struct page *page, unsigned int flags, pmd_t *pmdp); + + /* + * p2p_map() - map page for peer to peer between device + * @devmem: device memory structure (see struct hmm_devmem) + * @range: range of virtual address that is being mapped + * @device: device the range is being map to + * @addr: first virtual address in the range to consider + * @pa: device address (where actual mapping is store) + * Returns: number of page successfuly mapped, 0 otherwise + * + * Map page belonging to devmem to another device for peer to peer + * access. Device can decide not to map in which case memory will + * be migrated to main memory. + * + * Also there is no garantee that all the pages in the range does + * belongs to the devmem so it is up to the function to check that + * every single page does belong to devmem. + * + * Note for now we do not care about error exect error, so on failure + * function should just return 0. + */ + long (*p2p_map)(struct hmm_devmem *devmem, + struct hmm_range *range, + struct device *device, + unsigned long addr, + dma_addr_t *pas); + + /* + * p2p_unmap() - unmap page from peer to peer between device + * @devmem: device memory structure (see struct hmm_devmem) + * @range: range of virtual address that is being mapped + * @device: device the range is being map to + * @addr: first virtual address in the range to consider + * @pa: device address (where actual mapping is store) + * Returns: number of page successfuly unmapped, 0 otherwise + * + * Unmap page belonging to devmem previously map with p2p_map(). + * + * Note there is no garantee that all the pages in the range does + * belongs to the devmem so it is up to the function to check that + * every single page does belong to devmem. + */ + unsigned long (*p2p_unmap)(struct hmm_devmem *devmem, + struct hmm_range *range, + struct device *device, + unsigned long addr, + dma_addr_t *pas); };
/* diff --git a/mm/hmm.c b/mm/hmm.c index 1a444885404e..fd49b1e116d0 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -1193,16 +1193,19 @@ long hmm_range_dma_map(struct hmm_range *range, dma_addr_t *daddrs, bool block) { - unsigned long i, npages, mapped, page_size; + unsigned long i, npages, mapped, page_size, addr; long ret;
+again: ret = hmm_range_fault(range, block); if (ret <= 0) return ret ? ret : -EBUSY;
+ mapped = 0; + addr = range->start; page_size = hmm_range_page_size(range); npages = (range->end - range->start) >> range->page_shift; - for (i = 0, mapped = 0; i < npages; ++i) { + for (i = 0; i < npages; ++i, addr += page_size) { enum dma_data_direction dir = DMA_FROM_DEVICE; struct page *page;
@@ -1226,6 +1229,29 @@ long hmm_range_dma_map(struct hmm_range *range, goto unmap; }
+ if (is_device_private_page(page)) { + struct hmm_devmem *devmem = page->pgmap->data; + + if (!devmem->ops->p2p_map || !devmem->ops->p2p_unmap) { + /* Fall-back to main memory. */ + range->default_flags |= + range->flags[HMM_PFN_DEVICE_PRIVATE]; + goto again; + } + + ret = devmem->ops->p2p_map(devmem, range, device, + addr, daddrs); + if (ret <= 0) { + /* Fall-back to main memory. */ + range->default_flags |= + range->flags[HMM_PFN_DEVICE_PRIVATE]; + goto again; + } + mapped += ret; + i += ret; + continue; + } + /* If it is read and write than map bi-directional. */ if (range->pfns[i] & range->values[HMM_PFN_WRITE]) dir = DMA_BIDIRECTIONAL; @@ -1242,7 +1268,9 @@ long hmm_range_dma_map(struct hmm_range *range, return mapped;
unmap: - for (npages = i, i = 0; (i < npages) && mapped; ++i) { + npages = i; + addr = range->start; + for (i = 0; (i < npages) && mapped; ++i, addr += page_size) { enum dma_data_direction dir = DMA_FROM_DEVICE; struct page *page;
@@ -1253,6 +1281,18 @@ long hmm_range_dma_map(struct hmm_range *range, if (dma_mapping_error(device, daddrs[i])) continue;
+ if (is_device_private_page(page)) { + struct hmm_devmem *devmem = page->pgmap->data; + unsigned long inc; + + inc = devmem->ops->p2p_unmap(devmem, range, device, + addr, &daddrs[i]); + BUG_ON(inc > npages); + mapped += inc; + i += inc; + continue; + } + /* If it is read and write than map bi-directional. */ if (range->pfns[i] & range->values[HMM_PFN_WRITE]) dir = DMA_BIDIRECTIONAL; @@ -1285,7 +1325,7 @@ long hmm_range_dma_unmap(struct hmm_range *range, dma_addr_t *daddrs, bool dirty) { - unsigned long i, npages, page_size; + unsigned long i, npages, page_size, addr; long cpages = 0;
/* Sanity check. */ @@ -1298,7 +1338,7 @@ long hmm_range_dma_unmap(struct hmm_range *range,
page_size = hmm_range_page_size(range); npages = (range->end - range->start) >> range->page_shift; - for (i = 0; i < npages; ++i) { + for (i = 0, addr = range->start; i < npages; ++i, addr += page_size) { enum dma_data_direction dir = DMA_FROM_DEVICE; struct page *page;
@@ -1318,6 +1358,19 @@ long hmm_range_dma_unmap(struct hmm_range *range, set_page_dirty(page); }
+ if (is_device_private_page(page)) { + struct hmm_devmem *devmem = page->pgmap->data; + unsigned long ret; + + BUG_ON(!devmem->ops->p2p_unmap); + + ret = devmem->ops->p2p_unmap(devmem, range, device, + addr, &daddrs[i]); + BUG_ON(ret > npages); + i += ret; + continue; + } + /* Unmap and clear pfns/dma address */ dma_unmap_page(device, daddrs[i], page_size, dir); range->pfns[i] = range->values[HMM_PFN_NONE];
From: Jérôme Glisse jglisse@redhat.com
Special device vma (mmap of a device file) can correspond to a device driver object that the device driver might want to share with other devices (giving them access to it). This adds support for HMM to map those special device vma if the owning device (exporter) allows it.
Signed-off-by: Jérôme Glisse jglisse@redhat.com
Cc: Logan Gunthorpe logang@deltatee.com
Cc: Greg Kroah-Hartman gregkh@linuxfoundation.org
Cc: Rafael J. Wysocki rafael@kernel.org
Cc: Bjorn Helgaas bhelgaas@google.com
Cc: Christian Koenig christian.koenig@amd.com
Cc: Felix Kuehling Felix.Kuehling@amd.com
Cc: Jason Gunthorpe jgg@mellanox.com
Cc: linux-pci@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: Christoph Hellwig hch@lst.de
Cc: Marek Szyprowski m.szyprowski@samsung.com
Cc: Robin Murphy robin.murphy@arm.com
Cc: Joerg Roedel jroedel@suse.de
Cc: iommu@lists.linux-foundation.org
---
 include/linux/hmm.h |   6 ++
 mm/hmm.c            | 156 ++++++++++++++++++++++++++++++++++----------
 2 files changed, 128 insertions(+), 34 deletions(-)
diff --git a/include/linux/hmm.h b/include/linux/hmm.h index 7a3ac182cc48..98ebe9f52432 100644 --- a/include/linux/hmm.h +++ b/include/linux/hmm.h @@ -137,6 +137,7 @@ enum hmm_pfn_flag_e { * result of vmf_insert_pfn() or vm_insert_page(). Therefore, it should not * be mirrored by a device, because the entry will never have HMM_PFN_VALID * set and the pfn value is undefined. + * HMM_PFN_P2P: this entry have been map as P2P ie the dma address is valid * * Driver provide entry value for none entry, error entry and special entry, * driver can alias (ie use same value for error and special for instance). It @@ -151,6 +152,7 @@ enum hmm_pfn_value_e { HMM_PFN_ERROR, HMM_PFN_NONE, HMM_PFN_SPECIAL, + HMM_PFN_P2P, HMM_PFN_VALUE_MAX };
@@ -250,6 +252,8 @@ static inline bool hmm_range_valid(struct hmm_range *range) static inline struct page *hmm_pfn_to_page(const struct hmm_range *range, uint64_t pfn) { + if (pfn == range->values[HMM_PFN_P2P]) + return NULL; if (pfn == range->values[HMM_PFN_NONE]) return NULL; if (pfn == range->values[HMM_PFN_ERROR]) @@ -270,6 +274,8 @@ static inline struct page *hmm_pfn_to_page(const struct hmm_range *range, static inline unsigned long hmm_pfn_to_pfn(const struct hmm_range *range, uint64_t pfn) { + if (pfn == range->values[HMM_PFN_P2P]) + return -1UL; if (pfn == range->values[HMM_PFN_NONE]) return -1UL; if (pfn == range->values[HMM_PFN_ERROR]) diff --git a/mm/hmm.c b/mm/hmm.c index fd49b1e116d0..621a4f831483 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -1058,37 +1058,36 @@ long hmm_range_snapshot(struct hmm_range *range) } EXPORT_SYMBOL(hmm_range_snapshot);
-/* - * hmm_range_fault() - try to fault some address in a virtual address range - * @range: range being faulted - * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem) - * Returns: 0 on success ortherwise: - * -EINVAL: - * Invalid argument - * -ENOMEM: - * Out of memory. - * -EPERM: - * Invalid permission (for instance asking for write and range - * is read only). - * -EAGAIN: - * If you need to retry and mmap_sem was drop. This can only - * happens if block argument is false. - * -EBUSY: - * If the the range is being invalidated and you should wait for - * invalidation to finish. - * -EFAULT: - * Invalid (ie either no valid vma or it is illegal to access that - * range), number of valid pages in range->pfns[] (from range start - * address). - * - * This is similar to a regular CPU page fault except that it will not trigger - * any memory migration if the memory being faulted is not accessible by CPUs - * and caller does not ask for migration. - * - * On error, for one virtual address in the range, the function will mark the - * corresponding HMM pfn entry with an error flag. - */ -long hmm_range_fault(struct hmm_range *range, bool block) +static int hmm_vma_p2p_map(struct hmm_range *range, struct vm_area_struct *vma, + unsigned long start, unsigned long end, + struct device *device, dma_addr_t *pas) +{ + struct hmm_vma_walk hmm_vma_walk; + unsigned long npages, i; + bool fault, write; + uint64_t *pfns; + int ret; + + i = (start - range->start) >> PAGE_SHIFT; + npages = (end - start) >> PAGE_SHIFT; + pfns = &range->pfns[i]; + pas = &pas[i]; + + hmm_vma_walk.range = range; + hmm_vma_walk.fault = true; + hmm_range_need_fault(&hmm_vma_walk, pfns, npages, + 0, &fault, &write); + + ret = vma->vm_ops->p2p_map(vma, device, start, end, pas, write); + for (i = 0; i < npages; ++i) { + pfns[i] = ret ? range->values[HMM_PFN_ERROR] : + range->values[HMM_PFN_P2P]; + } + return ret; +} + +static long _hmm_range_fault(struct hmm_range *range, bool block, + struct device *device, dma_addr_t *pas) { const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP; unsigned long start = range->start, end; @@ -1110,9 +1109,22 @@ long hmm_range_fault(struct hmm_range *range, bool block) }
vma = find_vma(hmm->mm, start); - if (vma == NULL || (vma->vm_flags & device_vma)) + if (vma == NULL) return -EFAULT;
+ end = min(range->end, vma->vm_end); + if (vma->vm_flags & device_vma) { + if (!device || !pas || !vma->vm_ops->p2p_map) + return -EFAULT; + + ret = hmm_vma_p2p_map(range, vma, start, + end, device, pas); + if (ret) + return ret; + start = end; + continue; + } + if (is_vm_hugetlb_page(vma)) { struct hstate *h = hstate_vma(vma);
@@ -1142,7 +1154,6 @@ long hmm_range_fault(struct hmm_range *range, bool block) hmm_vma_walk.block = block; hmm_vma_walk.range = range; mm_walk.private = &hmm_vma_walk; - end = min(range->end, vma->vm_end);
mm_walk.vma = vma; mm_walk.mm = vma->vm_mm; @@ -1175,6 +1186,41 @@ long hmm_range_fault(struct hmm_range *range, bool block)
return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT; } + +/* + * hmm_range_fault() - try to fault some address in a virtual address range + * @range: range being faulted + * @block: allow blocking on fault (if true it sleeps and do not drop mmap_sem) + * Returns: 0 on success ortherwise: + * -EINVAL: + * Invalid argument + * -ENOMEM: + * Out of memory. + * -EPERM: + * Invalid permission (for instance asking for write and range + * is read only). + * -EAGAIN: + * If you need to retry and mmap_sem was drop. This can only + * happens if block argument is false. + * -EBUSY: + * If the the range is being invalidated and you should wait for + * invalidation to finish. + * -EFAULT: + * Invalid (ie either no valid vma or it is illegal to access that + * range), number of valid pages in range->pfns[] (from range start + * address). + * + * This is similar to a regular CPU page fault except that it will not trigger + * any memory migration if the memory being faulted is not accessible by CPUs + * and caller does not ask for migration. + * + * On error, for one virtual address in the range, the function will mark the + * corresponding HMM pfn entry with an error flag. + */ +long hmm_range_fault(struct hmm_range *range, bool block) +{ + return _hmm_range_fault(range, block, NULL, NULL); +} EXPORT_SYMBOL(hmm_range_fault);
/* @@ -1197,7 +1243,7 @@ long hmm_range_dma_map(struct hmm_range *range, long ret;
again: - ret = hmm_range_fault(range, block); + ret = _hmm_range_fault(range, block, device, daddrs); if (ret <= 0) return ret ? ret : -EBUSY;
@@ -1209,6 +1255,11 @@ long hmm_range_dma_map(struct hmm_range *range, enum dma_data_direction dir = DMA_FROM_DEVICE; struct page *page;
+ if (range->pfns[i] == range->values[HMM_PFN_P2P]) { + mapped++; + continue; + } + /* * FIXME need to update DMA API to provide invalid DMA address * value instead of a function to test dma address value. This @@ -1274,6 +1325,11 @@ long hmm_range_dma_map(struct hmm_range *range, enum dma_data_direction dir = DMA_FROM_DEVICE; struct page *page;
+ if (range->pfns[i] == range->values[HMM_PFN_P2P]) { + mapped--; + continue; + } + page = hmm_pfn_to_page(range, range->pfns[i]); if (page == NULL) continue; @@ -1305,6 +1361,30 @@ long hmm_range_dma_map(struct hmm_range *range, } EXPORT_SYMBOL(hmm_range_dma_map);
+static unsigned long hmm_vma_p2p_unmap(struct hmm_range *range, + struct vm_area_struct *vma, + unsigned long start, + struct device *device, + dma_addr_t *pas) +{ + unsigned long end; + + if (!vma) { + BUG(); + return 1; + } + + start &= PAGE_MASK; + if (start < vma->vm_start || start >= vma->vm_end) { + BUG(); + return 1; + } + + end = min(range->end, vma->vm_end); + vma->vm_ops->p2p_unmap(vma, device, start, end, pas); + return (end - start) >> PAGE_SHIFT; +} + /* * hmm_range_dma_unmap() - unmap range of that was map with hmm_range_dma_map() * @range: range being unmapped @@ -1342,6 +1422,14 @@ long hmm_range_dma_unmap(struct hmm_range *range, enum dma_data_direction dir = DMA_FROM_DEVICE; struct page *page;
+ if (range->pfns[i] == range->values[HMM_PFN_P2P]) { + BUG_ON(!vma); + cpages += hmm_vma_p2p_unmap(range, vma, addr, + device, &daddrs[i]); + i += cpages - 1; + continue; + } + page = hmm_pfn_to_page(range, range->pfns[i]); if (page == NULL) continue;
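For completeness, the exporting driver side of this would hang the new hooks off its vm_operations_struct, roughly as below (sketch only; the prototypes are inferred from the call sites in the hunk above, and the mydev_* helpers are invented):

/* Sketch of an exporting driver honouring the new special-vma hooks.
 * Returning an error refuses the mapping; policy stays entirely with
 * the exporter.
 */
static int mydev_vma_p2p_map(struct vm_area_struct *vma,
                             struct device *importer,
                             unsigned long start, unsigned long end,
                             dma_addr_t *pas, bool write)
{
        struct mydev_object *obj = vma->vm_private_data;

        if (write && !mydev_object_writable(obj))
                return -EPERM;          /* exporter policy: refuse */

        return mydev_object_dma_map(obj, importer, start, end, pas);
}

static void mydev_vma_p2p_unmap(struct vm_area_struct *vma,
                                struct device *importer,
                                unsigned long start, unsigned long end,
                                dma_addr_t *pas)
{
        mydev_object_dma_unmap(vma->vm_private_data, importer,
                               start, end, pas);
}

static const struct vm_operations_struct mydev_vm_ops = {
        /* ...existing open/close/fault callbacks... */
        .p2p_map   = mydev_vma_p2p_map,
        .p2p_unmap = mydev_vma_p2p_unmap,
};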