On Wed, Jan 30, 2019 at 09:00:06AM +0100, Christoph Hellwig wrote:
GUP doesn't work anyhow today, and won't work with BAR struct pages in the foreseeable future (Logan has sent attempts at this before).
So nothing seems lost..
So you can't do direct I/O to your remapped BAR, you can't create MRs on it, etc, etc.
Jerome made the HMM mirror API use this flow, so after his patch to switch the ODP MR to use HMM, and to switch GPU drivers, it will work for those cases. Which is more than the zero cases we have today :)
Jason
On 2019-01-30 12:06 p.m., Jason Gunthorpe wrote:
I don't recall ever attempting that... But patching GUP for special pages or VMAs, or working around it by not calling it in some cases, seems like the thing that's going to need to be done one way or another.
But we're getting the same bait and switch here... If you are using HMM you are using struct pages, but we're told we need this special VMA hack for cases that don't use HMM and thus don't have struct pages...
Logan
On Wed, Jan 30, 2019 at 12:45:46PM -0700, Logan Gunthorpe wrote:
Remember, the long discussion we had about how to get the IOMEM annotation into SGL? That is a necessary pre-condition to doing anything with GUP in DMA using drivers as GUP -> SGL -> DMA map is pretty much the standard flow.
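The "standard flow" mentioned above can be sketched roughly like this. This is an illustrative composite, not code from any driver; the helper name my_map_user_buffer and the error handling are made up, and the exact GUP entry point varies by kernel version:

```c
/* Sketch of the GUP -> SGL -> DMA map flow (hypothetical helper).
 * Error unwinding is elided for brevity.
 */
static int my_map_user_buffer(struct device *dev, unsigned long uaddr,
			      size_t len, struct sg_table *sgt)
{
	unsigned int npages = DIV_ROUND_UP(len + (uaddr & ~PAGE_MASK),
					   PAGE_SIZE);
	struct page **pages;
	int pinned, ret;

	pages = kvmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* 1. GUP: pin the user pages backing the buffer */
	pinned = get_user_pages_fast(uaddr, npages, FOLL_WRITE, pages);
	if (pinned < 0)
		return pinned;

	/* 2. SGL: build a scatterlist over the pinned pages */
	ret = sg_alloc_table_from_pages(sgt, pages, pinned,
					uaddr & ~PAGE_MASK, len, GFP_KERNEL);
	if (ret)
		return ret;

	/* 3. DMA map: get bus addresses for the device */
	ret = dma_map_sg(dev, sgt->sgl, sgt->orig_nents, DMA_BIDIRECTIONAL);
	return ret ? 0 : -EIO;
}
```

It is exactly step 1 that falls over for BAR memory, since GUP assumes it is returning ordinary CPU pages.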
Well, I don't know much about HMM, but the HMM mirror API looks like a version of MMU notifiers that offloads a bunch of dreck to the HMM code core instead of drivers. The RDMA code got hundreds of lines shorter by using it.
Some of that dreck is obtaining a DMA address for the user VMAs, including using multiple paths to get them. A driver using HMM mirror doesn't seem to call GUP at all, HMM mirror handles that, along with various special cases, including calling out to these new VMA ops.
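For reference, the mirror registration from that era looks roughly like the sketch below. The callback body is illustrative only, and the hmm_mirror_ops signatures changed across kernel versions, so treat this as an approximation:

```c
/* Approximate shape of HMM mirror registration (~4.20/5.0 era).
 * The driver supplies an invalidation callback instead of calling
 * GUP itself; HMM core drives the page table walking.
 */
static void my_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
					  const struct hmm_update *update)
{
	/* CPU page tables changed: tear down the device mappings
	 * covering [update->start, update->end). */
}

static const struct hmm_mirror_ops my_mirror_ops = {
	.sync_cpu_device_pagetables = my_sync_cpu_device_pagetables,
};

/* ... at init time: */
/* my_mirror.ops = &my_mirror_ops; */
/* hmm_mirror_register(&my_mirror, current->mm); */
```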
I don't really know how mirror relates to other parts of HMM, like the bits that use struct pages. Maybe it also knows about more special cases created by other parts of HMM?
So, I see Jerome solving the GUP problem by replacing GUP entirely using an API that is more suited to what these sorts of drivers actually need.
Jason
On 2019-01-30 12:59 p.m., Jason Gunthorpe wrote:
Yes, but that was unrelated to GUP even if that might have been the eventual direction.
And I feel the GUP->SGL->DMA flow should still be what we are aiming for. Even if we need a special GUP for special pages, and a special DMA map; and the SGL still has to be homogenous....
Yes, this is what I'm expecting and what I want. Not bypassing the whole thing by doing special things with VMAs.
Logan
On Wed, Jan 30, 2019 at 02:01:35PM -0700, Logan Gunthorpe wrote:
*shrug* so what if the special GUP called a VMA op instead of traversing the VMA PTEs today? Why does it really matter? It could easily change to a struct page flow tomorrow..
IMHO struct page is a big pain for this application, and if we can build flows that don't actually need it then we shouldn't require it just because the old flows needed it.
HMM mirror is a new flow that doesn't need struct page.
Would you feel better if this also came along with a:
struct dma_sg_table *sgl_dma_map_user(struct device *dma_device, void __user *ptr, size_t len)
flow which returns a *DMA MAPPED* sgl that does not have struct page pointers as another interface?
We can certainly call an API like this from RDMA for non-ODP MRs.
Eliminating the page pointers also eliminates the __iomem problem. However this sgl object is not copyable or accessible from the CPU, so the caller must be sure it doesn't need CPU access when using this API.
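To make the contract concrete, a caller of this proposed interface would look something like the sketch below. sgl_dma_map_user() and struct dma_sg_table do not exist in the kernel; this only illustrates the intended usage:

```c
/* Hypothetical caller of the proposed DMA-only SGL API.  The key
 * property: the returned table is already DMA mapped and carries no
 * struct page pointers, so it must never be touched by the CPU.
 */
struct dma_sg_table *sgt;

sgt = sgl_dma_map_user(ibdev->dma_device, user_ptr, user_len);
if (IS_ERR(sgt))
	return PTR_ERR(sgt);

/* Program the hardware with sgt's DMA addresses only; no kmap,
 * no memcpy, no CPU dereference of anything behind this table. */
```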
For RDMA I'd include some flag in the struct ib_device if the driver requires CPU accessible SGLs and call the right API. Maybe the block layer could do the same trick for O_DIRECT?
This would also directly solve the P2P problem with hfi1/qib/rxe, as I'd likely also say that pci_p2pdma_map_sg() returns the same DMA only sgl thing.
Jason
On 2019-01-30 2:50 p.m., Jason Gunthorpe wrote:
Well it's so that it's composable. We want the SGL->DMA side to work for APIs from kernel space and not have to run a completely different flow for kernel drivers than from userspace memory.
For GUP to do a special VMA traversal it would now need to return something besides struct pages which means no SGL and it means a completely different DMA mapping call.
That seems like a nice API. But certainly the implementation would need to use existing dma_map or pci_p2pdma_map calls, or whatever as part of it...
We actually stopped caring about the __iomem problem. We are working under the assumption that pages returned by devm_memremap_pages() can be accessed as normal RAM and do not need the __iomem designation. The main problem now is that code paths need to know whether to use pci_p2pdma_map or not. In theory this could be pushed into the regular dma_map implementations, but we'd have to get it into all of them, which is a pain.
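The "code paths need to know" problem boils down to a branch like the one below. is_pci_p2pdma_page() and pci_p2pdma_map_sg() are the real helpers from the 4.20-era p2pdma work; the surrounding logic is a sketch of what every mapping site currently has to do:

```c
/* Sketch: a caller whose scatterlist may hold either ordinary RAM
 * pages or P2P BAR pages must pick the mapping call itself, since
 * dma_map_sg() knows nothing about p2pdma memory.
 */
int nents;

if (is_pci_p2pdma_page(sg_page(sgt->sgl)))
	nents = pci_p2pdma_map_sg(dev, sgt->sgl, sgt->orig_nents,
				  DMA_TO_DEVICE);
else
	nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents,
			   DMA_TO_DEVICE);
if (!nents)
	return -EIO;
```

Pushing this branch down into every arch's dma_map implementation would remove it from the callers, but that is the part that is painful to land everywhere.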
Logan
On Wed, Jan 30, 2019 at 03:52:13PM -0700, Logan Gunthorpe wrote:
If we want to have these DMA-only SGLs anyhow, then the kernel flow can use them too.
In the kernel it is easier because the 'exporter' already knows it is working with BAR memory, so it can just do something like this:
struct dma_sg_table *sgl_dma_map_pci_bar(struct pci_device *from, struct device *to, unsigned long bar_ptr, size_t length)
And then it falls down the same DMA-SGL-only kind of flow that would exist to support the user side. ie it is the kernel version of the API I described below.
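A kernel exporter would then use it along these lines. Again, sgl_dma_map_pci_bar() is hypothetical; the sketch only shows that the exporter side needs no GUP at all, since it already holds the BAR address:

```c
/* Hypothetical kernel-side flow: an exporter (e.g. an NVMe driver
 * exposing CMB) maps a chunk of its own BAR for a peer device and
 * hands the resulting DMA-only SGL straight to that peer.
 */
struct dma_sg_table *sgt;

sgt = sgl_dma_map_pci_bar(exporter_pdev, &peer_dev->dev,
			  bar_ptr, length);
if (IS_ERR(sgt))
	return PTR_ERR(sgt);
```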
GUP cannot support BAR memory because it must only return CPU memory - I think too many callers make this assumption for it to be possible to change it.. (see below)
A new-GUP can return DMA addresses - so it doesn't have this problem.
I wonder how Jerome worked the translation, I haven't looked yet..
As far as CPU mapped uncached BAR memory goes, this is broadly not true.
s390x for instance requires dedicated CPU instructions to access BAR memory.
x86 will fault if you attempt to use an SSE algorithm on uncached BAR memory. The kernel has many SSE accelerated algorithms, so you can never pass these special p2p SGLs through to them either. (think parity or encryption transformations through the block stack)
Many platforms have limitations on alignment and access size for BAR memory - you can't just blindly call memcpy and expect it to work. (TPM is actually struggling with this now; confusingly, different versions of memcpy are giving different results on some x86 IO memory)
Other platforms might fault or corrupt if an unaligned access is attempted to BAR memory.
In short - I don't see a way to avoid knowing about __iomem in the sgl. There are too many real use cases that require this knowledge, and too many places that touch the SGL pages with the CPU.
I think we must have 'DMA only' SGLs and code paths that are known DMA-only clean to make it work properly with BAR memory.
Jason
dri-devel@lists.freedesktop.org