On Thu, Jan 31, 2019 at 09:13:55AM +0100, Christoph Hellwig wrote:
Something like this (and more) has always been the roadblock with trying to mix BAR memory into SGL. I think it is such a big problem as to be unsolvable in one step..
Struct page doesn't even really help anything beyond dma_map as we still can't pretend that __iomem is normal memory for general SGL users.
Jerome, how does this work anyhow? Did you do something to make the VMA lifetime match the p2p_map/unmap? Or can we get into a situation where the VMA is destroyed and the importing driver can't call the unmap anymore?
I know in the case of notifiers the VMA lifetime should be strictly longer than the map/unmap - but does this mean we can never support non-notifier users via this scheme?
(3) to make the PTEs dirty after writing to them. Again not sure what our preferred interface here would be
This need doesn't really apply to BAR memory..
I still think the right direction is to build on what Logan has done - realize that he created a DMA-only SGL - make that a formal type in the kernel and provide the right set of APIs to work with this type, without being forced to expose struct page.
Basically invert the API flow - the DMA map would be done close to GUP, not buried in the driver. This absolutely doesn't work for every flow we have, but it does enable the ones that people seem to care about when talking about P2P.
To get to where we are today we'd need a few new IB APIs, and some nvme changes to work with DMA-only SGLs and so forth, but that doesn't seem so bad. The API also seems much safer and more understandable than today's version that is trying to hope that the SGL is never touched by the CPU.
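Purely as a sketch - every name below is invented and the real API would need much more thought - the type and the inverted flow could look something like:

/*
 * Illustration only: a "DMA-only SGL" carries nothing but bus addresses
 * and lengths, so downstream code has no struct page and no way to touch
 * the memory from the CPU.  All names below are made up.
 */
#include <linux/types.h>
#include <linux/dma-mapping.h>

struct dma_sgl_entry {
	dma_addr_t	addr;
	u32		len;
};

struct dma_sgl {
	unsigned int		nents;
	struct dma_sgl_entry	ents[];
};

/*
 * Inverted flow: pin the user range and DMA map it right next to GUP,
 * instead of pushing struct pages down into the driver.  The importer
 * only ever sees the resulting bus addresses.
 */
struct dma_sgl *dma_sgl_map_user(struct device *dev, unsigned long start,
				 unsigned long nr_pages,
				 enum dma_data_direction dir);
void dma_sgl_unmap(struct device *dev, struct dma_sgl *sgl,
		   enum dma_data_direction dir);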
It also does present a path to solve some cases of the O_DIRECT problems if the block stack can develop some way to know if an IO will go down a DMA-only IO path or not... This seems less challenging than auditing every SGL user for iomem safety??
Yes we end up with a duality, but we already basically have that with the p2p flow today..
Jason
On 2019-01-31 12:02 p.m., Jason Gunthorpe wrote:
The DMA-only SGL will work for some use cases, but I think it's going to be a challenge for others. We care most about NVMe and, therefore, the block layer.
Given my understanding of the block layer, and its queuing infrastructure, I don't think having a DMA-only IO path makes sense. I think it has to be the same path, but with a special DMA-only bio; and endpoints would have to indicate support for that bio. I can't say I have a deep enough understanding of the block layer to know how possible that would be.
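Purely as an illustration - none of these flags exist today and the names and bit values are invented - the idea might look something like:

/*
 * Invented flags to illustrate a DMA-only bio that only capable
 * endpoints would accept; nothing like this exists in the block layer.
 */
#include <linux/bio.h>
#include <linux/blkdev.h>

#define QUEUE_FLAG_DMA_ONLY	29		/* made-up queue capability bit */
#define REQ_DMA_ONLY		(1U << 30)	/* made-up bio flag */

/* Payload of this bio is bus addresses only, never touchable by the CPU. */
static inline bool bio_is_dma_only(struct bio *bio)
{
	return bio->bi_opf & REQ_DMA_ONLY;
}

/* The endpoint driver would set this to advertise support. */
static inline bool queue_supports_dma_only(struct request_queue *q)
{
	return test_bit(QUEUE_FLAG_DMA_ONLY, &q->queue_flags);
}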
Logan
On Thu, Jan 31, 2019 at 12:19:31PM -0700, Logan Gunthorpe wrote:
The exercise here is not to enable O_DIRECT for P2P, it is to allow certain much simpler users to use P2P. We should not be saying that someone has to solve these complicated problems in the entire block stack just to make RDMA work. :(
If the block stack can use a 'dma sgl' or not, I don't know.
However, it does look like it fits these RDMA, GPU and VFIO cases fairly well, and looks better than the hacky sgl-but-really-special-p2p hack we have in RDMA today.
Jason
On Thu, Jan 31, 2019 at 07:02:15PM +0000, Jason Gunthorpe wrote:
So in this version the requirement is that the importer also has an mmu notifier registered, and that's what all GPU drivers do already. Any driver that maps some range of a VMA to a device should register itself as an mmu notifier listener to do something when the VMA goes away. I posted a patchset a while ago to allow listeners to differentiate when the VMA is going away from other types of invalidation [1].
With that in place you can easily handle the pin case. Drivers really need to do something when the VMA goes away, GUP or not, as the device would otherwise be reading from or writing to something that no longer matches anything in the process address space.
So a user that wants to pin would register a notifier, call p2p_map with the pin flag, and ignore all notifier callbacks except the unmap one. When that unmap callback happens they have the VMA, and they should call p2p_unmap from their invalidate callback and update their device to point at some dummy memory, or program it in a way that the userspace application will notice.
This can all be handled by some helpers so that drivers do not have to write more than five lines of code plus a function to update their device mapping to something of their choosing.
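As a rough sketch of that flow - the p2p_map()/p2p_unmap() signatures are guessed, and the pin flag, the "vma is going away" helper and the device hook are all invented:

/*
 * Rough sketch of the pin flow described above.  p2p_map()/p2p_unmap()
 * come from the patchset under discussion but their signatures here are
 * guessed; P2P_MAP_PIN, mmu_notifier_range_is_vma_teardown() (standing
 * in for the helper from [1]) and my_device_switch_to_dummy() are
 * invented.
 */
#include <linux/mmu_notifier.h>

struct my_importer {
	struct mmu_notifier	mn;
	struct vm_area_struct	*vma;
	struct device		*dev;
};

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *range)
{
	struct my_importer *imp = container_of(mn, struct my_importer, mn);

	/* Ignore every invalidation except the one tearing down the vma. */
	if (!mmu_notifier_range_is_vma_teardown(range))
		return 0;

	p2p_unmap(imp->vma, imp->dev);
	/* Point the device at dummy memory so userspace notices. */
	my_device_switch_to_dummy(imp->dev);
	return 0;
}

static const struct mmu_notifier_ops my_mn_ops = {
	.invalidate_range_start	= my_invalidate_range_start,
};

/* Importer setup: register the notifier, then map with the pin flag. */
static int my_importer_attach(struct my_importer *imp, struct mm_struct *mm)
{
	int ret;

	imp->mn.ops = &my_mn_ops;
	ret = mmu_notifier_register(&imp->mn, mm);
	if (ret)
		return ret;

	return p2p_map(imp->vma, imp->dev, P2P_MAP_PIN);
}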
This does not really work for GPUs; I do not want to have to rewrite GPU drivers for this. Struct page is a burden and it does not bring anything to the table. I would rather provide a one-stop shop for drivers to use this without having to worry about the difference between regular VMAs and special VMAs.
Note that in this patchset I reuse chunks of Logan's work and the intention is to also allow PCI struct page to work too. But it should not be the only mechanism.
So what is this O_DIRECT thing that keeps coming up again and again here :) What is the use case? Note that a bio will always have valid struct pages of regular memory, as using a PCIe BAR for a filesystem is crazy (you do not have atomics or cache coherence, and many CPU instructions have _undefined_ effects, so whatever userspace does might end up doing nothing).
Now if you want to use a BAR address as the destination or source of directIO then let's just update the directIO code to handle this. There is no need to go hack every single place in the kernel that might deal with struct page or SGLs. Just update the places that need to understand this. We can even update directIO to work on weird platforms. The change to directIO would be small, a couple hundred lines of code at most.
Cheers, Jérôme
[1] https://lore.kernel.org/linux-fsdevel/20190123222315.1122-1-jglisse@redhat.c...
On 2019-01-31 12:35 p.m., Jerome Glisse wrote:
The point is to be able to use a BAR as the source of data to write/read from a file system. So as a simple example, if an NVMe drive had a CMB, and you could map that CMB to userspace, you could do an O_DIRECT read to the BAR on one drive and an O_DIRECT write from the BAR on another drive. Thus you could bypass the upstream port of a switch (and therefore all CPU resources) altogether.
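In userspace terms, and assuming the CMB can be mmap()ed through some character device (the /dev paths below are made up), it would look roughly like:

/*
 * Userspace illustration of the example above.  /dev/example-cmb0 is a
 * made-up device exposing the CMB BAR; the NVMe block device names are
 * placeholders.  The payload moves drive -> BAR -> drive without ever
 * landing in host RAM.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHUNK	(1 << 20)

int main(void)
{
	int cmb_fd = open("/dev/example-cmb0", O_RDWR);
	int src_fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	int dst_fd = open("/dev/nvme1n1", O_WRONLY | O_DIRECT);
	void *bar;

	if (cmb_fd < 0 || src_fd < 0 || dst_fd < 0)
		return EXIT_FAILURE;

	/* Map the CMB BAR into our address space. */
	bar = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE, MAP_SHARED, cmb_fd, 0);
	if (bar == MAP_FAILED)
		return EXIT_FAILURE;

	/* O_DIRECT read lands in the BAR, O_DIRECT write sources from it. */
	if (pread(src_fd, bar, CHUNK, 0) != CHUNK)
		return EXIT_FAILURE;
	if (pwrite(dst_fd, bar, CHUNK, 0) != CHUNK)
		return EXIT_FAILURE;

	return EXIT_SUCCESS;
}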
For the most part nobody would want to put a filesystem on a BAR. (Though there have been some crazy ideas to put persistent memory behind a CMB...)
Well if you want to figure out how to remove struct page from the entire block layer that would help everybody. But until then, it's pretty much impossible to use the block layer (and therefore O_DIRECT) without struct page.
Logan
On Thu, Jan 31, 2019 at 02:35:14PM -0500, Jerome Glisse wrote:
I'm talking about almost exactly what you've done in here - make a 'sgl' that is dma addresses only.
In these VMA patches you used a simple array of physical addresses - I'm only talking about moving that array into a 'dma sgl'.
The flow is still basically the same - the driver directly gets DMA physical addresses with no possibility to get a struct page or CPU memory.
And then we can build more stuff around the 'dma sgl', including the in-kernel users Logan is worrying about.
Jason