On Tue, Jan 11, 2022 at 04:32:56AM +0000, Matthew Wilcox wrote:
On Mon, Jan 10, 2022 at 08:41:26PM -0400, Jason Gunthorpe wrote:
On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:
Finally, it may be possible to stop using scatterlist to describe the input to the DMA-mapping operation. We may be able to get struct scatterlist down to just dma_address and dma_length, with chaining handled through an enclosing struct.
Can you talk about this some more? IMHO one of the key properties of the scatterlist is that it can hold huge amounts of pages without having to do any kind of special allocation due to the chaining.
The same will be true of the phyr idea right?
My thinking is that we'd pass a relatively small array of phyr (maybe 16 entries) to get_user_phyr(). If that turned out not to be big enough, then we have two options; one is to map those 16 ranges with sg and use the sg chaining functionality before throwing away the phyr and calling get_user_phyr() again.
Then we are we using get_user_phyr() at all if we are just storing it in a sg?
Also 16 entries is way to small, it should be at least a whole PMD worth so we don't have to relock the PMD level each iteration.
I would like to see a flow more like:
cpu_phyr_list = get_user_phyr(uptr, 1G); dma_phyr_list = dma_map_phyr(device, cpu_phyr_list); [..] dma_unmap_phyr(device, dma_phyr_list); unpin_drity_free(cpu_phy_list);
Where dma_map_phyr() can build a temporary SGL for old iommu drivers compatability. iommu drivers would want to implement natively, of course.
ie no loops in drivers.
The question is whether this is the right kind of optimisation to be doing. I hear you that we want a dense format, but it's questionable whether the kind of thing you're suggesting is actually denser than this scheme. For example, if we have 1GB pages and userspace happens to have allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented as a single phyr. A power-of-two scheme would have us use four entries (3, 4-7, 8-9, 10).
That is not quite what I had in mind..
struct phyr_list { unsigned int first_page_offset_bytes; size_t total_length_bytes; phys_addr_t min_alignment; struct packed_phyr *list_of_pages; };
Where each 'packed_phyr' is an aligned page of some kind. The packing has to be able to represent any number of pfns, so we have four major cases: - 4k pfns (use 8 bytes) - Natural order pfn (use 8 bytes) - 4k aligned pfns, arbitary number (use 12 bytes) - <4k aligned, arbitary length (use 16 bytes?)
In all cases the interior pages are fully used, only the first and last page is sliced based on the two parameters in the phyr_list.
The first_page_offset_bytes/total_length_bytes mean we don't need to use the inefficient coding for many common cases, just stick to the 4k coding and slice the first/last page down.
The last case is, perhaps, a possible route to completely replace scatterlist. Few places need true byte granularity for interior pages, so we can invent some coding to say 'this is 8 byte aligned, and n bytes long' that only fits < 4k or something. Exceptional cases can then still work. I'm not sure what block needs here - is it just 512?
Basically think of list_of_pages as showing a contiguous list of at least min_aligned pages and first_page_offset_bytes/total_length_bytes taking a byte granular slice out of that logical range.
From a HW perspective I see two basic modalities:
- Streaming HW, which read/writes in a single pass (think NVMe/storage/network). Usually takes a list of dma_addr_t and length that HW just walks over. Rarely cares about things like page boundaries. Optimization goal is to minimize list length. In this case we map each packed_phyr into a HW SGL
- Random Access HW, which is randomly touching memory (think RDMA, VFIO, DRM, IOMMU). Usually stores either a linear list of same-size dma_addr_t pages, or a radix tree page table of dma_addr_t. Needs to have a minimum alignment of each chunk (usually 4k) to represent it. Optimization goal is to have maximum page size. In this case we use min_alignment to size the HW array and decode the packed_phyrs into individual pages.
Using a (dma_addr, size_t) tuple makes coalescing adjacent pages very cheap.
With the above this still works, the very last entry in list_of_pages would be the 12 byte pfn type and when we start a new page the logic would then optimize it down to 8 bytes, if possible. At that point we know we are not going to change it:
- An interior page that is up perfectly aligned is represented as a natural order - A starting page that ends on perfect alignment is widened to natural order and first_page_offset_bytes is corrected - An ending page that starts on perfect alignment is widened to natural order and total_length_bytes is set (though no harm in keeping the 12 byte represetation I suppose)
The main detail is to make the extra 4 bytes needed to store the arbtiary pfn counts optional so when we don't need it, it isn't there.
VFIO would like this structure as well as it currently is a very inefficient page at a time loop when it iommu maps things.
I agree that you need these things. I think I'll run into trouble if I try to do them for you ... so I'm going to stop after doing the top end (pinning pages and getting them into an sg list) and let people who know that area better than I do work on that.
I agree, that is why I was asking for the datastructure 'phyr_list', with chaining and so on.
I would imagine a few steps to this process: 1) 'phyr_list' datastructure, with chaining, pre-allocation, etc 2) Wrapper around existing gup to get a phyr_list for user VA 3) Compat 'dma_map_phyr()' that coverts a phyr_list to a sgl and back (However, with full performance for iommu passthrough) 4) Patches changing RDMA/VFIO/DRM to this API 5) Patches optimizing get_user_phyr() 6) Patches implementing dma_map_phyr in the AMD or Intel IOMMU driver
Obviously not all done by you.. I'm happy to help and take a swing at the RDMA and VFIO parts.
I feel like we can go ahead with RDMA so long as the passthrough IOMMU case is 100% efficient. VFIO is already maximually inefficient here, so no worry there already. If VFIO and RDMA consume these APIs then the server IOMMU driver owners should have a strong motivation to implement.
My feeling is that at the core this project is about making a better datastructure than scattterlist that can mostly replace it, then going around the kernel and converting scatterlist users.
Given all the usage considerations I think it is an interesting datastructure.
Jason