Re: Phyr Starter

11 Jan 2022


      On Tue, Jan 11, 2022 at 04:32:56AM +0000, Matthew Wilcox wrote:
...
On Mon, Jan 10, 2022 at 08:41:26PM -0400, Jason Gunthorpe wrote:
...
On Mon, Jan 10, 2022 at 07:34:49PM +0000, Matthew Wilcox wrote:
...
Finally, it may be possible to stop using scatterlist to describe the
input to the DMA-mapping operation.  We may be able to get struct
scatterlist down to just dma_address and dma_length, with chaining
handled through an enclosing struct.
Can you talk about this some more? IMHO one of the key properties of
the scatterlist is that it can hold huge amounts of pages without
having to do any kind of special allocation due to the chaining.
The same will be true of the phyr idea right?
My thinking is that we'd pass a relatively small array of phyr (maybe 16
entries) to get_user_phyr().  If that turned out not to be big enough,
then we have two options; one is to map those 16 ranges with sg and use
the sg chaining functionality before throwing away the phyr and calling
get_user_phyr() again.
Then we are we using get_user_phyr() at all if we are just storing it
in a sg?
Also 16 entries is way to small, it should be at least a whole PMD
worth so we don't have to relock the PMD level each iteration.
I would like to see a flow more like:
cpu_phyr_list = get_user_phyr(uptr, 1G);
  dma_phyr_list = dma_map_phyr(device, cpu_phyr_list);
  [..]
  dma_unmap_phyr(device, dma_phyr_list);
  unpin_drity_free(cpu_phy_list);
Where dma_map_phyr() can build a temporary SGL for old iommu drivers
compatability. iommu drivers would want to implement natively, of
course.
ie no loops in drivers.
...
The question is whether this is the right kind of optimisation to be
doing.  I hear you that we want a dense format, but it's questionable
whether the kind of thing you're suggesting is actually denser than this
scheme.  For example, if we have 1GB pages and userspace happens to have
allocated pages (3, 4, 5, 6, 7, 8, 9, 10) then this can be represented
as a single phyr.  A power-of-two scheme would have us use four entries
(3, 4-7, 8-9, 10).
That is not quite what I had in mind..
struct phyr_list {
   unsigned int first_page_offset_bytes;
   size_t total_length_bytes;
   phys_addr_t min_alignment;
   struct packed_phyr *list_of_pages;
};
Where each 'packed_phyr' is an aligned page of some kind. The packing
has to be able to represent any number of pfns, so we have four major
cases:
 - 4k pfns (use 8 bytes)
 - Natural order pfn (use 8 bytes)
 - 4k aligned pfns, arbitary number (use 12 bytes)
 - <4k aligned, arbitary length (use 16 bytes?)
In all cases the interior pages are fully used, only the first and
last page is sliced based on the two parameters in the phyr_list.
The first_page_offset_bytes/total_length_bytes mean we don't need to
use the inefficient coding for many common cases, just stick to the 4k
coding and slice the first/last page down.
The last case is, perhaps, a possible route to completely replace
scatterlist. Few places need true byte granularity for interior pages,
so we can invent some coding to say 'this is 8 byte aligned, and n
bytes long' that only fits < 4k or something. Exceptional cases can
then still work. I'm not sure what block needs here - is it just 512?
Basically think of list_of_pages as showing a contiguous list of at
least min_aligned pages and first_page_offset_bytes/total_length_bytes
taking a byte granular slice out of that logical range.
...
From a HW perspective I see two basic modalities:
- Streaming HW, which read/writes in a single pass (think
   NVMe/storage/network). Usually takes a list of dma_addr_t and
   length that HW just walks over. Rarely cares about things like page
   boundaries. Optimization goal is to minimize list length. In this
   case we map each packed_phyr into a HW SGL
- Random Access HW, which is randomly touching memory (think RDMA,
   VFIO, DRM, IOMMU). Usually stores either a linear list of same-size
   dma_addr_t pages, or a radix tree page table of dma_addr_t.
   Needs to have a minimum alignment of each chunk (usually 4k) to
   represent it. Optimization goal is to have maximum page size. In
   this case we use min_alignment to size the HW array and decode the
   packed_phyrs into individual pages.
...
Using a (dma_addr, size_t) tuple makes coalescing adjacent pages very
cheap.
With the above this still works, the very last entry in list_of_pages
would be the 12 byte pfn type and when we start a new page the logic
would then optimize it down to 8 bytes, if possible. At that point we
know we are not going to change it:
- An interior page that is up perfectly aligned is represented as a
   natural order
 - A starting page that ends on perfect alignment is widened to
   natural order and first_page_offset_bytes is corrected
 - An ending page that starts on perfect alignment is widened to
   natural order and total_length_bytes is set
   (though no harm in keeping the 12 byte represetation I suppose)
The main detail is to make the extra 4 bytes needed to store the
arbtiary pfn counts optional so when we don't need it, it isn't there.
...
...
VFIO would like this structure as well as it currently is a very
inefficient page at a time loop when it iommu maps things.
I agree that you need these things.  I think I'll run into trouble
if I try to do them for you ... so I'm going to stop after doing the
top end (pinning pages and getting them into an sg list) and let
people who know that area better than I do work on that.
I agree, that is why I was asking for the datastructure 'phyr_list',
with chaining and so on.
I would imagine a few steps to this process:
 1) 'phyr_list' datastructure, with chaining, pre-allocation, etc
 2) Wrapper around existing gup to get a phyr_list for user VA
 3) Compat 'dma_map_phyr()' that coverts a phyr_list to a sgl and back
    (However, with full performance for iommu passthrough)
 4) Patches changing RDMA/VFIO/DRM to this API
 5) Patches optimizing get_user_phyr()
 6) Patches implementing dma_map_phyr in the AMD or Intel IOMMU driver
Obviously not all done by you.. I'm happy to help and take a swing at
the RDMA and VFIO parts.
I feel like we can go ahead with RDMA so long as the passthrough IOMMU
case is 100% efficient. VFIO is already maximually inefficient here,
so no worry there already. If VFIO and RDMA consume these APIs then
the server IOMMU driver owners should have a strong motivation to
implement.
My feeling is that at the core this project is about making a better
datastructure than scattterlist that can mostly replace it, then going
around the kernel and converting scatterlist users.
Given all the usage considerations I think it is an interesting
datastructure.
Jason

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

Re: Phyr Starter