On Tue, Jan 11, 2022 at 04:21:59PM -0400, Jason Gunthorpe wrote:
On Tue, Jan 11, 2022 at 06:33:57PM +0000, Matthew Wilcox wrote:
Then why are we using get_user_phyr() at all if we are just storing it in a sg?
I did consider just implementing get_user_sg() (actually 4 years ago), but that cements the use of sg as both an input and output data structure for DMA mapping, which I am under the impression we're trying to get away from.
I know every time I've talked about a get_user_sg() Christoph has been against it, saying we need to stop using scatterlists...
Also 16 entries is way too small, it should be at least a whole PMD worth so we don't have to relock the PMD level each iteration.
I would like to see a flow more like:
	cpu_phyr_list = get_user_phyr(uptr, 1G);
	dma_phyr_list = dma_map_phyr(device, cpu_phyr_list);
	[..]
	dma_unmap_phyr(device, dma_phyr_list);
	unpin_dirty_free(cpu_phyr_list);
Where dma_map_phyr() can build a temporary SGL for compatibility with old iommu drivers. iommu drivers would want to implement it natively, of course.
ie no loops in drivers.
Let me just rewrite that for you ...
	umem->phyrs = get_user_phyrs(addr, size, &umem->phyr_len);
	umem->sgt = dma_map_phyrs(device, umem->phyrs, umem->phyr_len,
				  DMA_BIDIRECTIONAL, dma_attr);
	...
	dma_unmap_phyrs(device, umem->phyrs, umem->phyr_len,
			umem->sgt->sgl, umem->sgt->nents,
			DMA_BIDIRECTIONAL, dma_attr);
	sg_free_table(umem->sgt);
	free_user_phyrs(umem->phyrs, umem->phyr_len);
Why? As above, we want to get rid of the sgl, so you are telling me that to adopt phyrs I need to increase memory consumption by a hefty amount to store the phyrs and still keep the sgt? Why?
I don't need the sgt at all. I just need another list of physical addresses for DMA. I see no issue with a phyr_list storing either CPU physical addresses or DMA physical addresses in the same data structure.
There's a difference between a phys_addr_t and a dma_addr_t. They can even be different sizes; some architectures use a 32-bit dma_addr_t and a 64-bit phys_addr_t or vice-versa. phyr cannot store DMA addresses.
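To make the size point concrete, one hypothetical configuration (roughly what a 32-bit machine with PAE but a 32-bit bus looks like; not any specific arch):

	typedef u64 phys_addr_t;	/* CPU physical: >32 bits with PAE */
	typedef u32 dma_addr_t;		/* device/bus addresses: 32 bits */

A list of phys_addr_t cannot hold dma_addr_t values in general, nor vice versa.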
In the fairly important passthrough DMA case the CPU list and DMA list are identical, so we don't even need to do anything.
In the typical iommu case the dma-mapped phyr list would be only one entry.
That becomes a very simple sg table then.
As an example coding - Use the first 8 bytes to encode this:
	51:0  - Physical address / 4k (ie pfn)
	56:52 - Order (simple, your order encoding can do better)
	61:57 - Unused
	63:62 - Mode, one of:
		00 = natural order pfn (8 bytes)
		01 = order aligned with length (12 bytes)
		10 = arbitrary (12 bytes)
Then the optional 4 bytes are used as:
	Mode 01 (up to 2^48 bytes of memory on a 4k alignment):
		31:0 - # of order pages
	Mode 10 (up to 2^25 bytes of memory on a 1 byte alignment):
		11:0  - starting byte offset in the 4k
		31:12 - 20 bits, plus the 5 bit order from the first
			8 bytes: length in bytes
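A rough decoder for that encoding, just to sanity-check it (a sketch under the field layout above; the names are invented, and the byte order and mode-00 length are my assumptions):

	/* Decode one entry; returns bytes consumed (8 or 12). */
	static int phyr_decode(const u8 *p, u64 *phys, u64 *len)
	{
		u64 w = get_unaligned_le64(p);
		unsigned int order = (w >> 52) & 0x1f;
		u32 extra;

		*phys = (w & GENMASK_ULL(51, 0)) << 12;	/* pfn -> address */
		switch (w >> 62) {
		case 0:	/* natural order pfn: one 2^order page run */
			*len = 4096ULL << order;
			return 8;
		case 1:	/* order aligned with length */
			extra = get_unaligned_le32(p + 8);
			*len = (u64)extra << (order + 12);  /* # of order pages */
			return 12;
		case 2:	/* arbitrary */
			extra = get_unaligned_le32(p + 8);
			*phys += extra & 0xfff;		/* byte offset in the 4k */
			/* 20 bits here plus the 5 bit order: 25 bit length */
			*len = (extra >> 12) | ((u64)order << 20);
			return 12;
		}
		return -1;
	}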
Honestly, this looks awful to operate on. Mandatory 8-bytes per entry with an optional 4 byte extension?
The last case is, perhaps, a possible route to completely replace scatterlist. Few places need true byte granularity for interior pages, so we can invent some coding to say 'this is 8 byte aligned, and n bytes long' that only fits < 4k or something. Exceptional cases can then still work. I'm not sure what block needs here - is it just 512?
Replacing scatterlist is not my goal. That seems like a lot more work for little gain.
Well, I'm not comfortable with the idea above where RDMA would have to take a memory penalty to use the new interface. To avoid that memory penalty we need to get rid of scatterlist entirely.
If we do the 16 byte struct from the first email then a umem for MRs will increase in memory consumption by 160% compared to today's 24 bytes/page. I think the HPC workloads will veto this.
Huh? We do 16 bytes per physically contiguous range. Then, if your HPC workloads use an IOMMU that can map a virtually contiguous range into a single sg entry, it uses 24 bytes for the entire mapping. It should shrink.
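To put rough numbers on it (my own back-of-envelope, assuming 4k pages): a 1G pin is 256k pages, i.e. ~6M of sg entries at 24 bytes/page today. If gup hands that back as, say, 4k physically contiguous ranges, it is 4k * 16 bytes = 64k of phyrs, plus a single 24-byte entry once the IOMMU maps it virtually contiguous.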
I just want to delete page_link, offset and length from struct scatterlist. Given the above sequence of calls, we're going to get sg lists that aren't chained. They may have to be vmalloced, but they should be contiguous.
I don't understand that? Why would the SGL out of the iommu suddenly not be chained?
Because it's being given a single set of ranges to map, instead of being given 512 pages at a time.
From what I've heard I'm also not keen on a physr list using vmalloc either; that is said to be quite slow?
It would only be slow for degenerate cases where the pinned memory is fragmented and not contiguous.
I would imagine a few steps to this process:
- 'phyr_list' data structure, with chaining, pre-allocation, etc (a strawman sketch follows this list)
- Wrapper around existing gup to get a phyr_list for user VA
- Compat 'dma_map_phyr()' that converts a phyr_list to a sgl and back (however, with full performance for iommu passthrough)
- Patches changing RDMA/VFIO/DRM to this API
- Patches optimizing get_user_phyr()
- Patches implementing dma_map_phyr in the AMD or Intel IOMMU driver
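As a strawman for the first step (all names and the exact layout hypothetical, just to make the shape concrete):

	/* one physically contiguous CPU range */
	struct phyr {
		phys_addr_t	addr;
		size_t		len;
	};

	/* chainable, pre-allocatable list of ranges, analogous to how
	 * scatterlist chunks chain today */
	struct phyr_list {
		struct phyr_list *next;		/* next chunk, or NULL */
		unsigned int	 nr;		/* ranges used in this chunk */
		unsigned int	 max;		/* ranges allocated */
		struct phyr	 ranges[];
	};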
I was thinking ...
- get_user_phyrs() & free_user_phyrs()
- dma_map_phyrs() and dma_unmap_phyrs() wrappers that create a scatterlist from the phyrs and call dma_map_sg() / dma_unmap_sg() to work with current IOMMU drivers (a rough sketch below)
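Something like this for the compat wrapper, perhaps (a minimal sketch assuming the single-chunk phyr_list strawman above; signatures hypothetical, error handling simplified, and each range assumed to fit in an unsigned int):

	static struct sg_table *dma_map_phyrs(struct device *dev,
					      struct phyr_list *pl,
					      enum dma_data_direction dir,
					      unsigned long attrs)
	{
		struct sg_table *sgt;
		struct scatterlist *sg;
		unsigned int i;

		sgt = kzalloc(sizeof(*sgt), GFP_KERNEL);
		if (!sgt)
			return ERR_PTR(-ENOMEM);
		if (sg_alloc_table(sgt, pl->nr, GFP_KERNEL))
			goto free_sgt;

		/* temporary sgl so existing iommu drivers keep working */
		sg = sgt->sgl;
		for (i = 0; i < pl->nr; i++, sg = sg_next(sg))
			sg_set_page(sg, pfn_to_page(PHYS_PFN(pl->ranges[i].addr)),
				    pl->ranges[i].len,
				    offset_in_page(pl->ranges[i].addr));

		if (dma_map_sg_attrs(dev, sgt->sgl, sgt->orig_nents, dir, attrs) <= 0)
			goto free_table;
		return sgt;

	free_table:
		sg_free_table(sgt);
	free_sgt:
		kfree(sgt);
		return ERR_PTR(-ENOMEM);
	}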
IMHO, the scatterlist has to go away. The interface should be physr list in, physr list out.
That's reproducing the bad decision of the scatterlist, only with a different encoding. You end up with something like:
	struct neoscat {
		dma_addr_t	dma_addr;
		phys_addr_t	phys_addr;
		size_t		dma_len;
		size_t		phys_len;
	};
and the dma_addr and dma_len are unused by all but the first entry when you have a competent IOMMU. We want a different data structure in and out, and we may as well keep using the scatterlist for the dma-map-out.