On Tue, Jan 11, 2022 at 06:53:06PM -0400, Jason Gunthorpe wrote:
> An IOMMU is not common in those cases; it is slow.
>
> So you end up with 16 bytes per entry, then another 24 bytes in the
> entirely redundant scatter list.  That is now 40 bytes/page for the
> typical HPC case, and I can't see that being OK.
Ah, I didn't realise what case you wanted to optimise for.
So, how about this ...
Since you want to get to the same destination as I do (a 16-byte-per-entry dma_addr+dma_len struct), but need to get there sooner than "make all sg users stop using it wrongly", let's introduce a (hopefully temporary) "struct dma_range".
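To make that concrete, here's a minimal sketch of the struct I have in
mind; the field names are mine for illustration and nothing about the
layout is settled:

	/* 16 bytes per entry on 64-bit: just a bus address and a length */
	struct dma_range {
		dma_addr_t	addr;
		u64		len;
	};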
But let's go further than that (which only brings us to 32 bytes per range: 16 for the phyr plus 16 for the dma_range). For the systems you care about, which use an identity mapping and have sizeof(dma_addr_t) == sizeof(phys_addr_t), we can simply point the dma_range pointer at the same memory as the phyr. We just have to be careful not to free it too early. That gets us down to 16 bytes per range, a saving of 33%.
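Roughly like this, as a sketch of the aliasing trick; dma_map_phyrs(),
dma_is_identity() and iommu_map_phyrs() are made-up names to show the
idea, and it assumes struct phyr is { phys_addr_t addr; size_t len; }
with a layout identical to struct dma_range:

	static struct dma_range *dma_map_phyrs(struct device *dev,
			struct phyr *p, unsigned int nr)
	{
		/*
		 * Identity mapping and matching type sizes: the phyr
		 * array already *is* the dma_range array.  Just don't
		 * free the phyrs until the mapping is torn down.
		 */
		if (dma_is_identity(dev) &&	/* hypothetical helper */
		    sizeof(dma_addr_t) == sizeof(phys_addr_t))
			return (struct dma_range *)p;

		/* Otherwise allocate 16 bytes/range and fill via the IOMMU */
		return iommu_map_phyrs(dev, p, nr);	/* also hypothetical */
	}

The unmap path would then have to know whether it owns the dma_range
array or is just borrowing the phyr memory, which is the "not free it
too early" part above.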