Hi Logan/All,
I have added a check for the sg_dma_len == 0 : """ } __sgt_iter(struct scatterlist *sgl, bool dma) { struct sgt_iter s = { .sgp = sgl };
+ if (sgl && sg_dma_len(sgl) == 0) + s.sgp = NULL;
if (s.sgp) { ..... """ at location [1]. but it doens't fix the problem.
You're right though, this change does need to be made, this code doesn't handle pages of sg_dma_len(sg) == 0 correctly So my guess is that we have more bugs in other parts of the i915 driver (or there is a problem with my "sg_dma_len == 0" fix above). I have been trying to spot where else the code might be buggy but I haven't had any luck so far.
I'm doing a microconfernce (at LPC 2020) this wednesdays [1] on this if you're interested in attending. I'm hoping I can chat about it with a few people and find how can reproduce and fix this issues. I don't have any more time I can give to this unfortunately and it would be a shame for the work to go to waste.
[0] https://github.com/torvalds/linux/blob/d012a7190fc1fd72ed48911e77ca97ba4521b... [1] https://linuxplumbersconf.org/event/7/contributions/846/
On Fri, 29 May 2020 at 22:21, Logan Gunthorpe logang@deltatee.com wrote:
On 2020-05-29 3:11 p.m., Marek Szyprowski wrote:
Patches are pending: https://lore.kernel.org/linux-iommu/20200513132114.6046-1-m.szyprowski@samsu...
Cool, nice! Though, I still don't think that fixes the issue in i915_scatterlist.h given it still ignores sg_dma_len() and strictly relies on sg_next()/sg_is_last() to stop iterating -- and I suspect this is the bug that got in Tom's way.
However, as Robin pointed out, there are other ugly tricks like stopping iterating through the SGL when sg_dma_len() is zero. For example, the AMD driver appears to use drm_prime_sg_to_page_addr_arrays() which does this trick and thus likely isn't buggy (otherwise, I'd expect someone to have complained by now seeing AMD has already switched to IOMMU-DMA.
I'm not sure that this is a trick. Stopping at zero sg_dma_len() was somewhere documented.
Well whatever you want to call it, it is ugly to have some drivers doing one thing with the returned value and others assuming there's an extra zero at the end. It just causes confusion for people reading/copying the code. It would be better if they are all consistent. However, I concede stopping at zero should not be broken, presently.
Logan