https://bugzilla.kernel.org/show_bug.cgi?id=206895
Bug ID: 206895
Summary: [amdgpu] crash while using opencl from amdgpu-pro on kernel 5.5.10
Product: Drivers
Version: 2.5
Kernel Version: 5.5.10
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: bigbeeshane@gmail.com
Regression: No
Created attachment 287987
  --> https://bugzilla.kernel.org/attachment.cgi?id=287987&action=edit
crash log
I have found that using the amdgpu-pro OpenCL stack with kernel 5.5.10 causes a crash (see the attached log). I have seen this while running folding@home.
Reverting to 5.4.26 with no other changes fixes the issue.
Alex Deucher (alexdeucher@gmail.com) changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |alexdeucher@gmail.com
--- Comment #1 from Alex Deucher (alexdeucher@gmail.com) --- Can you bisect?
--- Comment #2 from bigbeeshane@gmail.com --- Yes, should be able to over the weekend. Will report my findings.
stefanspr94@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |stefanspr94@gmail.com
--- Comment #3 from stefanspr94@gmail.com --- These two commits break AMDGPU-PRO OpenCL and ROCm. I guess userspace needs updating.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?i...
--- Comment #4 from bigbeeshane@gmail.com --- If that is the case, isn't the issue in the kernel rather than in the user space applications?
In that case amdgpu is incompatible with any of the OpenCL 2.x implementations.
--- Comment #5 from stefanspr94@gmail.com --- That depends on whether the changes to the DMA mechanics were meant to be compatible with the old implementation, but I can't answer that as I am no AMD developer.
--- Comment #6 from bigbeeshane@gmail.com --- Looks like it's more to do with switching from amd-iommu to dma-iommu (see my bisect below):
git bisect start
# good: [e87eb585d31fadb5e9e549a1de4b2da60a79bfc9] Merge branch 'pci/misc'
git bisect good e87eb585d31fadb5e9e549a1de4b2da60a79bfc9
# bad: [c3bed3b20e40ab44b98ac5f0471a5bd92a802f5a] Merge tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
git bisect bad c3bed3b20e40ab44b98ac5f0471a5bd92a802f5a
# good: [3f1b210a7f97f7e75c56174ada476fba2d36f340] Merge tag 'sound-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good 3f1b210a7f97f7e75c56174ada476fba2d36f340
# skip: [a6ed68d6468bd5a3da78a103344ded1435fed57a] Merge tag 'drm-next-2019-11-27' of git://anongit.freedesktop.org/drm/drm
git bisect skip a6ed68d6468bd5a3da78a103344ded1435fed57a
# skip: [3f86a7e090d1dfb974a9dc9d44049f9bff01e6a5] gpiolib: acpi: Print pin number on acpi_gpiochip_alloc_event errors
git bisect skip 3f86a7e090d1dfb974a9dc9d44049f9bff01e6a5
# good: [32d1fe8fcb32130733b59fc447e35753dc87fd40] mm/hotplug: reorder memblock_[free|remove]() calls in try_remove_memory()
git bisect good 32d1fe8fcb32130733b59fc447e35753dc87fd40
# good: [a5255bc31673c72e264d837cd13cd3085d72cb58] Merge tag 'dmaengine-5.5-rc1' of git://git.infradead.org/users/vkoul/slave-dma
git bisect good a5255bc31673c72e264d837cd13cd3085d72cb58
# bad: [9b326948c23908692d7dfe56ed149840d3829eaa] Merge tag 'firewire-update' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394
git bisect bad 9b326948c23908692d7dfe56ed149840d3829eaa
# bad: [937d6eefc716a9071f0e3bada19200de1bb9d048] Merge tag 'docs-5.5a' of git://git.lwn.net/linux
git bisect bad 937d6eefc716a9071f0e3bada19200de1bb9d048
# good: [a8de1304b7df30e3a14f2a8b9709bb4ff31a0385] libfdt: define INT32_MAX and UINT32_MAX in libfdt_env.h
git bisect good a8de1304b7df30e3a14f2a8b9709bb4ff31a0385
# good: [bf23a48edbe331f834eb49d1bd6484ae98cf4dc7] Documentation/translation: Use Korean for Korean translation title
git bisect good bf23a48edbe331f834eb49d1bd6484ae98cf4dc7
# good: [34d1b0895dbd10713c73615d8f532e78509e12d9] iommu/arm-smmu: Remove duplicate error message
git bisect good 34d1b0895dbd10713c73615d8f532e78509e12d9
# bad: [9b3a713feef8db41d4bcccb3b97e86ee906690c8] Merge branches 'iommu/fixes', 'arm/qcom', 'arm/renesas', 'arm/rockchip', 'arm/mediatek', 'arm/tegra', 'arm/smmu', 'x86/amd', 'x86/vt-d', 'virtio' and 'core' into next
git bisect bad 9b3a713feef8db41d4bcccb3b97e86ee906690c8
# bad: [3c124435e8dd516df4b2fc983f4415386fd6edae] iommu/amd: Support multiple PCI DMA aliases in IRQ Remapping
git bisect bad 3c124435e8dd516df4b2fc983f4415386fd6edae
# bad: [be62dbf554c5b50718a54a359372c148cd9975c7] iommu/amd: Convert AMD iommu driver to the dma-iommu api
git bisect bad be62dbf554c5b50718a54a359372c148cd9975c7
# good: [781ca2de89bae1b1d2c96df9ef33e9a324415995] iommu: Add gfp parameter to iommu_ops::map
git bisect good 781ca2de89bae1b1d2c96df9ef33e9a324415995
# good: [6e2350207f40e24884da262976f7fd4fba387e8a] iommu/dma-iommu: Use the dev->coherent_dma_mask
git bisect good 6e2350207f40e24884da262976f7fd4fba387e8a
# first bad commit: [be62dbf554c5b50718a54a359372c148cd9975c7] iommu/amd: Convert AMD iommu driver to the dma-iommu api
I am going to *try* to revert that change and see if it fixes the issue. I will also check whether the latest 5.6 rc shows the errors.
--- Comment #7 from bigbeeshane@gmail.com --- It seems some other issues are being reported against this commit:
https://bugzilla.kernel.org/show_bug.cgi?id=206461
--- Comment #8 from bigbeeshane@gmail.com --- After some further validation, 5.6-rc6 also has this bug.
Reverting be62dbf554c5b50718a54a359372c148cd9975c7 fixes the issue, but overall it seems that amdgpu is not using the new implementation of dma_map_sg correctly.
Looking at the documentation (here: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Docume...), it seems that the return value of dma_map_sg can differ from the value supplied for nents.
Currently the amdgpu driver code checks that the return value of dma_map_sg equals nents, and bails out of amdgpu_ttm_tt_pin_userptr otherwise. See this line:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/driver...
This would explain the "*ERROR* failed to pin userptr" message followed by the trace.
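The mismatch described above can be illustrated with a small userspace model. This is only a sketch, not kernel code: `struct toy_sg`, `toy_map_sg`, `strict_check_ok` and `documented_check_ok` are hypothetical names, and the toy mapper merges entries whenever they are contiguous, which is an assumption standing in for whatever an IOMMU-backed dma_map_sg actually decides to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one scatterlist entry (not the kernel struct). */
struct toy_sg {
	uint64_t addr;     /* CPU-side address of the chunk */
	uint32_t length;   /* CPU-side length of the chunk */
	uint64_t dma_addr; /* filled in by toy_map_sg() */
	uint32_t dma_len;  /* filled in by toy_map_sg() */
};

/*
 * Toy mapping call: merges contiguous entries and returns the number of
 * DMA segments produced, which may be smaller than nents. Like dma_map_sg,
 * it returns 0 on failure.
 */
static int toy_map_sg(struct toy_sg *sg, int nents)
{
	int out = 0;

	if (sg == NULL || nents <= 0)
		return 0;

	for (int i = 0; i < nents; i++) {
		if (out > 0 &&
		    sg[out - 1].dma_addr + sg[out - 1].dma_len == sg[i].addr) {
			/* Contiguous with the previous output segment: extend it. */
			sg[out - 1].dma_len += sg[i].length;
		} else {
			/* Start a new output segment. */
			sg[out].dma_addr = sg[i].addr;
			sg[out].dma_len = sg[i].length;
			out++;
		}
	}

	/* Entries past the mapped count carry no DMA data. */
	for (int i = out; i < nents; i++)
		sg[i].dma_len = 0;

	return out;
}

/* The check described in the thread: any merge is treated as an error. */
static int strict_check_ok(int mapped, int nents)
{
	return mapped == nents;
}

/* A check that only treats a return value of 0 as failure. */
static int documented_check_ok(int mapped)
{
	return mapped != 0;
}
```

With three input entries of which the first two are contiguous, `toy_map_sg` returns 2, so a caller using `strict_check_ok` would bail out even though the mapping succeeded.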
--- Comment #9 from bigbeeshane@gmail.com --- I also validated last night that the following patch, which disables the merging of sg sections within dma-iommu, fixes the issue I am seeing:
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -779,7 +779,7 @@ static int __finalise_sg(struct device *dev, struct scatterlist *sg, int nents,
 		 * - but doesn't fall at a segment boundary
 		 * - and wouldn't make the resulting output segment too long
 		 */
-		if (cur_len && !s_iova_off && (dma_addr & seg_mask) &&
+		if (0 && cur_len && !s_iova_off && (dma_addr & seg_mask) &&
 		    (max_len - cur_len >= s_length)) {
I guess amdgpu needs to be updated to handle the case where the iommu driver merges some of the requested segments?
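For readers unfamiliar with the condition touched by the diff, it can be restated as a standalone predicate. This is a userspace sketch only: `toy_can_merge` is a hypothetical name, the parameter names follow the diff, the integer types are assumptions, and the first two comments are an interpretation of the code rather than text from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace restatement of the merge condition shown in the diff.
 * Returns true when the current input segment would be appended to the
 * output segment being built instead of starting a new one.
 */
static bool toy_can_merge(uint32_t cur_len, uint32_t s_iova_off,
			  uint64_t dma_addr, uint64_t seg_mask,
			  uint32_t max_len, uint32_t s_length)
{
	/* Guard the unsigned subtraction below against wrap-around. */
	assert(cur_len <= max_len);

	return cur_len                   /* an output segment already exists */
	    && !s_iova_off               /* input has no offset in its IOVA page */
	    && (dma_addr & seg_mask)     /* not at a segment boundary */
	    && (max_len - cur_len >= s_length); /* result would not be too long */
}
```

The test patch prepends `0 &&` to this condition, so the merge branch is never taken and the mapped count stays equal to the number of input entries.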
bigbeeshane@gmail.com changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
     Kernel Version|5.5.10                      |5.5.10 & 5.6.0-rc6
            Summary|[amdgpu] crash while using  |[amdgpu] crash while using
                   |opencl from amdgpu-pro on   |opencl from amdgpu-pro on
                   |kernel 5.5.10               |kernel 5.5.10 & 5.6.0-rc6
--- Comment #10 from bigbeeshane@gmail.com --- Created attachment 288017
  --> https://bugzilla.kernel.org/attachment.cgi?id=288017&action=edit
amdgpu_possible_patch
drm_prime_sg_to_page_addr_arrays does not support the case where the number of segments returned by dma_map_sg differs from the number passed in (which can happen).
Add and use a version that takes the count returned by dma_map_sg and uses the correct sg_dma_len macro.
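The idea of walking the mapped segments by their returned count and DMA length can be sketched in userspace as follows. This is not the attached patch: `struct toy_dma_seg`, `toy_segs_to_page_addrs` and `TOY_PAGE_SIZE` are hypothetical, and the 4096-byte page size is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u /* assumed page size for this sketch */

/* Hypothetical view of one mapped segment: what sg_dma_address() and
 * sg_dma_len() would report in the kernel. */
struct toy_dma_seg {
	uint64_t dma_addr;
	uint32_t dma_len;
};

/*
 * Expand `count` mapped segments into one DMA address per page.
 * `count` is the value returned by the mapping call, not the number of
 * entries originally submitted. Returns the number of pages written,
 * or -1 if addrs[] is too small.
 */
static int toy_segs_to_page_addrs(const struct toy_dma_seg *segs, int count,
				  uint64_t *addrs, int max_pages)
{
	int page = 0;

	assert(segs != NULL && addrs != NULL);

	for (int i = 0; i < count; i++) {
		uint64_t addr = segs[i].dma_addr;
		uint32_t len = segs[i].dma_len;

		/* A merged segment may span several pages. */
		while (len > 0) {
			if (page >= max_pages)
				return -1;
			addrs[page++] = addr;
			addr += TOY_PAGE_SIZE;
			len = len > TOY_PAGE_SIZE ? len - TOY_PAGE_SIZE : 0;
		}
	}
	return page;
}
```

Two mapped segments of 8192 and 4096 bytes expand to three page addresses, regardless of how many entries the original list had before merging.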
--- Comment #11 from Alex Deucher (alexdeucher@gmail.com) --- Thanks for the patch. Please fix drm_prime_sg_to_page_addr_arrays() directly and send the patch to dri-devel@lists.freedesktop.org. Also please add your Signed-off-by.
--- Comment #12 from Alex Deucher (alexdeucher@gmail.com) --- It's likely other drivers that rely on these helpers would be similarly broken.
--- Comment #13 from bigbeeshane@gmail.com --- Indeed. However, they may not have pushed the SG lists through dma map in the same way as amdgpu.
In that case, getting lengths from dma_map_sg would probably cause other issues.
--- Comment #14 from Alex Deucher (alexdeucher@gmail.com) --- True. For now just send out the patch and we can discuss further on the list. Thanks!
--- Comment #15 from Alex Deucher (alexdeucher@gmail.com) --- General comment about the patch: you can make amdgpu_ttm_dma_sg_to_arrays static since it's only used within amdgpu_ttm.c.
--- Comment #16 from bigbeeshane@gmail.com --- I'll update drm_prime_sg_to_page_addr_arrays to support both the current logic and the dma mapped logic and get a patch up this evening.
That way at least nothing else gets broken.