On Sat, Jul 3, 2021 at 1:55 PM Nathan Chancellor nathan@kernel.org wrote:
Hi Will and Robin,
On Fri, Jul 02, 2021 at 04:13:50PM +0100, Robin Murphy wrote:
On 2021-07-02 14:58, Will Deacon wrote:
Hi Nathan,
On Thu, Jul 01, 2021 at 12:52:20AM -0700, Nathan Chancellor wrote:
On 7/1/2021 12:40 AM, Will Deacon wrote:
On Wed, Jun 30, 2021 at 08:56:51AM -0700, Nathan Chancellor wrote:
On Wed, Jun 30, 2021 at 12:43:48PM +0100, Will Deacon wrote: > On Wed, Jun 30, 2021 at 05:17:27PM +0800, Claire Chang wrote: > > `BUG: unable to handle page fault for address: 00000000003a8290` and > > the fact it crashed at `_raw_spin_lock_irqsave` look like the memory > > (maybe dev->dma_io_tlb_mem) was corrupted? > > The dev->dma_io_tlb_mem should be set here > > (https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/dri...) > > through device_initialize. > > I'm less sure about this. 'dma_io_tlb_mem' should be pointing at > 'io_tlb_default_mem', which is a page-aligned allocation from memblock. > The spinlock is at offset 0x24 in that structure, and looking at the > register dump from the crash: > > Jun 29 18:28:42 hp-4300G kernel: RSP: 0018:ffffadb4013db9e8 EFLAGS: 00010006 > Jun 29 18:28:42 hp-4300G kernel: RAX: 00000000003a8290 RBX: 0000000000000000 RCX: ffff8900572ad580 > Jun 29 18:28:42 hp-4300G kernel: RDX: ffff89005653f024 RSI: 00000000000c0000 RDI: 0000000000001d17 > Jun 29 18:28:42 hp-4300G kernel: RBP: 000000000a20d000 R08: 00000000000c0000 R09: 0000000000000000 > Jun 29 18:28:42 hp-4300G kernel: R10: 000000000a20d000 R11: ffff89005653f000 R12: 0000000000000212 > Jun 29 18:28:42 hp-4300G kernel: R13: 0000000000001000 R14: 0000000000000002 R15: 0000000000200000 > Jun 29 18:28:42 hp-4300G kernel: FS: 00007f1f8898ea40(0000) GS:ffff890057280000(0000) knlGS:0000000000000000 > Jun 29 18:28:42 hp-4300G kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > Jun 29 18:28:42 hp-4300G kernel: CR2: 00000000003a8290 CR3: 00000001020d0000 CR4: 0000000000350ee0 > Jun 29 18:28:42 hp-4300G kernel: Call Trace: > Jun 29 18:28:42 hp-4300G kernel: _raw_spin_lock_irqsave+0x39/0x50 > Jun 29 18:28:42 hp-4300G kernel: swiotlb_tbl_map_single+0x12b/0x4c0 > > Then that correlates with R11 holding the 'dma_io_tlb_mem' pointer and > RDX pointing at the spinlock. Yet RAX is holding junk :/ > > I agree that enabling KASAN would be a good idea, but I also think we > probably need to get some more information out of swiotlb_tbl_map_single() > to see see what exactly is going wrong in there.
I can certainly enable KASAN and if there is any debug print I can add or dump anything, let me know!
I bit the bullet and took v5.13 with swiotlb/for-linus-5.14 merged in, built x86 defconfig and ran it on my laptop. However, it seems to work fine!
Please can you share your .config?
Sure thing, it is attached. It is just Arch Linux's config run through olddefconfig. The original is below in case you need to diff it.
https://raw.githubusercontent.com/archlinux/svntogit-packages/9045405dc83552...
If there is anything more that I can provide, please let me know.
I eventually got this booting (for some reason it was causing LD to SEGV trying to link it for a while...) and sadly it works fine on my laptop. Hmm.
Seems like it might be something specific to the amdgpu module?
Did you manage to try again with KASAN?
Yes, it took a few times to reproduce the issue but I did manage to get a dmesg, please find it attached. I build from commit 7d31f1c65cc9 ("swiotlb: fix implicit debugfs declarations") in Konrad's tree.
Looking at the logs, the use-after-free bug looked somehow relevant (and it's nvme again. Qian's crash is about nvme too):
[ 2.468288] BUG: KASAN: use-after-free in __iommu_dma_unmap_swiotlb+0x64/0xb0 [ 2.468288] Read of size 8 at addr ffff8881d7830000 by task swapper/0/0
[ 2.468288] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc3-debug #1 [ 2.468288] Hardware name: HP HP Desktop M01-F1xxx/87D6, BIOS F.12 12/17/2020 [ 2.468288] Call Trace: [ 2.468288] <IRQ> [ 2.479433] dump_stack+0x9c/0xcf [ 2.479433] print_address_description.constprop.0+0x18/0x130 [ 2.479433] ? __iommu_dma_unmap_swiotlb+0x64/0xb0 [ 2.479433] kasan_report.cold+0x7f/0x111 [ 2.479433] ? __iommu_dma_unmap_swiotlb+0x64/0xb0 [ 2.479433] __iommu_dma_unmap_swiotlb+0x64/0xb0 [ 2.479433] nvme_pci_complete_rq+0x73/0x130 [ 2.479433] blk_complete_reqs+0x6f/0x80 [ 2.479433] __do_softirq+0xfc/0x3be [ 2.479433] irq_exit_rcu+0xce/0x120 [ 2.479433] common_interrupt+0x80/0xa0 [ 2.479433] </IRQ> [ 2.479433] asm_common_interrupt+0x1e/0x40 [ 2.479433] RIP: 0010:cpuidle_enter_state+0xf9/0x590
I wonder if this ended up unmapping something wrong and messing up the dev->dma_io_tlb_mem (i.e. io_tlb_default_mem)?
Could you try this patch on top of 7d31f1c65cc9? This patch helps check if we try to unmap the wrong address.
``` diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index b7f76bca89bf..5ac08d50a394 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -613,6 +613,21 @@ void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr, size_t mapping_size, enum dma_data_direction dir, unsigned long attrs) { + struct io_tlb_mem *mem = dev->dma_io_tlb_mem; + unsigned int offset = swiotlb_align_offset(dev, tlb_addr); + int index; + + if (!is_swiotlb_buffer(dev, tlb_addr - offset)) { + dev_err(dev, "%s: attempt to unmap invalid address (0x%llx, offset=%u)\n", __func__, tlb_addr, offset); + return; + } + + index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT; + if (mem->slots[index].orig_addr == INVALID_PHYS_ADDR) { + dev_err(dev, "%s: memory is not mapped before (0x%llx, offset=%u)\n", __func__, tlb_addr, offset); + return; + } + /* * First, sync the memory before unmapping the entry */ ``` It might be useful to have CONFIG_SLUB_DEBUG=y, CONFIG_SLUB_DEBUG_ON=y and line numbers (scripts/decode_stacktrace.sh) too.
Thank you so much for helping!
It might also be worth taking the IOMMU out of the equation, since that interfaces differently with SWIOTLB and I couldn't figure out the code path from the log you provided. What happens if you boot with "amd_iommu=off swiotlb=force"?
Oh, now there's a thing... the chat from the IOMMU API in the boot log implies that the IOMMU *should* be in the picture - we see that default domains are IOMMU_DOMAIN_DMA default and the GPU 0000:0c:00.0 was added to a group. That means dev->dma_ops should be set and DMA API calls should be going through iommu-dma, yet the callstack in the crash says we've gone straight from dma_map_page_attrs() to swiotlb_map(), implying the inline dma_direct_map_page() path.
If dev->dma_ops didn't look right in the first place, it's perhaps less surprising that dev->dma_io_tlb_mem might be wild as well. It doesn't seem plausible that we should have a race between initialising the device and probing its driver, so maybe the whole dev pointer is getting trampled earlier in the callchain (or is fundamentally wrong to begin with, but from a quick skim of the amdgpu code it did look like adev->dev and adev->pdev are appropriately set early on by amdgpu_pci_probe()).
(although word of warning here: i915 dies horribly on my laptop if I pass swiotlb=force, even with the distro 5.10 kernel)
FWIW I'd imagine you probably need to massively increase the SWIOTLB buffer size to have hope of that working.
Is it worth trying this still then?
Cheers, Nathan