On 11.06.21 at 20:23, Ondrej Zary wrote:
On Friday 11 June 2021 14:38:18 Christian König wrote:
On 10.06.21 at 19:59, Christian König wrote:
On 10.06.21 at 19:50, Ondrej Zary wrote:
[SNIP]
I can't see how this is called from the nouveau code; the only possibility I see is that it is called through the AGP code somehow.
Yes, you're right:
[ 13.192663] Call Trace:
[ 13.192678] dump_stack+0x54/0x68
[ 13.192690] ttm_tt_init+0x11/0x8a [ttm]
[ 13.192699] ttm_agp_tt_create+0x39/0x51 [ttm]
[ 13.192840] nouveau_ttm_tt_create+0x17/0x22 [nouveau]
[ 13.192856] ttm_tt_create+0x78/0x8c [ttm]
[ 13.192864] ttm_bo_handle_move_mem+0x7d/0xca [ttm]
[ 13.192873] ttm_bo_validate+0x92/0xc8 [ttm]
[ 13.192883] ttm_bo_init_reserved+0x216/0x243 [ttm]
[ 13.192892] ttm_bo_init+0x45/0x65 [ttm]
[ 13.193018] ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
[ 13.193150] nouveau_bo_init+0x8c/0x94 [nouveau]
[ 13.193273] ? nouveau_bo_del_io_reserve_lru+0x48/0x48 [nouveau]
[ 13.193407] nouveau_bo_new+0x44/0x57 [nouveau]
[ 13.193537] nouveau_channel_prep+0xa3/0x269 [nouveau]
[ 13.193665] nouveau_channel_new+0x3c/0x5f7 [nouveau]
[ 13.193679] ? slab_free_freelist_hook+0x3b/0xa7
[ 13.193686] ? kfree+0x9e/0x11a
[ 13.193781] ? nvif_object_sclass_put+0xd/0x16 [nouveau]
[ 13.193908] nouveau_drm_device_init+0x2e2/0x646 [nouveau]
[ 13.193924] ? pci_enable_device_flags+0x1e/0xac
[ 13.194052] nouveau_drm_probe+0xeb/0x188 [nouveau]
[ 13.194182] ? nouveau_drm_device_init+0x646/0x646 [nouveau]
[ 13.194195] pci_device_probe+0x89/0xe9
[ 13.194205] really_probe+0x127/0x2a7
[ 13.194212] driver_probe_device+0x5b/0x87
[ 13.194219] device_driver_attach+0x2e/0x41
[ 13.194226] __driver_attach+0x7c/0x83
[ 13.194232] bus_for_each_dev+0x4c/0x66
[ 13.194238] driver_attach+0x14/0x16
[ 13.194244] ? device_driver_attach+0x41/0x41
[ 13.194251] bus_add_driver+0xc5/0x16c
[ 13.194258] driver_register+0x87/0xb9
[ 13.194265] __pci_register_driver+0x38/0x3b
[ 13.194271] ? 0xf0c0d000
[ 13.194362] nouveau_drm_init+0x14c/0x1000 [nouveau]
How is ttm_dma_tt->dma_address allocated?
Mhm, I need to double check how AGP is supposed to work.
Since barely anybody is using it these days it is something which breaks from time to time.
I have no idea how that ever worked in the first place since AGP isn't supposed to sync between CPU/GPU. Everything is coherent for that case.
Anyway, here is a patch which makes those functions check whether the dma_address array is allocated in the first place. Please test it.
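For readers following along without the attachment, the kind of check described above can be sketched in plain userspace C. This is not the patch itself: struct fake_tt and fake_sync_for_device() are hypothetical stand-ins for the TTM structure and the nouveau sync helpers, and the loop only counts entries instead of calling the real DMA sync API.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's dma_addr_t (assumption for this sketch). */
typedef uint64_t dma_addr_t;

/* Hypothetical, simplified model of a TTM page table object. */
struct fake_tt {
    size_t num_pages;
    dma_addr_t *dma_address; /* NULL when the array was never allocated (AGP case) */
};

/*
 * Hypothetical model of a sync helper: walks dma_address[] once per page.
 * Returns the number of entries it would have synced.
 */
static size_t fake_sync_for_device(const struct fake_tt *tt)
{
    size_t i, synced = 0;

    /* The guard: bail out if there is no object or no dma_address array. */
    if (!tt || !tt->dma_address)
        return 0;

    for (i = 0; i < tt->num_pages; i++) {
        /* The real code would hand tt->dma_address[i] to the DMA sync call. */
        (void)tt->dma_address[i];
        synced++;
    }
    return synced;
}
```

Without the guard, the loop would dereference a NULL dma_address for an AGP-backed object, which matches the crash discussed in this thread.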
Thanks, the patch fixes the problem and nouveau now works! Should be applied to 5.12-stable too (5.11 is affected too but EOL).
I will just add a CC stable tag before pushing.
It's weird that it worked before. Looks like dma_address was used uninitialized - it contained some random crap:
[ 12.293304] nouveau_bo_sync_for_device: ttm_dma->dma_address=3e055971 ttm_dma->ttm.num_pages=18
[ 12.293321] ttm_dma->dma_address[0]=0x0
[ 12.293341] ttm_dma->dma_address[1]=0x0
[ 12.293360] ttm_dma->dma_address[2]=0xee728980
[ 12.293379] ttm_dma->dma_address[3]=0xed1cb120
[ 12.293397] ttm_dma->dma_address[4]=0x12
[ 12.293416] ttm_dma->dma_address[5]=0x0
[ 12.293434] ttm_dma->dma_address[6]=0x1
[ 12.293453] ttm_dma->dma_address[7]=0x0
[ 12.293471] ttm_dma->dma_address[8]=0x10000
[ 12.293490] ttm_dma->dma_address[9]=0x0
[ 12.293510] ttm_dma->dma_address[10]=0x101
[ 12.293528] ttm_dma->dma_address[11]=0xee7289ec
[ 12.293546] ttm_dma->dma_address[12]=0xee7289ec
[ 12.293564] ttm_dma->dma_address[13]=0x0
[ 12.293581] ttm_dma->dma_address[14]=0x0
[ 12.293599] ttm_dma->dma_address[15]=0x0
[ 12.293616] ttm_dma->dma_address[16]=0x0
[ 12.293634] ttm_dma->dma_address[17]=0x0
But it did not matter as dma_sync_single_for_device is a no-op here. When dma_address is properly initialized to NULL, it crashes...
Ok, that explains things, but it essentially means that this only worked by coincidence.
I just sent out the patch to Ben, the list and you once more. Please reply with an rb, acked-by and/or tested-by so that I can push it ASAP.
Thanks, Christian.
Thanks for the backtrace, Christian.
I cannot find any assignment executed (in the working code):
$ git grep dma_address\ = drivers/gpu/
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c: sg->sgl->dma_address = addr;
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = &dma->dma_address[offset >> PAGE_SHIFT];
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c: dma_address = (mm_node->start << PAGE_SHIFT) + offset;
drivers/gpu/drm/i915/gvt/scheduler.c: sg->dma_address = addr;
drivers/gpu/drm/i915/i915_gpu_error.c: sg->dma_address = it;
drivers/gpu/drm/ttm/ttm_tt.c: ttm->dma_address = (void *) (ttm->ttm.pages + ttm->ttm.num_pages);
drivers/gpu/drm/ttm/ttm_tt.c: ttm->dma_address = kvmalloc_array(ttm->ttm.num_pages,
drivers/gpu/drm/ttm/ttm_tt.c: ttm_dma->dma_address = NULL;
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_phys_addr;
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_dma_addr;
drivers/gpu/drm/vmwgfx/vmwgfx_ttm_buffer.c: viter->dma_address = &__vmw_piter_sg_addr;
The 2 allocating cases in ttm_tt.c are in ttm_dma_tt_alloc_page_directory() and ttm_sg_tt_alloc_page_directory(). I confirmed by adding printk()s that they're NOT called.