Hi,
I tried latest drm-fixes today and saw a lot of these: Fallout from ttm rework?
/Thomas
[ 298.404788] WARNING: CPU: 1 PID: 3839 at drivers/gpu/drm/ttm/ttm_bo.c:512 ttm_bo_release+0x2b5/0x300 [ttm] [ 298.404795] Modules linked in: nls_utf8 isofs rfcomm tun bridge stp llc ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nfnetlink ip6table_filter ip6_tables iptable_filter cmac bnep vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event intel_rapl_msr snd_ens1371 snd_ac97_codec ac97_bus vmw_balloon intel_rapl_common snd_seq rapl snd_pcm btusb btrtl btbcm btintel bluetooth joydev gameport snd_timer snd_rawmidi snd_seq_device rfkill snd ecdh_generic vmw_vmci ecc soundcore i2c_piix4 auth_rpcgss sunrpc ip_tables vmwgfx drm_kms_helper cec ttm e1000 crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel drm mptspi serio_raw scsi_transport_spi mptscsih mptbase ata_generic pata_acpi uas usb_storage fuse [ 298.404837] CPU: 1 PID: 3839 Comm: thunderbird Tainted: G W 5.12.0-rc2+ #42 [ 298.404839] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019 [ 298.404840] RIP: 0010:ttm_bo_release+0x2b5/0x300 [ttm] [ 298.404845] Code: e8 a0 f3 35 ce e9 da fd ff ff 49 8b 7e 90 b9 30 75 00 00 31 d2 be 01 00 00 00 e8 c6 17 36 ce 49 8b 46 e0 eb 9e 48 89 e8 eb 99 <0f> 0b 41 c7 86 94 00 00 00 00 00 00 00 49 8d 76 08 31 d2 4c 89 ef [ 298.404847] RSP: 0018:ffffb24a43ef3bd0 EFLAGS: 00010202 [ 298.404848] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001 [ 298.404850] RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffffc03c50e8 [ 298.404851] RBP: ffff9ad4429f2620 R08: 0000000000000001 R09: ffff9ad4429f2000 [ 298.404852] R10: ffff9ad48664e090 R11: 0000000000000000 R12: ffff9ad4e71371d0 [ 298.404852] R13: ffff9ad4e7137000 R14: ffff9ad4e7137168 R15: ffff9ad48710f4c0 [ 298.404854] FS: 00007fc6d9ae4780(0000) GS:ffff9ad576e40000(0000) knlGS:0000000000000000 [ 298.404855] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 298.404857] CR2: 00007fc6c6740000 CR3: 00000001a4eac004 CR4: 00000000003706e0 [ 298.404864] Call Trace: [ 298.404866] vmw_resource_release+0x131/0x1f0 [vmwgfx] [ 298.404878] vmw_context_cotables_unref.isra.0+0x6f/0xa0 [vmwgfx] [ 298.404891] vmw_resource_release+0x16a/0x1f0 [vmwgfx] [ 298.404901] vmw_user_context_base_release+0x31/0x50 [vmwgfx] [ 298.404912] ttm_release_base+0x7f/0xc0 [vmwgfx] [ 298.404922] ttm_ref_object_release+0xde/0xf0 [vmwgfx] [ 298.404931] ttm_ref_object_base_unref+0x8e/0xb0 [vmwgfx] [ 298.404940] ? vmw_dx_context_unbind+0x1e0/0x1e0 [vmwgfx] [ 298.404951] drm_ioctl_kernel+0xaa/0xf0 [drm] [ 298.404974] drm_ioctl+0x20f/0x3a0 [drm] [ 298.404991] ? vmw_dx_context_unbind+0x1e0/0x1e0 [vmwgfx] [ 298.405003] ? selinux_file_ioctl+0x135/0x230 [ 298.405006] ? drm_version+0x90/0x90 [drm] [ 298.405023] vmw_generic_ioctl+0xab/0x150 [vmwgfx] [ 298.405033] __x64_sys_ioctl+0x83/0xb0 [ 298.405035] do_syscall_64+0x33/0x40 [ 298.405038] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 298.405041] RIP: 0033:0x7fc6d9be15db [ 298.405042] Code: 89 d8 49 8d 3c 1c 48 f7 d8 49 39 c4 72 b5 e8 1c ff ff ff 85 c0 78 ba 4c 89 e0 5b 5d 41 5c c3 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6d b8 0c 00 f7 d8 64 89 01 48 [ 298.405044] RSP: 002b:00007ffc8ce6de98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 298.405046] RAX: ffffffffffffffda RBX: 00007ffc8ce6dee0 RCX: 00007fc6d9be15db [ 298.405048] RDX: 00007ffc8ce6dee0 RSI: 0000000040086448 RDI: 0000000000000007 [ 298.405049] RBP: 0000000040086448 R08: 0000000000000000 R09: 0000000000000000 [ 298.405050] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc6d98e0000 [ 298.405051] R13: 0000000000000007 R14: 00007fc6d98ea3b0 R15: 00007fc6c671f800 [ 298.405053] ---[ end trace c628fb3ea8b5aa96 ]---
We are investigating a similar problem with radeon.
So far no idea what's going wrong since it doesn't seem to happen with amdgpu.
If you have an idea please speak up :)
Thanks, Christian. ________________________________ Von: Thomas Hellström (Intel) thomas_os@shipmail.org Gesendet: Donnerstag, 11. März 2021 11:46 An: Daniel Vetter daniel.vetter@ffwll.ch; Koenig, Christian Christian.Koenig@amd.com; linux-graphics-maintainer@vmware.com linux-graphics-maintainer@vmware.com Cc: DRI Development dri-devel@lists.freedesktop.org Betreff: vmwgfx leaking bo pins?
Hi,
I tried latest drm-fixes today and saw a lot of these: Fallout from ttm rework?
/Thomas
[ 298.404788] WARNING: CPU: 1 PID: 3839 at drivers/gpu/drm/ttm/ttm_bo.c:512 ttm_bo_release+0x2b5/0x300 [ttm] [ 298.404795] Modules linked in: nls_utf8 isofs rfcomm tun bridge stp llc ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nfnetlink ip6table_filter ip6_tables iptable_filter cmac bnep vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event intel_rapl_msr snd_ens1371 snd_ac97_codec ac97_bus vmw_balloon intel_rapl_common snd_seq rapl snd_pcm btusb btrtl btbcm btintel bluetooth joydev gameport snd_timer snd_rawmidi snd_seq_device rfkill snd ecdh_generic vmw_vmci ecc soundcore i2c_piix4 auth_rpcgss sunrpc ip_tables vmwgfx drm_kms_helper cec ttm e1000 crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel drm mptspi serio_raw scsi_transport_spi mptscsih mptbase ata_generic pata_acpi uas usb_storage fuse [ 298.404837] CPU: 1 PID: 3839 Comm: thunderbird Tainted: G W 5.12.0-rc2+ #42 [ 298.404839] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019 [ 298.404840] RIP: 0010:ttm_bo_release+0x2b5/0x300 [ttm] [ 298.404845] Code: e8 a0 f3 35 ce e9 da fd ff ff 49 8b 7e 90 b9 30 75 00 00 31 d2 be 01 00 00 00 e8 c6 17 36 ce 49 8b 46 e0 eb 9e 48 89 e8 eb 99 <0f> 0b 41 c7 86 94 00 00 00 00 00 00 00 49 8d 76 08 31 d2 4c 89 ef [ 298.404847] RSP: 0018:ffffb24a43ef3bd0 EFLAGS: 00010202 [ 298.404848] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001 [ 298.404850] RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffffc03c50e8 [ 298.404851] RBP: ffff9ad4429f2620 R08: 0000000000000001 R09: ffff9ad4429f2000 [ 298.404852] R10: ffff9ad48664e090 R11: 0000000000000000 R12: ffff9ad4e71371d0 [ 298.404852] R13: ffff9ad4e7137000 R14: ffff9ad4e7137168 R15: ffff9ad48710f4c0 [ 298.404854] FS: 00007fc6d9ae4780(0000) GS:ffff9ad576e40000(0000) knlGS:0000000000000000 [ 298.404855] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 298.404857] CR2: 00007fc6c6740000 CR3: 00000001a4eac004 CR4: 00000000003706e0 [ 298.404864] Call Trace: [ 298.404866] vmw_resource_release+0x131/0x1f0 [vmwgfx] [ 298.404878] vmw_context_cotables_unref.isra.0+0x6f/0xa0 [vmwgfx] [ 298.404891] vmw_resource_release+0x16a/0x1f0 [vmwgfx] [ 298.404901] vmw_user_context_base_release+0x31/0x50 [vmwgfx] [ 298.404912] ttm_release_base+0x7f/0xc0 [vmwgfx] [ 298.404922] ttm_ref_object_release+0xde/0xf0 [vmwgfx] [ 298.404931] ttm_ref_object_base_unref+0x8e/0xb0 [vmwgfx] [ 298.404940] ? vmw_dx_context_unbind+0x1e0/0x1e0 [vmwgfx] [ 298.404951] drm_ioctl_kernel+0xaa/0xf0 [drm] [ 298.404974] drm_ioctl+0x20f/0x3a0 [drm] [ 298.404991] ? vmw_dx_context_unbind+0x1e0/0x1e0 [vmwgfx] [ 298.405003] ? selinux_file_ioctl+0x135/0x230 [ 298.405006] ? drm_version+0x90/0x90 [drm] [ 298.405023] vmw_generic_ioctl+0xab/0x150 [vmwgfx] [ 298.405033] __x64_sys_ioctl+0x83/0xb0 [ 298.405035] do_syscall_64+0x33/0x40 [ 298.405038] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 298.405041] RIP: 0033:0x7fc6d9be15db [ 298.405042] Code: 89 d8 49 8d 3c 1c 48 f7 d8 49 39 c4 72 b5 e8 1c ff ff ff 85 c0 78 ba 4c 89 e0 5b 5d 41 5c c3 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6d b8 0c 00 f7 d8 64 89 01 48 [ 298.405044] RSP: 002b:00007ffc8ce6de98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 298.405046] RAX: ffffffffffffffda RBX: 00007ffc8ce6dee0 RCX: 00007fc6d9be15db [ 298.405048] RDX: 00007ffc8ce6dee0 RSI: 0000000040086448 RDI: 0000000000000007 [ 298.405049] RBP: 0000000040086448 R08: 0000000000000000 R09: 0000000000000000 [ 298.405050] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc6d98e0000 [ 298.405051] R13: 0000000000000007 R14: 00007fc6d98ea3b0 R15: 00007fc6c671f800 [ 298.405053] ---[ end trace c628fb3ea8b5aa96 ]---
On 3/11/21 12:32 PM, Koenig, Christian wrote:
We are investigating a similar problem with radeon.
So far no idea what's going wrong since it doesn't seem to happen with amdgpu.
If you have an idea please speak up :)
Sure. No idea ATM. Was just fiddling a bit with vmwgfx to experiment with the fix for the huge page-table-entry issue.
/Thomas
Thanks, Christian.
*Von:* Thomas Hellström (Intel) thomas_os@shipmail.org *Gesendet:* Donnerstag, 11. März 2021 11:46 *An:* Daniel Vetter daniel.vetter@ffwll.ch; Koenig, Christian Christian.Koenig@amd.com; linux-graphics-maintainer@vmware.com linux-graphics-maintainer@vmware.com *Cc:* DRI Development dri-devel@lists.freedesktop.org *Betreff:* vmwgfx leaking bo pins? Hi,
I tried latest drm-fixes today and saw a lot of these: Fallout from ttm rework?
/Thomas
[ 298.404788] WARNING: CPU: 1 PID: 3839 at drivers/gpu/drm/ttm/ttm_bo.c:512 ttm_bo_release+0x2b5/0x300 [ttm] [ 298.404795] Modules linked in: nls_utf8 isofs rfcomm tun bridge stp llc ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nfnetlink ip6table_filter ip6_tables iptable_filter cmac bnep vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event intel_rapl_msr snd_ens1371 snd_ac97_codec ac97_bus vmw_balloon intel_rapl_common snd_seq rapl snd_pcm btusb btrtl btbcm btintel bluetooth joydev gameport snd_timer snd_rawmidi snd_seq_device rfkill snd ecdh_generic vmw_vmci ecc soundcore i2c_piix4 auth_rpcgss sunrpc ip_tables vmwgfx drm_kms_helper cec ttm e1000 crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel drm mptspi serio_raw scsi_transport_spi mptscsih mptbase ata_generic pata_acpi uas usb_storage fuse [ 298.404837] CPU: 1 PID: 3839 Comm: thunderbird Tainted: G W 5.12.0-rc2+ #42 [ 298.404839] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/29/2019 [ 298.404840] RIP: 0010:ttm_bo_release+0x2b5/0x300 [ttm] [ 298.404845] Code: e8 a0 f3 35 ce e9 da fd ff ff 49 8b 7e 90 b9 30 75 00 00 31 d2 be 01 00 00 00 e8 c6 17 36 ce 49 8b 46 e0 eb 9e 48 89 e8 eb 99 <0f> 0b 41 c7 86 94 00 00 00 00 00 00 00 49 8d 76 08 31 d2 4c 89 ef [ 298.404847] RSP: 0018:ffffb24a43ef3bd0 EFLAGS: 00010202 [ 298.404848] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001 [ 298.404850] RDX: 0000000000000001 RSI: 0000000000000246 RDI: ffffffffc03c50e8 [ 298.404851] RBP: ffff9ad4429f2620 R08: 0000000000000001 R09: ffff9ad4429f2000 [ 298.404852] R10: ffff9ad48664e090 R11: 0000000000000000 R12: ffff9ad4e71371d0 [ 298.404852] R13: ffff9ad4e7137000 R14: ffff9ad4e7137168 R15: ffff9ad48710f4c0 [ 298.404854] FS: 00007fc6d9ae4780(0000) GS:ffff9ad576e40000(0000) knlGS:0000000000000000 [ 298.404855] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 298.404857] CR2: 00007fc6c6740000 CR3: 00000001a4eac004 CR4: 00000000003706e0 [ 298.404864] Call Trace: [ 298.404866] vmw_resource_release+0x131/0x1f0 [vmwgfx] [ 298.404878] vmw_context_cotables_unref.isra.0+0x6f/0xa0 [vmwgfx] [ 298.404891] vmw_resource_release+0x16a/0x1f0 [vmwgfx] [ 298.404901] vmw_user_context_base_release+0x31/0x50 [vmwgfx] [ 298.404912] ttm_release_base+0x7f/0xc0 [vmwgfx] [ 298.404922] ttm_ref_object_release+0xde/0xf0 [vmwgfx] [ 298.404931] ttm_ref_object_base_unref+0x8e/0xb0 [vmwgfx] [ 298.404940] ? vmw_dx_context_unbind+0x1e0/0x1e0 [vmwgfx] [ 298.404951] drm_ioctl_kernel+0xaa/0xf0 [drm] [ 298.404974] drm_ioctl+0x20f/0x3a0 [drm] [ 298.404991] ? vmw_dx_context_unbind+0x1e0/0x1e0 [vmwgfx] [ 298.405003] ? selinux_file_ioctl+0x135/0x230 [ 298.405006] ? drm_version+0x90/0x90 [drm] [ 298.405023] vmw_generic_ioctl+0xab/0x150 [vmwgfx] [ 298.405033] __x64_sys_ioctl+0x83/0xb0 [ 298.405035] do_syscall_64+0x33/0x40 [ 298.405038] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 298.405041] RIP: 0033:0x7fc6d9be15db [ 298.405042] Code: 89 d8 49 8d 3c 1c 48 f7 d8 49 39 c4 72 b5 e8 1c ff ff ff 85 c0 78 ba 4c 89 e0 5b 5d 41 5c c3 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6d b8 0c 00 f7 d8 64 89 01 48 [ 298.405044] RSP: 002b:00007ffc8ce6de98 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 298.405046] RAX: ffffffffffffffda RBX: 00007ffc8ce6dee0 RCX: 00007fc6d9be15db [ 298.405048] RDX: 00007ffc8ce6dee0 RSI: 0000000040086448 RDI: 0000000000000007 [ 298.405049] RBP: 0000000040086448 R08: 0000000000000000 R09: 0000000000000000 [ 298.405050] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fc6d98e0000 [ 298.405051] R13: 0000000000000007 R14: 00007fc6d98ea3b0 R15: 00007fc6c671f800 [ 298.405053] ---[ end trace c628fb3ea8b5aa96 ]---
On Mar 11, 2021, at 05:46, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi,
I tried latest drm-fixes today and saw a lot of these: Fallout from ttm rework?
Yes, I fixed this in d1a73c641afd2617bd80bce8b71a096fc5b74b7e it was in drm-misc-next in the drm-misc tree for a while but hasn’t been merged for 5.12.
z
Hi, Zack
On 3/11/21 10:07 PM, Zack Rusin wrote:
On Mar 11, 2021, at 05:46, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi,
I tried latest drm-fixes today and saw a lot of these: Fallout from ttm rework?
Yes, I fixed this in d1a73c641afd2617bd80bce8b71a096fc5b74b7e it was in drm-misc-next in the drm-misc tree for a while but hasn’t been merged for 5.12.
z
Hmm, yes but doesn't that fix trip the ttm_bo_unpin() dma_resv_assert_held(bo->base.resv)?
Taking the reservation to unpin at TTM bo free has always been awkward and that's why vmwgfx and I guess other TTM drivers have been sloppy doing that as TTM never cared. Perhaps TTM could change the pin_count to an atomic and allow unlocked unpinning? still requiring the reservation lock for pin_count transition 0->1, though.
Also, pinning at bo creation in vmwgfx has been to do the equivalent of ttm_bo_init_reserved() (which api was added later). Creating pinned would make the object isolated and allowing the reserve trylock that followed to always succeed. With the introduction of the TTM pin_count, it seems ttm_bo_init_reserved() is used to enable pinned creation which is used to emulate ttm_bo_init_reserved() :)
/Thomas
On Mar 11, 2021, at 17:35, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi, Zack
On 3/11/21 10:07 PM, Zack Rusin wrote:
On Mar 11, 2021, at 05:46, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi,
I tried latest drm-fixes today and saw a lot of these: Fallout from ttm rework?
Yes, I fixed this in d1a73c641afd2617bd80bce8b71a096fc5b74b7e it was in drm-misc-next in the drm-misc tree for a while but hasn’t been merged for 5.12.
z
Hmm, yes but doesn't that fix trip the ttm_bo_unpin() dma_resv_assert_held(bo->base.resv)?
No, doesn’t seem to. TBH I’m not sure why myself, but it seems to be working fine.
Taking the reservation to unpin at TTM bo free has always been awkward and that's why vmwgfx and I guess other TTM drivers have been sloppy doing that as TTM never cared. Perhaps TTM could change the pin_count to an atomic and allow unlocked unpinning? still requiring the reservation lock for pin_count transition 0->1, though.
Yea, that’d probably make sense. I think in general just making sure the requirements are consistent and well documented would be great.
Also, pinning at bo creation in vmwgfx has been to do the equivalent of ttm_bo_init_reserved() (which api was added later). Creating pinned would make the object isolated and allowing the reserve trylock that followed to always succeed. With the introduction of the TTM pin_count, it seems ttm_bo_init_reserved() is used to enable pinned creation which is used to emulate ttm_bo_init_reserved() :)
Yea, we should probably port the vmwgfx code to ttm_bo_init_reserved just to be match the newly established semantics. z
Am 12.03.21 um 00:02 schrieb Zack Rusin:
On Mar 11, 2021, at 17:35, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi, Zack
On 3/11/21 10:07 PM, Zack Rusin wrote:
On Mar 11, 2021, at 05:46, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi,
I tried latest drm-fixes today and saw a lot of these: Fallout from ttm rework?
Yes, I fixed this in d1a73c641afd2617bd80bce8b71a096fc5b74b7e it was in drm-misc-next in the drm-misc tree for a while but hasn’t been merged for 5.12.
Mhm, crap. I hoped that you would have the same issue as radeon somehow and could help with debugging.
Please make sure the patch is pushed to drm-misc-fixes so that it ends up fixed in 5.12.
z
Hmm, yes but doesn't that fix trip the ttm_bo_unpin() dma_resv_assert_held(bo->base.resv)?
No, doesn’t seem to. TBH I’m not sure why myself, but it seems to be working fine.
Taking the reservation to unpin at TTM bo free has always been awkward and that's why vmwgfx and I guess other TTM drivers have been sloppy doing that as TTM never cared. Perhaps TTM could change the pin_count to an atomic and allow unlocked unpinning? still requiring the reservation lock for pin_count transition 0->1, though.
Yea, that’d probably make sense. I think in general just making sure the requirements are consistent and well documented would be great.
Also, pinning at bo creation in vmwgfx has been to do the equivalent of ttm_bo_init_reserved() (which api was added later). Creating pinned would make the object isolated and allowing the reserve trylock that followed to always succeed. With the introduction of the TTM pin_count, it seems ttm_bo_init_reserved() is used to enable pinned creation which is used to emulate ttm_bo_init_reserved() :)
Yea, we should probably port the vmwgfx code to ttm_bo_init_reserved just to be match the newly established semantics.
Yeah, I stumbled over that at well during one of the cleanups and considered changing the implementation. But then it got lost in all the rework.
Christian.
z
On 3/12/21 12:02 AM, Zack Rusin wrote:
On Mar 11, 2021, at 17:35, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi, Zack
On 3/11/21 10:07 PM, Zack Rusin wrote:
On Mar 11, 2021, at 05:46, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi,
I tried latest drm-fixes today and saw a lot of these: Fallout from ttm rework?
Yes, I fixed this in d1a73c641afd2617bd80bce8b71a096fc5b74b7e it was in drm-misc-next in the drm-misc tree for a while but hasn’t been merged for 5.12.
z
Hmm, yes but doesn't that fix trip the ttm_bo_unpin() dma_resv_assert_held(bo->base.resv)?
No, doesn’t seem to. TBH I’m not sure why myself, but it seems to be working fine.
With CONFIG_PROVE_LOCKING=y I see this:
[ 7.117145] [drm] FIFO at 0x00000000fe000000 size is 8192 kiB [ 7.117284] [drm] VRAM at 0x00000000e8000000 size is 131072 kiB [ 7.117291] INFO: trying to register non-static key. [ 7.117295] the code is fine but needs lockdep annotation. [ 7.117298] turning off the locking correctness validator
Which will probably mask that dma_resv_assert_held(bo->base.resv)
/Thomas
On 3/12/21 5:06 AM, Thomas Hellström (Intel) wrote:
On 3/12/21 12:02 AM, Zack Rusin wrote:
On Mar 11, 2021, at 17:35, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi, Zack
On 3/11/21 10:07 PM, Zack Rusin wrote:
On Mar 11, 2021, at 05:46, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi,
I tried latest drm-fixes today and saw a lot of these: Fallout from ttm rework?
Yes, I fixed this in d1a73c641afd2617bd80bce8b71a096fc5b74b7e it was in drm-misc-next in the drm-misc tree for a while but hasn’t been merged for 5.12.
z
Hmm, yes but doesn't that fix trip the ttm_bo_unpin() dma_resv_assert_held(bo->base.resv)?
No, doesn’t seem to. TBH I’m not sure why myself, but it seems to be working fine.
With CONFIG_PROVE_LOCKING=y I see this:
[ 7.117145] [drm] FIFO at 0x00000000fe000000 size is 8192 kiB [ 7.117284] [drm] VRAM at 0x00000000e8000000 size is 131072 kiB [ 7.117291] INFO: trying to register non-static key. [ 7.117295] the code is fine but needs lockdep annotation. [ 7.117298] turning off the locking correctness validator
Which will probably mask that dma_resv_assert_held(bo->base.resv)
Ah, yes, you're right. After fixing that I do see the dma_resv_assert_held triggered. Technically trivially fixable with ttm_bo_reserve/ttm_bo_unreserve around ttm_bo_unpin but it's a little ugly that some places e.g. vmw_bo_unreference will require ttm_bo_reserve/ttm_bo_unreserve around ttm_bo_unpin but some won't e.g. vmw_mob_destroy won't because its lock is held by ttm_bo_delayed_delete without a very clear indication within the function which is which.
z
On Mon, Mar 15, 2021 at 6:57 PM Zack Rusin zackr@vmware.com wrote:
On 3/12/21 5:06 AM, Thomas Hellström (Intel) wrote:
On 3/12/21 12:02 AM, Zack Rusin wrote:
On Mar 11, 2021, at 17:35, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi, Zack
On 3/11/21 10:07 PM, Zack Rusin wrote:
On Mar 11, 2021, at 05:46, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi,
I tried latest drm-fixes today and saw a lot of these: Fallout from ttm rework?
Yes, I fixed this in d1a73c641afd2617bd80bce8b71a096fc5b74b7e it was in drm-misc-next in the drm-misc tree for a while but hasn’t been merged for 5.12.
z
Hmm, yes but doesn't that fix trip the ttm_bo_unpin() dma_resv_assert_held(bo->base.resv)?
No, doesn’t seem to. TBH I’m not sure why myself, but it seems to be working fine.
With CONFIG_PROVE_LOCKING=y I see this:
[ 7.117145] [drm] FIFO at 0x00000000fe000000 size is 8192 kiB [ 7.117284] [drm] VRAM at 0x00000000e8000000 size is 131072 kiB [ 7.117291] INFO: trying to register non-static key. [ 7.117295] the code is fine but needs lockdep annotation. [ 7.117298] turning off the locking correctness validator
Which will probably mask that dma_resv_assert_held(bo->base.resv)
Ah, yes, you're right. After fixing that I do see the dma_resv_assert_held triggered. Technically trivially fixable with ttm_bo_reserve/ttm_bo_unreserve around ttm_bo_unpin but it's a little ugly that some places e.g. vmw_bo_unreference will require ttm_bo_reserve/ttm_bo_unreserve around ttm_bo_unpin but some won't e.g. vmw_mob_destroy won't because its lock is held by ttm_bo_delayed_delete without a very clear indication within the function which is which.
Not sure it applies here, but if the refcount is down to 0 we know we're in destruction code and can't race with anything anymore, so maybe we can lift the debug check.
Otoh I think at that point we might still be on lru lists, so the rules become rather tricky whether it's really always legal to skip the dma_resv_lock. But we could perhaps figure out something if it's too annoying to have a consistent calling context in drivers.
I'm not a huge fan of dropping the requirement from unpin and switching to atomic_t for the pin count, since pin/unpin is an extremely slow path, adding complexity in how we protect stuff for a function that's maybe called 60/s (for page flipping we pin/unpin) doesn't strike me as the right balance. And atomic commit helpers are explicitly designed to allow drivers to grab dma_resv_lock in their ->cleanup_fb hook, since that part is _not_ in the dma_fence signalling critical path for the atomic flip. But if all else fails I guess that's an option too. -Daniel
On 3/15/21 9:38 PM, Daniel Vetter wrote:
On Mon, Mar 15, 2021 at 6:57 PM Zack Rusin zackr@vmware.com wrote:
On 3/12/21 5:06 AM, Thomas Hellström (Intel) wrote:
On 3/12/21 12:02 AM, Zack Rusin wrote:
On Mar 11, 2021, at 17:35, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi, Zack
On 3/11/21 10:07 PM, Zack Rusin wrote:
> On Mar 11, 2021, at 05:46, Thomas Hellström (Intel) > thomas_os@shipmail.org wrote: > > Hi, > > I tried latest drm-fixes today and saw a lot of these: Fallout from > ttm rework? Yes, I fixed this in d1a73c641afd2617bd80bce8b71a096fc5b74b7e it was in drm-misc-next in the drm-misc tree for a while but hasn’t been merged for 5.12.
z
Hmm, yes but doesn't that fix trip the ttm_bo_unpin() dma_resv_assert_held(bo->base.resv)?
No, doesn’t seem to. TBH I’m not sure why myself, but it seems to be working fine.
With CONFIG_PROVE_LOCKING=y I see this:
[ 7.117145] [drm] FIFO at 0x00000000fe000000 size is 8192 kiB [ 7.117284] [drm] VRAM at 0x00000000e8000000 size is 131072 kiB [ 7.117291] INFO: trying to register non-static key. [ 7.117295] the code is fine but needs lockdep annotation. [ 7.117298] turning off the locking correctness validator
Which will probably mask that dma_resv_assert_held(bo->base.resv)
Ah, yes, you're right. After fixing that I do see the dma_resv_assert_held triggered. Technically trivially fixable with ttm_bo_reserve/ttm_bo_unreserve around ttm_bo_unpin but it's a little ugly that some places e.g. vmw_bo_unreference will require ttm_bo_reserve/ttm_bo_unreserve around ttm_bo_unpin but some won't e.g. vmw_mob_destroy won't because its lock is held by ttm_bo_delayed_delete without a very clear indication within the function which is which.
It looks like, like Daniel hints below, for the mob pagetable bos since they are pinned and hence not on a LRU list, the parent bo is holding the only reference, which is utilized in vmw_mob_unbind() to make sure the tryreserve always succeeds. (unpin could be called in vmw_mob_unbind for the pagetable bo just after fencing if necessary), and similarly for the other vmwgfx_mob error paths, but in that case one should probably keep the bo pointers in stack variables until you know you don't have to unpin. Then it should be ok to tryreserve around unpinning in the error paths.
But it's constructs like that, that really makes me think we shouldn't need to reserve to unpin.
Not sure it applies here, but if the refcount is down to 0 we know we're in destruction code and can't race with anything anymore, so maybe we can lift the debug check.
Otoh I think at that point we might still be on lru lists, so the rules become rather tricky whether it's really always legal to skip the dma_resv_lock. But we could perhaps figure out something if it's too annoying to have a consistent calling context in drivers.
I'm not a huge fan of dropping the requirement from unpin and switching to atomic_t for the pin count, since pin/unpin is an extremely slow path, adding complexity in how we protect stuff for a function that's maybe called 60/s (for page flipping we pin/unpin) doesn't strike me as the right balance.
*If* we can protect the bo LRU state with the lru lock instead of with reservation it shuld probably only be a matter of extending the lru lock critical section over a couple of assignments. If we change the bo lru state we'd need to grab the lru lock sooner or later anyway, so I think the added complexity should be minimal.
/Thomas
On 3/15/21 6:35 PM, Thomas Hellström (Intel) wrote:
On 3/15/21 9:38 PM, Daniel Vetter wrote:
On Mon, Mar 15, 2021 at 6:57 PM Zack Rusin zackr@vmware.com wrote:
On 3/12/21 5:06 AM, Thomas Hellström (Intel) wrote:
On 3/12/21 12:02 AM, Zack Rusin wrote:
On Mar 11, 2021, at 17:35, Thomas Hellström (Intel) thomas_os@shipmail.org wrote:
Hi, Zack
On 3/11/21 10:07 PM, Zack Rusin wrote: >> On Mar 11, 2021, at 05:46, Thomas Hellström (Intel) >> thomas_os@shipmail.org wrote: >> >> Hi, >> >> I tried latest drm-fixes today and saw a lot of these: Fallout from >> ttm rework? > Yes, I fixed this in d1a73c641afd2617bd80bce8b71a096fc5b74b7e it was > in drm-misc-next in the drm-misc tree for a while but hasn’t been > merged for 5.12. > > z > Hmm, yes but doesn't that fix trip the ttm_bo_unpin() dma_resv_assert_held(bo->base.resv)?
No, doesn’t seem to. TBH I’m not sure why myself, but it seems to be working fine.
With CONFIG_PROVE_LOCKING=y I see this:
[ 7.117145] [drm] FIFO at 0x00000000fe000000 size is 8192 kiB [ 7.117284] [drm] VRAM at 0x00000000e8000000 size is 131072 kiB [ 7.117291] INFO: trying to register non-static key. [ 7.117295] the code is fine but needs lockdep annotation. [ 7.117298] turning off the locking correctness validator
Which will probably mask that dma_resv_assert_held(bo->base.resv)
Ah, yes, you're right. After fixing that I do see the dma_resv_assert_held triggered. Technically trivially fixable with ttm_bo_reserve/ttm_bo_unreserve around ttm_bo_unpin but it's a little ugly that some places e.g. vmw_bo_unreference will require ttm_bo_reserve/ttm_bo_unreserve around ttm_bo_unpin but some won't e.g. vmw_mob_destroy won't because its lock is held by ttm_bo_delayed_delete without a very clear indication within the function which is which.
It looks like, like Daniel hints below, for the mob pagetable bos since they are pinned and hence not on a LRU list, the parent bo is holding the only reference, which is utilized in vmw_mob_unbind() to make sure the tryreserve always succeeds. (unpin could be called in vmw_mob_unbind for the pagetable bo just after fencing if necessary),
That's a little tricky because then we'd have to pin on bind, otherwise after moves, which unbind, we wouldn't be pinned anymore. Plus bind would have to check if the bo is already pinned (i.e. it's the first time bind is called on it) since we pin on creation. Or just stop pinning on creation and do it explicitly in bind/unbind.
In general we probably should make pinning explicit in vmwgfx like in the other drivers because, as is, we sometimes have to treat pin_count as a boolean (e.g. in vmw_bo_pin_reserved) because often times the object has already been pinned during creation. This will break if we'll have drm utilities use pin/unpin.
That's a problem of our own making though, the issue of whether or not the bo has already been reserved is a little more muddy because having multiple nested pin/unpin sites (as long as they're consistent in their matching pin/unpin usage) isn't a problem, but having nested reserved calls is a problem. Although this might be a case of an old man yelling at the sun because I'm too lazy to remember whether or not each callback is already called reserved and instead should just add dma_resv_assert_held(bo->resv) in more places to clarify which is which or like you mention use tryreserve everywhere.
But it's constructs like that, that really makes me think we shouldn't need to reserve to unpin.
Yea, that would be pretty convenient.
z
dri-devel@lists.freedesktop.org