https://bugzilla.kernel.org/show_bug.cgi?id=90741
Bug ID: 90741 Summary: Radeon: System pauses on TAHITI Product: Drivers Version: 2.5 Kernel Version: 3.18.0 3.18.1 Hardware: x86-64 OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: Video(DRI - non Intel) Assignee: drivers_video-dri@kernel-bugs.osdl.org Reporter: wielkiegie@gmail.com Regression: No
The system freezes for 1-3 seconds from time to time. No message is shown in dmesg. Some of the programs report high display delays in their profiles (maybe glXSwapBuffers?). The pause is system-wide, probably due to whole-OpenGL stack (glamor).
This seems to be a 3.18.x regression. The problem does not occur on 3.17.4.
During normal usage the problem is very rare. However, it occurs very often while playing OpenGL games.
For games like Minecraft, the pauses are less likely while not moving around. I believe that the game does not upload many buffers then. So the issue may be due to buffer management.
I am using mesa from git, libdrm from git, xserver 1.16.2.901, xf86-video-ati 7.5. The problem occurs both with compositing enabled and disabled (kwin).
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #1 from Gustaw Smolarczyk wielkiegie@gmail.com --- I have managed to bisect this regression:
f2c24b83ae90292d315aa7ac029c6ce7929e01aa is the first bad commit commit f2c24b83ae90292d315aa7ac029c6ce7929e01aa Author: Maarten Lankhorst maarten.lankhorst@canonical.com Date: Wed Apr 2 17:14:48 2014 +0200
drm/ttm: flip the switch, and convert to dma_fence
Signed-off-by: Maarten Lankhorst maarten.lankhorst@canonical.com
:040000 040000 634b9d842856f9a253b8250b6b61423a8b19bf7d 82569dbc605b725ecefac7457b25599158af5fb6 M drivers :040000 040000 6ff7fd43b2671ac388b81ec3c9ab47f4a2d39162 0c9d3986f193c83b952b05e7ec1f1c777a44e88e M include
I am not familiar with the kernel code so I cannot say whether it's a real regression or is just a problem with a userspace.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #2 from Michel Dänzer michel@daenzer.net --- Does it still happen with a 3.19-rc kernel?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
Maarten Lankhorst bugs@mblankhorst.nl changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |bugs@mblankhorst.nl
--- Comment #3 from Maarten Lankhorst bugs@mblankhorst.nl --- Ok so that commit introduces the bad behavior, does commit 39e7f6f84b3a3aa4520504473f2e2bac1f949ffa "drm/radeon: use rcu waits in some ioctls" or commit "drm/ttm: use rcu in core ttm" change the bad behavior in any way?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #4 from Gustaw Smolarczyk wielkiegie@gmail.com --- There seems to be no substantial difference in 3.19-rc2, 39e7f6f84b3a3aa4520504473f2e2bac1f949ffa or 472db7ab3093bf2a2999f6b5aa64a030466d6f92.
The only thing (that may be subjective) is that there seems to be a bit less pauses in final 3.18 and later. (did not bisect that, may be totally subjective)
Also of importance can be that I am using 2 GPUs (discrete AMD TAHITI and integrated Intel gen7 (IVB)) in randr 1.4 offloading configuration:
xrandr --setprovideroutputsource Intel radeon xrandr --output HDMI2 --mode 1280x1024 --rate 60 --right-of DVI-0
Where: DVI-0 is my main monitor and is connected to TAHITI HDMI2 is my secondary monitor and is connected to IVB The OpenGL window is on the main monitor
If that may be the source of these problems, I could disable the second monitor along with the iGPU in order to test that hypothesis.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #5 from Michel Dänzer michel@daenzer.net --- (In reply to Gustaw Smolarczyk from comment #4)
If that may be the source of these problems, I could disable the second monitor along with the iGPU in order to test that hypothesis.
That would be interesting.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #6 from Maarten Lankhorst bugs@mblankhorst.nl --- I think that right now i915 doesn't use the fence stuff yet so it's probably not causing the problem, still it would be useful to check.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #7 from Gustaw Smolarczyk wielkiegie@gmail.com --- Unfortunately, it was a false lead. Disabling iGPU didn't help in any way on 3.18.1.
Additionally, I take back the "less pauses in 3.18.0 and later" thing. 3.18.1 was horrible just now (same as any kernel after the commit bisected).
Is there any way I can help investigating it more?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
Christian König deathsimple@vodafone.de changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |deathsimple@vodafone.de
--- Comment #8 from Christian König deathsimple@vodafone.de --- Just a shot into the dark, but can you please post the content of /proc/interrupts.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #9 from Gustaw Smolarczyk wielkiegie@gmail.com --- Created attachment 162601 --> https://bugzilla.kernel.org/attachment.cgi?id=162601&action=edit /proc/interrupts on 3.17.4
I have attached the contents of /proc/interrupts on 3.17.4. Should I attach one with a faulty kernel too?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #10 from Christian König deathsimple@vodafone.de --- (In reply to Gustaw Smolarczyk from comment #9)
Created attachment 162601 [details] /proc/interrupts on 3.17.4
I have attached the contents of /proc/interrupts on 3.17.4. Should I attach one with a faulty kernel too?
No, all I wanted to check is that interrupts work as expected. And that indeed seems to be the case here:
26: 40191 12200 8939 6450 22186 15970 9307 6821 PCI-MSI-edge radeon
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #11 from Maarten Lankhorst bugs@mblankhorst.nl --- Created attachment 163871 --> https://bugzilla.kernel.org/attachment.cgi?id=163871&action=edit Restore old fence wait behavior
Ok the only difference I can see is the fence wait function being different in that commit. The new one should be correct, so lets see if it that's what causes it, or if it's the introduction of timeouts in ttm.
Does this hack change things?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #12 from Gustaw Smolarczyk wielkiegie@gmail.com --- This hack, when applied on top of 3.19-rc5, seems to fix the problem.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
Maarten Lankhorst bugs@mblankhorst.nl changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #163871|0 |1 is obsolete| |
--- Comment #13 from Maarten Lankhorst bugs@mblankhorst.nl --- Created attachment 164121 --> https://bugzilla.kernel.org/attachment.cgi?id=164121&action=edit Use exclusive wait to put the waiter at the end of the waitqueue
Ok seems I made a thinko. wait_event_interruptible_timeout and wait_event_timeout add items to the front of the queue instead of the back, this may result in fences never signalling.
Can you test the attached patch?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
Maarten Lankhorst bugs@mblankhorst.nl changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #164121|0 |1 is obsolete| |
--- Comment #14 from Maarten Lankhorst bugs@mblankhorst.nl --- Created attachment 164131 --> https://bugzilla.kernel.org/attachment.cgi?id=164131&action=edit Working patch
https://bugzilla.kernel.org/show_bug.cgi?id=90741
Maarten Lankhorst bugs@mblankhorst.nl changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #164131|0 |1 is obsolete| |
--- Comment #15 from Maarten Lankhorst bugs@mblankhorst.nl --- Created attachment 164141 --> https://bugzilla.kernel.org/attachment.cgi?id=164141&action=edit Final version.
Can you test if this thing fixes it? Forget the other patch, this one should fix the bug.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #16 from Gustaw Smolarczyk wielkiegie@gmail.com --- Unfortunately, the last patch doesn't seem to help in any way. Should I test the other two too?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #17 from Maarten Lankhorst bugs@mblankhorst.nl --- Created attachment 164421 --> https://bugzilla.kernel.org/attachment.cgi?id=164421&action=edit Another approach
Don't need to test anything but final version, can you test this version though? I want to see if the bug's in the wait code or something with the interrupt handling..
If this version fails too, can you uncomment the sw_irq_get and sw_irq_put calls added in the patch?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #18 from Gustaw Smolarczyk wielkiegie@gmail.com --- Only the second version (with the 2 lines uncommented) of the new patch does fix the problem.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #19 from Maarten Lankhorst bugs@mblankhorst.nl --- Created attachment 164511 --> https://bugzilla.kernel.org/attachment.cgi?id=164511&action=edit Clear irqs before processing.. Patch on top of 'another approach'
Can you try with the irq_put/get lines commented, and this patch applied? Lets see if clearing irqs before processing helps..
Also there is a line in radeon_fence.c with "radeon_irq_kms_sw_irq_get_delayed", can you add a WARN_ON_ONCE above that line after the else? To check if that path is taken...
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #20 from Gustaw Smolarczyk wielkiegie@gmail.com --- I will try that on Sunday since I have no access to the radeonsi hardware at the moment.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
Jon Arne Jørgensen jonjon.arnearne@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |jonjon.arnearne@gmail.com
--- Comment #21 from Jon Arne Jørgensen jonjon.arnearne@gmail.com --- Hi, I'm the reporter of bug 90861, which migth be related to this bug.
I'm now trying to compile a kernel with the patch from Maarten (attachment 164421) to see if it changes anything, but I can't get it to compile:
[...snip...] Building modules, stage 2. MODPOST 238 modules OBJCOPY arch/x86/boot/compressed/vmlinux.bin GZIP arch/x86/boot/compressed/vmlinux.bin.gz ERROR: "__tracepoint_fence_enable_signal" [drivers/gpu/drm/radeon/radeon.ko] undefined! make[2]: *** [__modpost] Error 1 make[1]: *** [modules] Error 2 make[1]: *** Waiting for unfinished jobs.... [...snip...]
I've tried to enable everythoing I could find related to tracepoint in the kernel config, but with no success.
Any pointers?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #22 from Gustaw Smolarczyk wielkiegie@gmail.com --- @Maarten Lankhorst
The "Clear irqs..." patch doesn't seem to help.
Also, the WARN_ON_ONCE() doesn't seem to be hit (I have added 'WARN_ON_ONCE(1);' in the else block). The dmesg doesn't have anything related to it. Am I right that the WARN_ON_ONCE line should appear in dmesg in case it is reached?
@Jon Arne Jørgensen
About patch compilation: I have applied it on top of 3.19-rc5. Are you by any chance using 3.18?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #23 from Jon Arne Jørgensen jonjon.arnearne@gmail.com --- @Gustaw:
~/Development/c/kernel/src% git describe --long --tags v3.19-rc5-0-gec6f34e
So, that should be 3.19-rc5 :)
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #24 from Jon Arne Jørgensen jonjon.arnearne@gmail.com --- I just commented out the trace_fence_enable_signal call to get it to compile. Figured I don't need the tracepoint debug to test the patch.
I also uncommented the sw_irq_get and sw_irq_put calls, and can report that the patch seems to fix bug 90861.
I guess this is proof that the bug is related.
I didn't try the patch with sw_irq_get/set disabled. Just ask, and I will give it a try.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #25 from Maarten Lankhorst bugs@mblankhorst.nl --- Oke, can you test without the calls to sw_irq_get/put?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #26 from Jon Arne Jørgensen jonjon.arnearne@gmail.com --- I've tested, and can confirm that they are needed for the patch to fix the crash.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #27 from Maarten Lankhorst bugs@mblankhorst.nl --- Created attachment 165071 --> https://bugzilla.kernel.org/attachment.cgi?id=165071&action=edit Print refcounts on sw irq, panic on disabled.
Can you test this patch on top of 'another approach', keep the sw_irq_get/put commented? Also enable CONFIG_FENCE_TRACE please.
And then print the entire dmesg..
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #28 from Gustaw Smolarczyk wielkiegie@gmail.com --- Created attachment 165091 --> https://bugzilla.kernel.org/attachment.cgi?id=165091&action=edit dmesg after "Print refcounts..." patch
I have applied this patch on top of original "another approach" one.
I think the pauses are now shorter, but more frequent.
The contents of dmesg has been attached. I have taken it from /var/log/messages, since the in-kernel dmesg buffer was overwritten many times over. I truncated the messages a bit, since I stopped testing a game at ~80s of uptime.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #29 from Maarten Lankhorst bugs@mblankhorst.nl --- Thanks, that rules out 1 possibility. There is probably no bug in refcounting irqs. :P
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #30 from Maarten Lankhorst bugs@mblankhorst.nl --- Can you empty the contents of the radeon_fence_schedule_check function in radeon_fence.c?
It should make the pauses worse, giving me a chance to find them in the log. :-)
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #31 from Gustaw Smolarczyk wielkiegie@gmail.com --- Do you mean on top of the latest patch?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #32 from Maarten Lankhorst bugs@mblankhorst.nl --- You can keep print refcounts or remove it, it doesn't matter much either way. I just want to have at least 'another approach' applied.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #33 from Gustaw Smolarczyk wielkiegie@gmail.com --- Created attachment 165161 --> https://bugzilla.kernel.org/attachment.cgi?id=165161&action=edit dmesg after disabling radeon_fence_schedule_check()
I chose to leave the refcount printing patch as it is.
The pauses now are much much longer (~10-20s).
The 2h time difference in the messages log is due to local time not being set correctly and ntp-client daemon fixing it after some time.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #34 from Jon Arne Jørgensen jonjon.arnearne@gmail.com --- Created attachment 165901 --> https://bugzilla.kernel.org/attachment.cgi?id=165901&action=edit dmesg with refcount on sw irq
Hi, Maarten I finaly got around to append the patch you posted in comment 27. Not sure if you still need this.
I'm also not sure if I should stick with my original bug report (bug 90861) to not clutter this one?
PS. I had to compile the radeon driver into the kernel to get it to compile. If I compile the radeon driver as module, the compile will break with this error: ERROR: "__tracepoint_fence_enable_signal" [drivers/gpu/drm/radeon/radeon.ko] undefined!
But I guess that is another bug report :)
https://bugzilla.kernel.org/show_bug.cgi?id=90741
Jon Arne Jørgensen jonjon.arnearne@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #165901|0 |1 is obsolete| |
--- Comment #35 from Jon Arne Jørgensen jonjon.arnearne@gmail.com --- Created attachment 165911 --> https://bugzilla.kernel.org/attachment.cgi?id=165911&action=edit dmesg with refcount on sw irq
Hm, seems that auto-detect mime failed with the last attachment?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #36 from Maarten Lankhorst bugs@mblankhorst.nl --- Ugh.. can you test "another approach" again, but change the list_add to list_add_tail ? Looks like I messed it up there..
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #37 from Maarten Lankhorst bugs@mblankhorst.nl --- Never mind, that won't matter much..
https://bugzilla.kernel.org/show_bug.cgi?id=90741
Maarten Lankhorst bugs@mblankhorst.nl changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #164141|0 |1 is obsolete| | Attachment #164421|0 |1 is obsolete| |
--- Comment #38 from Maarten Lankhorst bugs@mblankhorst.nl --- Created attachment 166571 --> https://bugzilla.kernel.org/attachment.cgi?id=166571&action=edit Use fence_add/remove_callback
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #39 from Gustaw Smolarczyk wielkiegie@gmail.com --- Should I apply it on top of 3.19.0? Or is it an extension of some previous patch?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #40 from Maarten Lankhorst bugs@mblankhorst.nl --- It replaces the previous attempts to fix the bug.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #41 from Gustaw Smolarczyk wielkiegie@gmail.com --- Unfortunately, still no good.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
Damien Grassart damien@grassart.com changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |damien@grassart.com
--- Comment #42 from Damien Grassart damien@grassart.com --- Hi, is there any progress on this bug? The current kernel makes Dota 2 unplayable on my Radeon HD 7870. I can also confirm that the "another approach" patch with the two lines uncommented fixes the problem (as Gustaw already mentioned above).
Is there anything I can do or provide that might help resolve this?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #43 from Maarten Lankhorst bugs@mblankhorst.nl --- After applying attachment 166571 can you add a usleep(50000); in radeon_fence_enable_signaling after the line "radeon_irq_kms_sw_irq_get(rdev, fence->ring);" but before the line "if (radeon_fence_activity(rdev, fence->ring))"?
It's a hack, to confirm if my suspicions are correct. It will probably make everything a lot slower, but I only care if the pauses are gone or not..
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #44 from Gustaw Smolarczyk wielkiegie@gmail.com --- I think you meant udelay(50000). (I just used msleep(50) and got a nice lock-up since the call was in an atomic context).
Just a sec.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #45 from Gustaw Smolarczyk wielkiegie@gmail.com --- (In the end I had to use mdelay(50))
The performance was really traumatic (8-11fps). But on a positive side, there seemed to be no pauses. There was one moment when fps dropped to 0 (for ~1/3 of a second), but it may have been due to completely different reason.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #46 from Maarten Lankhorst bugs@mblankhorst.nl --- Ok what about udelay(500), should be a smaller pause?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #47 from Gustaw Smolarczyk wielkiegie@gmail.com --- With 500us delay, the performance is nearly back to how it was (~70fps). And the pauses are still gone.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #48 from Maarten Lankhorst bugs@mblankhorst.nl --- Created attachment 168571 --> https://bugzilla.kernel.org/attachment.cgi?id=168571&action=edit flush hdp harder
No idea if this does anything, adding a bunch of flushes instead of doing a sleep.. pauses still occur?
If not can you check if it still works with only the chunk in radeon_fence_enable_signalling?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #49 from Gustaw Smolarczyk wielkiegie@gmail.com --- (had to remove "robj->" from the patch).
Doesn't help.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #50 from Maarten Lankhorst bugs@mblankhorst.nl --- Ok never mind that approach then..
in radeon_fence.c, add a WARN_ON((int)atomic_read(&rdev->irq.ring_int[fence->ring]) <= 0); after the sw_irq_get, it should never fire.
in si.c you have a few lines like this:
if (atomic_read(&rdev->irq.ring_int[RADEON_RING_TYPE_GFX_INDEX])) { DRM_DEBUG("si_irq_set: sw int gfx\n"); cp_int_cntl |= TIME_STAMP_INT_ENABLE; } if (atomic_read(&rdev->irq.ring_int[CAYMAN_RING_TYPE_CP1_INDEX])) { DRM_DEBUG("si_irq_set: sw int cp1\n"); cp_int_cntl1 |= TIME_STAMP_INT_ENABLE; } if (atomic_read(&rdev->irq.ring_int[CAYMAN_RING_TYPE_CP2_INDEX])) { DRM_DEBUG("si_irq_set: sw int cp2\n"); cp_int_cntl2 |= TIME_STAMP_INT_ENABLE; }
Can you first change all of the conditions to atomic_read(..) || 1 so it always sets the irqs? That should fix the hangs at some cpu overhead..
If that works and you no longer get hangs, can you remove the || 1 for RADEON_RING_TYPE_GFX_INDEX, see if the hangs are still gone?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #51 from Gustaw Smolarczyk wielkiegie@gmail.com --- Should I apply it on top of attachment 166571?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #52 from Maarten Lankhorst bugs@mblankhorst.nl --- Yes, and also add a WARN_ON(!rdev->ih.enabled); to radeon_fence_enable_signaling, just to be thorough..
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #53 from Gustaw Smolarczyk wielkiegie@gmail.com --- At first, I added "|| 1" only to GFX, CP1 and CP2 (I omitted the two DMA conditions). This did NOT fix the pauses.
However, after converting these final two DMA conditions, the pauses are gone. Reverting GFX condition didn't change that.
Also - is there an easy way to check whether the WARN_ON have been fired or not?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #54 from Maarten Lankhorst bugs@mblankhorst.nl --- Just look if there's some output in dmesg or not.
To verify, only having interrupts enabled on the DMA conditions fixes your bug?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #55 from Maarten Lankhorst bugs@mblankhorst.nl --- Do you happen to know if it's DMA or DMA1?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
Maarten Lankhorst bugs@mblankhorst.nl changed:
What |Removed |Added ---------------------------------------------------------------------------- Attachment #168571|0 |1 is obsolete| |
--- Comment #56 from Maarten Lankhorst bugs@mblankhorst.nl --- Created attachment 168581 --> https://bugzilla.kernel.org/attachment.cgi?id=168581&action=edit reshuffle evergreen_dma_fence_ring_emit
Try this patch perhaps, and remove the || 1's?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #57 from Gustaw Smolarczyk wielkiegie@gmail.com --- Will try when I return home.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #58 from Gustaw Smolarczyk wielkiegie@gmail.com --- This patch doesn't help.
I have been experimenting with forcing interrupts on only some rings. In the end I have found that by adding "|| 1" only to R600_RING_TYPE_DMA_INDEX ring I have removed the pauses.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #59 from Maarten Lankhorst bugs@mblankhorst.nl --- I was afraid of that, that's the ring used for eviction.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #60 from Maarten Lankhorst bugs@mblankhorst.nl --- can you add a if (fence->ring == R600_RING_TYPE_DMA_INDEX) udelay(50); to radeon_fence_enable_signaling? after the irq_get again. If that works try udelay(5), if it doesn't udelay(500) ?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #61 from Gustaw Smolarczyk wielkiegie@gmail.com --- 5us is ok.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #62 from Gustaw Smolarczyk wielkiegie@gmail.com --- Even 1us works.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #63 from Maarten Lankhorst bugs@mblankhorst.nl --- What about changing the udelay to a RREG32(DMA_CNTL + DMA0_REGISTER_OFFSET); ?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #64 from Gustaw Smolarczyk wielkiegie@gmail.com --- Yes, the fix still works.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #65 from Maarten Lankhorst bugs@mblankhorst.nl --- So attachment 166571 and the rreg32 addition is enough?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #66 from Gustaw Smolarczyk wielkiegie@gmail.com --- It seems that way. I hope others can check that too.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #67 from Maarten Lankhorst bugs@mblankhorst.nl --- what about a rmb(); there instead?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #68 from Gustaw Smolarczyk wielkiegie@gmail.com --- rmb(); doesn't seem to be enough.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #69 from Maarten Lankhorst bugs@mblankhorst.nl --- Hm full mb(); ?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #70 from Gustaw Smolarczyk wielkiegie@gmail.com --- mb(); is ok.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #71 from Maarten Lankhorst bugs@mblankhorst.nl --- Thanks for helping investigate, I will send the attachment as a bugfix to bug 90861, that bug is specifically about the waitqueue ordering in radeon_fence_default_wait. That one was really caused by me.
The bug here just happened to be triggered more reliably because there is a lot less wait time between sw_irq_get and radeon_fence_read. It needs some kind of barrier but I'm not sure which one. The radeon dev Alex Deucher will look into it and talk with some other people about the proper way to fix this.
Thanks for all your help, I'll cc you when I send my bugfix.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
Alex Deucher alexdeucher@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |alexdeucher@gmail.com
--- Comment #72 from Alex Deucher alexdeucher@gmail.com --- Does reading any arbitrary register also work? E.g., SRBM_STATUS
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #73 from Michel Dänzer michel@daenzer.net --- (In reply to Alex Deucher from comment #72)
Does reading any arbitrary register also work? E.g., SRBM_STATUS
I think that would indeed be the right thing to do here, to prevent the PCIe bridge from delaying the register writes.
Probably better to also check the fence status again after enabling the IRQ, to handle the case where the hardware might signal the fence after we check its status but before we enable the IRQ.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #74 from Alex Deucher alexdeucher@gmail.com --- posting reads implemented here: http://cgit.freedesktop.org/~agd5f/linux/log/?h=posting-read
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #75 from Maarten Lankhorst bugs@mblankhorst.nl --- Michel, enabling irqs before checking the fence signaling was exactly what I was doing already, I'm hoping the attachment plus the branch is enough. :)
https://bugzilla.kernel.org/show_bug.cgi?id=90741
Dieter Nützel Dieter@nuetzel-hh.de changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |Dieter@nuetzel-hh.de
--- Comment #76 from Dieter Nützel Dieter@nuetzel-hh.de --- Hello Maarten,
is https://bugzilla.kernel.org/attachment.cgi?id=168581 needed or obsolete with Alex's patches?
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #77 from Gustaw Smolarczyk wielkiegie@gmail.com --- No pauses after I applied attachment 166571 and "drm/radeon: do a posting read in si_set_irq" from posting-read branch on top of v3.19. So I guess these commits combined fix this issue.
@Dieter: I did not need to apply this patch in order to fix this issue.
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #78 from Dieter Nützel Dieter@nuetzel-hh.de --- (In reply to Gustaw Smolarczyk from comment #77)
No pauses after I applied attachment 166571 [details] and "drm/radeon: do a posting read in si_set_irq" from posting-read branch on top of v3.19. So I guess these commits combined fix this issue.
@Dieter: I did not need to apply this patch in order to fix this issue.
Thanks Gustaw!
In the meantime I have it running on r600g/NI/Turks XT with:
drm-next-4.1-wip + HDMI fixes from 'audio-fixes' + ' posting-read' drm-radeon-do-a-posting-read-in-evergreen_set_irq.patch drm-radeon-do-a-posting-read-in-r600_set_irq.patch (better more than less...;-) + attachment 166571 drm-radeon-fix-wait-to-actually-occur-after-the-signaling-callback.patch + drm-radeon-fix-DRM_IOCTL_RADEON_CS-oops.patch
Cheers, Dieter
https://bugzilla.kernel.org/show_bug.cgi?id=90741
--- Comment #79 from Maarten Lankhorst bugs@mblankhorst.nl --- Yeah all patches are obsolete except attachment 166571 and the posting read patches..
https://bugzilla.kernel.org/show_bug.cgi?id=90741
Julien Isorce (julien.isorce@gmail.com) changed:
What |Removed |Added ---------------------------------------------------------------------------- CC| |julien.isorce@gmail.com
--- Comment #80 from Julien Isorce (julien.isorce@gmail.com) --- This bug should be closed: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h...
dri-devel@lists.freedesktop.org