Hi,
since kernel 4.4 I'm unable to log in to the KDE 5 desktop (on openSUSE Tumbleweed). There's a screen with a progress bar showing the startup, which normally fades away after reaching 100%. But with kernel 4.4, the progress gets stuck somewhere between 1/2 and 3/4 (not always at the same point). Top shows that kwin is using a few % of CPU but mostly sleeps in poll(). When I kill it from another console, I see that everything has actually started up; the progress-bar screen was just obscuring it. The windows obviously don't have decorations etc. Starting kwin manually brings back the progress-bar screen at the same position.
I suspected that kwin was waiting for some event, but nevertheless tried bisecting the kernel between 4.3 and 4.4, which led to:
# first bad commit: [4dfd64862ff852df7b1198d667dda778715ee88f] drm: Use vblank timestamps to guesstimate how many vblanks were missed
I can confirm that 4.4 works if I revert the following commits:
63154ff230fc9255cc507af6277cd181943c50a1 "drm/amdgpu: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v3)"
d1145ad1e41b6c33758a856163198cb53bb96a50 "drm/radeon: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v2)"
31ace027c9f1f8e0a2b09bbf961e4db7b1f6cf19 "drm: Don't zero vblank timestamps from the irq handler"
ac0567a4b132fa66e3edf3f913938af9daf7f916 "drm: Add DRM_DEBUG_VBL()"
4dfd64862ff852df7b1198d667dda778715ee88f "drm: Use vblank timestamps to guesstimate how many vblanks were missed"
All are clean reverts; they just need a fixup on top to use abs() instead of abs64() due to 79211c8ed19c055ca105502c8733800d442a0ae6.
Unfortunately I don't know if this is a kernel problem or a kwin problem. I tried to CC the maintainers of both; advice on what to try or what info to provide is welcome. The card is "CAICOS" with 1GB memory.
Thanks, Vlastimil
On Fri, Jan 15, 2016 at 11:34:08AM +0100, Vlastimil Babka wrote:
Hi,
since kernel 4.4 I'm unable to log in to the KDE 5 desktop (on openSUSE Tumbleweed). There's a screen with a progress bar showing the startup, which normally fades away after reaching 100%. But with kernel 4.4, the progress gets stuck somewhere between 1/2 and 3/4 (not always at the same point). Top shows that kwin is using a few % of CPU but mostly sleeps in poll(). When I kill it from another console, I see that everything has actually started up; the progress-bar screen was just obscuring it. The windows obviously don't have decorations etc. Starting kwin manually brings back the progress-bar screen at the same position.
Hmm. Sounds like it could then be waiting for a vblank in the distant future. There's a 1<<23 limit in the code, but even with that we end up with a max wait of ~38 hours assuming a 60Hz refresh rate.
Stuff to try might include enabling drm.debug=0x2f, though that'll generate a lot of stuff. Another option would be to use the drm vblank tracepoints to try and catch what seq number it's waiting for and where we're at currently. Or I suppose you could just hack up drm_wait_vblank() to print an error message or something if the requested seq number is in the future by, say, more than a few seconds, and if that's the case then we could try to figure out why that happens.
I suspected that kwin was waiting for some event, but nevertheless tried bisecting the kernel between 4.3 and 4.4, which led to:
# first bad commit: [4dfd64862ff852df7b1198d667dda778715ee88f] drm: Use vblank timestamps to guesstimate how many vblanks were missed
I can confirm that 4.4 works if I revert the following commits:
63154ff230fc9255cc507af6277cd181943c50a1 "drm/amdgpu: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v3)"
d1145ad1e41b6c33758a856163198cb53bb96a50 "drm/radeon: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v2)"
The sha1s don't seem to match what I have, so I'm not sure which kernel tree you have, but looking at the radeon commit, at least one thing immediately caught my attention:
+	/* Bump counter if we are at >= leading edge of vblank,
+	 * but before vsync where vpos would turn negative and
+	 * the hw counter really increments.
+	 */
+	if (vpos >= 0)
+		count++;
It's rather hard to see what it's really doing, since the custom flags to get_scanout_position now cause it to return non-standard things. But if I'm reading things correctly it should really say something like:
	if (vpos >= 0 && vpos < (vsync_start - vblank_start))
		count++;
Hmm. Actually even that might not be correct, since it could be using the "fake" vblank start here, so it might need to be something like:
	if (vpos >= 0 && vpos < (vsync_start - vblank_start + lb_vblank_lead_lines))
		count++;
Also might be worth a shot to just ignore the hw frame counter. Eg.:
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
index e266ffc520d2..db728580549a 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -492,7 +492,6 @@ static struct drm_driver kms_driver = {
 	.lastclose = radeon_driver_lastclose_kms,
 	.set_busid = drm_pci_set_busid,
 	.unload = radeon_driver_unload_kms,
-	.get_vblank_counter = radeon_get_vblank_counter_kms,
 	.enable_vblank = radeon_enable_vblank_kms,
 	.disable_vblank = radeon_disable_vblank_kms,
 	.get_vblank_timestamp = radeon_get_vblank_timestamp_kms,
diff --git a/drivers/gpu/drm/radeon/radeon_irq_kms.c b/drivers/gpu/drm/radeon/radeon_irq_kms.c
index 979f3bf65f2c..3c5fcab74152 100644
--- a/drivers/gpu/drm/radeon/radeon_irq_kms.c
+++ b/drivers/gpu/drm/radeon/radeon_irq_kms.c
@@ -152,11 +152,6 @@ int radeon_driver_irq_postinstall_kms(struct drm_device *dev)
 {
 	struct radeon_device *rdev = dev->dev_private;
 
-	if (ASIC_IS_AVIVO(rdev))
-		dev->max_vblank_count = 0x00ffffff;
-	else
-		dev->max_vblank_count = 0x001fffff;
-
 	return 0;
 }
assuming I'm reading the code correctly.
31ace027c9f1f8e0a2b09bbf961e4db7b1f6cf19 "drm: Don't zero vblank timestamps from the irq handler"
ac0567a4b132fa66e3edf3f913938af9daf7f916 "drm: Add DRM_DEBUG_VBL()"
4dfd64862ff852df7b1198d667dda778715ee88f "drm: Use vblank timestamps to guesstimate how many vblanks were missed"
All are clean reverts; they just need a fixup on top to use abs() instead of abs64() due to 79211c8ed19c055ca105502c8733800d442a0ae6.
Unfortunately I don't know if this is a kernel problem or a kwin problem. I tried to CC the maintainers of both; advice on what to try or what info to provide is welcome. The card is "CAICOS" with 1GB memory.
Thanks, Vlastimil
On 01/15/2016 01:26 PM, Ville Syrjälä wrote:
On Fri, Jan 15, 2016 at 11:34:08AM +0100, Vlastimil Babka wrote:
I suspected that kwin was waiting for some event, but nevertheless tried bisecting the kernel between 4.3 and 4.4, which led to:
# first bad commit: [4dfd64862ff852df7b1198d667dda778715ee88f] drm: Use vblank timestamps to guesstimate how many vblanks were missed
I can confirm that 4.4 works if I revert the following commits:
63154ff230fc9255cc507af6277cd181943c50a1 "drm/amdgpu: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v3)"
d1145ad1e41b6c33758a856163198cb53bb96a50 "drm/radeon: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v2)"
The sha1s don't seem to match what I have, so not sure which kernel tree you have
Hm sorry, I pasted the sha1s of the reverts by mistake. The correct sha1s are:
5b5561b3660db734652fbd02b4b6cbe00434d96b "drm/radeon: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v2)"
fa4270d8e0257b4b76f11baa2866f4313d29aaf5 "drm: Don't zero vblank timestamps from the irq handler"
235fabe09b46469adad2c9e4cb0563758155187c "drm: Add DRM_DEBUG_VBL()"
4dfd64862ff852df7b1198d667dda778715ee88f "drm: Use vblank timestamps to guesstimate how many vblanks were missed"
8e36f9d33c134d5c6448ad65b423a9fd94e045cf "drm/amdgpu: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v3)"
Also, it turns out that the process actually showing the progress is "ksplashqml", not kwin. It survives killing kwin, and restarting kwin just brings it back on top. If I force-kill ksplashqml instead of kwin, the desktop works, including decorations and everything. ksplashqml itself also waits in the kernel in poll().
I'll try some of your suggestions, thanks!
On 01/15/2016 01:26 PM, Ville Syrjälä wrote:
On Fri, Jan 15, 2016 at 11:34:08AM +0100, Vlastimil Babka wrote:
Hi,
since kernel 4.4 I'm unable to log in to the KDE 5 desktop (on openSUSE Tumbleweed). There's a screen with a progress bar showing the startup, which normally fades away after reaching 100%. But with kernel 4.4, the progress gets stuck somewhere between 1/2 and 3/4 (not always at the same point). Top shows that kwin is using a few % of CPU but mostly sleeps in poll(). When I kill it from another console, I see that everything has actually started up; the progress-bar screen was just obscuring it. The windows obviously don't have decorations etc. Starting kwin manually brings back the progress-bar screen at the same position.
Depressing. I was stress-testing those patches with Linux 4.4 for days on 2 AMD gpus (HD-4000 RV730 and HD-5770) under KDE 5 Plasma 5.4.2 (KUbuntu 15.10, XOrg 1.17), and retested Linux 4.4 on nouveau/radeon/intel (also with XOrg 1.18 and XOrg master) a few days ago, and never encountered such a hang or other vblank-related problem on KDE-5 or GNOME-3.
I'm currently running...
while xinit /usr/bin/ksplashqml --test -- :1 ; do echo yay; done
... in an endless loop on Linux 4.4 SMP PREEMPT on the HD-5770, and so far I can't trigger a hang after hundreds of runs.
Does this also hang for you?
Hmm. Sounds like it could then be waiting for a vblank in the distant future. There's a 1<<23 limit in the code, but even with that we end up with a max wait of ~38 hours assuming a 60Hz refresh rate.
xtrace suggests that ksplashqml seems to use classic OpenGL + glXSwapBuffers under DRI2. So no clever swap scheduling based on vblank counter values.
Stuff to try might include enabling drm.debug=0x2f, though that'll generate a lot of stuff. Another option would be to use the drm vblank tracepoints to try and catch what seq number it's waiting for and where we're at currently. Or I suppose you could just hack up drm_wait_vblank() to print an error message or something if the requested seq number is in the future by, say, more than a few seconds, and if that's the case then we could try to figure out why that happens.
I suspected that kwin was waiting for some event, but nevertheless tried bisecting the kernel between 4.3 and 4.4, which led to:
# first bad commit: [4dfd64862ff852df7b1198d667dda778715ee88f] drm: Use vblank timestamps to guesstimate how many vblanks were missed
I can confirm that 4.4 works if I revert the following commits:
63154ff230fc9255cc507af6277cd181943c50a1 "drm/amdgpu: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v3)"
d1145ad1e41b6c33758a856163198cb53bb96a50 "drm/radeon: Fixup hw vblank counter/ts for new drm_update_vblank_count() (v2)"
The sha1s don't seem to match what I have, so I'm not sure which kernel tree you have, but looking at the radeon commit, at least one thing immediately caught my attention:
+	/* Bump counter if we are at >= leading edge of vblank,
+	 * but before vsync where vpos would turn negative and
+	 * the hw counter really increments.
+	 */
+	if (vpos >= 0)
+		count++;
It's rather hard to see what it's really doing, since the custom flags to get_scanout_position now cause it to return non-standard things. But if I'm reading things correctly it should really say something like:
	if (vpos >= 0 && vpos < (vsync_start - vblank_start))
		count++;
Hmm. Actually even that might not be correct, since it could be using the "fake" vblank start here, so it might need to be something like:
	if (vpos >= 0 && vpos < (vsync_start - vblank_start + lb_vblank_lead_lines))
		count++;
The current code should be correct. vpos here returns the distance of the hw vertical scanout position to the start of vblank. According to Alex and Harry Wentland of AMD's display team, and my testing on my two cards, the hw vertical scanout position resets to zero at the start line of vsync, so the "vpos" in that code becomes negative at the start of vsync. At the same time the hw frame counter increments by one, making the "count++" bump of the returned count no longer necessary.
If the reset of the hw vertical scanout position to zero and the increment of the hw frame counter didn't happen at exactly the same time at the start of vsync, I could see how two successive queries of driver->get_vblank_counter() could report a count of N+1 and then N, if the timing of both calls were just right. That would cause the DRM code to falsely detect counter wraparound and jump the vblank counter forward by 2^24.
My tested gpus had DCE-3 or DCE-4 display engines; Caicos has DCE-5, so could this be some hw quirk for DCE-5?
Also might be worth a shot to just ignore the hw frame counter. Eg.:
diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c
index e266ffc520d2..db728580549a 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -492,7 +492,6 @@ static struct drm_driver kms_driver = {
 	.lastclose = radeon_driver_lastclose_kms,
 	.set_busid = drm_pci_set_busid,
 	.unload = radeon_driver_unload_kms,
-	.get_vblank_counter = radeon_get_vblank_counter_kms,
 	.enable_vblank = radeon_enable_vblank_kms,
 	.disable_vblank = radeon_disable_vblank_kms,
 	.get_vblank_timestamp = radeon_get_vblank_timestamp_kms,
diff --git a/drivers/gpu/drm/radeon/radeon_irq_kms.c b/drivers/gpu/drm/radeon/radeon_irq_kms.c
index 979f3bf65f2c..3c5fcab74152 100644
--- a/drivers/gpu/drm/radeon/radeon_irq_kms.c
+++ b/drivers/gpu/drm/radeon/radeon_irq_kms.c
@@ -152,11 +152,6 @@ int radeon_driver_irq_postinstall_kms(struct drm_device *dev)
 {
 	struct radeon_device *rdev = dev->dev_private;
 
-	if (ASIC_IS_AVIVO(rdev))
-		dev->max_vblank_count = 0x00ffffff;
-	else
-		dev->max_vblank_count = 0x001fffff;
-
 	return 0;
 }
assuming I'm reading the code correctly.
31ace027c9f1f8e0a2b09bbf961e4db7b1f6cf19 "drm: Don't zero vblank timestamps from the irq handler"
ac0567a4b132fa66e3edf3f913938af9daf7f916 "drm: Add DRM_DEBUG_VBL()"
4dfd64862ff852df7b1198d667dda778715ee88f "drm: Use vblank timestamps to guesstimate how many vblanks were missed"
All are clean reverts; they just need a fixup on top to use abs() instead of abs64() due to 79211c8ed19c055ca105502c8733800d442a0ae6.
Unfortunately I don't know if this is a kernel problem or a kwin problem. I tried to CC the maintainers of both; advice on what to try or what info to provide is welcome. The card is "CAICOS" with 1GB memory.
I think a drm.debug=0x21 setting and grep'ping the syslog for "vblank" should probably give useful info around the time of the hang.
Maybe also check XOrg.0.log for (WW) warnings related to flip.
thanks, -mario
Thanks, Vlastimil
On 01/16/2016 05:24 AM, Mario Kleiner wrote:
On 01/15/2016 01:26 PM, Ville Syrjälä wrote:
On Fri, Jan 15, 2016 at 11:34:08AM +0100, Vlastimil Babka wrote:
I'm currently running...
while xinit /usr/bin/ksplashqml --test -- :1 ; do echo yay; done
... in an endless loop on Linux 4.4 SMP PREEMPT on the HD-5770, and so far I can't trigger a hang after hundreds of runs.
Does this also hang for you?
No, test mode seems to be fine.
I think a drm.debug=0x21 setting and grep'ping the syslog for "vblank" should probably give useful info around the time of the hang.
Attached. Captured by having kdm running, switching to a console, running "dmesg -C ; dmesg -w > /tmp/dmesg", switching back to kdm, entering the password, seeing the frozen splash screen, switching back, and terminating dmesg. Somewhere around the middle should be where ksplashqml starts...
Maybe also check XOrg.0.log for (WW) warnings related to flip.
No such warnings there.
thanks, -mario
Thanks, Vlastimil
On 01/18/2016 11:49 AM, Vlastimil Babka wrote:
On 01/16/2016 05:24 AM, Mario Kleiner wrote:
I think a drm.debug=0x21 setting and grep'ping the syslog for "vblank" should probably give useful info around the time of the hang.
Attached. Captured by having kdm running, switching to a console, running "dmesg -C ; dmesg -w > /tmp/dmesg", switching back to kdm, entering the password, seeing the frozen splash screen, switching back, and terminating dmesg. Somewhere around the middle should be where ksplashqml starts...
Maybe also check XOrg.0.log for (WW) warnings related to flip.
No such warnings there.
This is how the gdb backtraces look for the 4 threads of the stuck ksplashqml. Thread 3 seems to be waiting on some response to radeon's ioctl?
(gdb) info threads
  Id   Target Id         Frame
  4    Thread 0x7feb296f5700 (LWP 3643) "QXcbEventReader" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  3    Thread 0x7feb199f8700 (LWP 3644) "ksplashqml" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  2    Thread 0x7feb18ff2700 (LWP 3645) "QQmlThread" 0x00007feb392bd24d in poll () at ../sysdeps/unix/syscall-template.S:84
* 1    Thread 0x7feb3b79f8c0 (LWP 3642) "ksplashqml" 0x00007feb392bd24d in poll () at ../sysdeps/unix/syscall-template.S:84
(gdb) bt
#0  0x00007feb392bd24d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007feb32509432 in poll (__timeout=-1, __nfds=1, __fds=0x7ffce30ffb50) at /usr/include/bits/poll2.h:46
#2  _xcb_conn_wait (c=c@entry=0x17e25c0, cond=cond@entry=0x7ffce30ffc70, vector=vector@entry=0x0, count=count@entry=0x0) at xcb_conn.c:459
#3  0x00007feb3250ad57 in wait_for_reply (c=c@entry=0x17e25c0, request=request@entry=883, e=e@entry=0x7ffce30ffd48) at xcb_in.c:516
#4  0x00007feb3250aec1 in xcb_wait_for_reply64 (c=c@entry=0x17e25c0, request=883, e=e@entry=0x7ffce30ffd48) at xcb_in.c:560
#5  0x00007feb32b80300 in _XReply (dpy=dpy@entry=0x17e12c0, rep=rep@entry=0x7ffce30ffdc0, extra=extra@entry=0, discard=discard@entry=0) at xcb_io.c:596
#6  0x00007feb36eda712 in DRI2GetBuffersWithFormat (dpy=0x17e12c0, drawable=12582924, width=width@entry=0x181e528, height=height@entry=0x181e52c, attachments=0x7ffce30fff10, count=1, outCount=0x7ffce30ffef0) at dri2.c:491
#7  0x00007feb36edaa17 in dri2GetBuffersWithFormat (driDrawable=<optimized out>, width=0x181e528, height=0x181e52c, attachments=<optimized out>, count=<optimized out>, out_count=0x7ffce30ffef0, loaderPrivate=0x1fb1290) at dri2_glx.c:900
#8  0x00007feb20132618 in dri2_drawable_get_buffers (count=<synthetic pointer>, atts=0x1817da0, drawable=0x1816d20) at dri2.c:213
#9  dri2_allocate_textures (ctx=0x1a453d0, drawable=0x1816d20, statts=0x1817da0, statts_count=2) at dri2.c:407
#10 0x00007feb2012f17c in dri_st_framebuffer_validate (stctx=<optimized out>, stfbi=<optimized out>, statts=0x1817da0, count=2, out=0x7ffce3100050) at dri_drawable.c:83
#11 0x00007feb2005b5fe in st_framebuffer_validate (stfb=0x1817940, st=st@entry=0x1b11f20) at state_tracker/st_manager.c:200
#12 0x00007feb2005c88e in st_api_make_current (stapi=<optimized out>, stctxi=0x1b11f20, stdrawi=0x1816d20, streadi=0x1816d20) at state_tracker/st_manager.c:831
#13 0x00007feb2012ecd1 in dri_make_current (cPriv=<optimized out>, driDrawPriv=0x181e500, driReadPriv=0x181e500) at dri_context.c:245
#14 0x00007feb2012dcb6 in driBindContext (pcp=<optimized out>, pdp=<optimized out>, prp=<optimized out>) at dri_util.c:531
#15 0x00007feb36edc38b in dri2_bind_context (context=0x1a70960, old=<optimized out>, draw=12582924, read=12582924) at dri2_glx.c:160
#16 0x00007feb36eb99b7 in MakeContextCurrent (dpy=0x17e12c0, draw=draw@entry=12582924, read=read@entry=12582924, gc_user=0x1a70960) at glxcurrent.c:228
#17 0x00007feb36eb9b3b in glXMakeCurrent (dpy=<optimized out>, draw=draw@entry=12582924, gc=<optimized out>) at glxcurrent.c:262
#18 0x00007feb288d9a2d in QGLXContext::makeCurrent (this=0x1a48760, surface=0x1a0ac40) at qglxintegration.cpp:476
#19 0x00007feb3a0f8750 in QOpenGLContext::makeCurrent (this=0x18401e0, surface=0x1842d90) at kernel/qopenglcontext.cpp:936
#20 0x00007feb3af63aef in QSGGuiThreadRenderLoop::renderWindow (this=this@entry=0x1913f50, window=0x1842d80) at /usr/src/debug/qtdeclarative-opensource-src-5.5.1/src/quick/scenegraph/qsgrenderloop.cpp:341
#21 0x00007feb3af64d11 in QSGGuiThreadRenderLoop::event (this=0x1913f50, e=<optimized out>) at /usr/src/debug/qtdeclarative-opensource-src-5.5.1/src/quick/scenegraph/qsgrenderloop.cpp:474
#22 0x00007feb39b7fbd9 in QCoreApplication::notify (this=<optimized out>, receiver=<optimized out>, event=<optimized out>) at kernel/qcoreapplication.cpp:1038
#23 0x00007feb39b7fcf3 in QCoreApplication::notifyInternal (this=0x7ffce3100740, receiver=0x1913f50, event=event@entry=0x7ffce31004c0) at kernel/qcoreapplication.cpp:965
#24 0x00007feb39bd23bd in sendEvent (event=0x7ffce31004c0, receiver=<optimized out>) at ../../src/corelib/kernel/qcoreapplication.h:224
#25 QTimerInfoList::activateTimers (this=0x183e220) at kernel/qtimerinfo_unix.cpp:637
#26 0x00007feb39bd2909 in timerSourceDispatch (source=<optimized out>) at kernel/qeventdispatcher_glib.cpp:177
#27 idleTimerSourceDispatch (source=<optimized out>) at kernel/qeventdispatcher_glib.cpp:224
#28 0x00007feb35f4c097 in g_main_dispatch (context=0x7feb240016f0) at gmain.c:3154
#29 g_main_context_dispatch (context=context@entry=0x7feb240016f0) at gmain.c:3769
#30 0x00007feb35f4c2c8 in g_main_context_iterate (context=context@entry=0x7feb240016f0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3840
#31 0x00007feb35f4c36c in g_main_context_iteration (context=0x7feb240016f0, may_block=may_block@entry=1) at gmain.c:3901
#32 0x00007feb39bd350f in QEventDispatcherGlib::processEvents (this=0x183f080, flags=...) at kernel/qeventdispatcher_glib.cpp:418
#33 0x00007feb39b7d63a in QEventLoop::exec (this=this@entry=0x7ffce31006e0, flags=..., flags@entry=...) at kernel/qeventloop.cpp:204
#34 0x00007feb39b852fd in QCoreApplication::exec () at kernel/qcoreapplication.cpp:1229
#35 0x00007feb3a0bb53c in QGuiApplication::exec () at kernel/qguiapplication.cpp:1527
#36 0x0000000000405ce1 in main (argc=3, argv=0x7ffce3100878) at /usr/src/debug/plasma-workspace-5.5.2/ksplash/ksplashqml/main.cpp:98
(gdb) thread 2
[Switching to thread 2 (Thread 0x7feb18ff2700 (LWP 3645))]
#0  0x00007feb392bd24d in poll () at ../sysdeps/unix/syscall-template.S:84
84      in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0  0x00007feb392bd24d in poll () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007feb35f4c264 in g_main_context_poll (priority=2147483647, n_fds=1, fds=0x7feb14003070, timeout=<optimized out>, context=0x7feb14000990) at gmain.c:4135
#2  g_main_context_iterate (context=context@entry=0x7feb14000990, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at gmain.c:3835
#3  0x00007feb35f4c36c in g_main_context_iteration (context=0x7feb14000990, may_block=may_block@entry=1) at gmain.c:3901
#4  0x00007feb39bd350f in QEventDispatcherGlib::processEvents (this=0x7feb140008c0, flags=...) at kernel/qeventdispatcher_glib.cpp:418
#5  0x00007feb39b7d63a in QEventLoop::exec (this=this@entry=0x7feb18ff1cf0, flags=..., flags@entry=...) at kernel/qeventloop.cpp:204
#6  0x00007feb399a9b1c in QThread::exec (this=this@entry=0x184dc00) at thread/qthread.cpp:503
#7  0x00007feb38c799a5 in QQmlThreadPrivate::run (this=0x184dc00) at /usr/src/debug/qtdeclarative-opensource-src-5.5.1/src/qml/qml/ftw/qqmlthread.cpp:141
#8  0x00007feb399ae94f in QThreadPrivate::start (arg=0x184dc00) at thread/qthread_unix.cpp:331
#9  0x00007feb37a3d4a4 in start_thread (arg=0x7feb18ff2700) at pthread_create.c:334
#10 0x00007feb392c5bdd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) thread 3
[Switching to thread 3 (Thread 0x7feb199f8700 (LWP 3644))]
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185     ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007feb20496e63 in cnd_wait (mtx=0x18abc90, cond=0x18abcb8) at ../../../../../include/c11/threads_posix.h:159
#2  pipe_semaphore_wait (sema=0x18abc90) at ../../../../../src/gallium/auxiliary/os/os_thread.h:259
#3  radeon_drm_cs_emit_ioctl (param=param@entry=0x18ab940) at radeon_drm_winsys.c:653
#4  0x00007feb204966a7 in impl_thrd_routine (p=<optimized out>) at ../../../../../include/c11/threads_posix.h:87
#5  0x00007feb37a3d4a4 in start_thread (arg=0x7feb199f8700) at pthread_create.c:334
#6  0x00007feb392c5bdd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) thread 4
[Switching to thread 4 (Thread 0x7feb296f5700 (LWP 3643))]
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185     in ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S
(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007feb32509689 in _xcb_conn_wait (c=c@entry=0x17e25c0, cond=cond@entry=0x17e2600, vector=vector@entry=0x0, count=count@entry=0x0) at xcb_conn.c:427
#2  0x00007feb3250b007 in xcb_wait_for_event (c=0x17e25c0) at xcb_in.c:693
#3  0x00007feb2ba48e29 in QXcbEventReader::run (this=0x17f55d0) at qxcbconnection.cpp:1229
#4  0x00007feb399ae94f in QThreadPrivate::start (arg=0x17f55d0) at thread/qthread_unix.cpp:331
#5  0x00007feb37a3d4a4 in start_thread (arg=0x7feb296f5700) at pthread_create.c:334
#6  0x00007feb392c5bdd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 3 seems to be waiting on some response to radeon's ioctl?
That's just the worker thread waiting for something to do. At least for this case you can ignore it.
The interesting one is calling DRI2GetBuffersWithFormat and waiting for a reply. For some reason the X server seems to be stuck waiting (most likely) for a page flip.
Regards, Christian.
On 18.01.2016 at 15:06, Vlastimil Babka wrote:
I'm currently running...
while xinit /usr/bin/ksplashqml --test -- :1 ; do echo yay; done
... in an endless loop on Linux 4.4 SMP PREEMPT on HD-5770 and so far I can't trigger a hang after hundreds of runs.
Does this also hang for you?
No, test mode seems to be fine.
I think a drm.debug=0x21 setting and grep'ping the syslog for "vblank" should probably give useful info around the time of the hang.
Attached. Captured by having kdm running, switching to a console, running "dmesg -C ; dmesg -w > /tmp/dmesg", switching back to kdm, entering the password, seeing the frozen splash screen, then switching back and terminating dmesg. So somewhere around the middle should be where ksplashscreen starts...
Hmm, this looks suspicious ("!!!" markers mine):
[  538.918990] [drm:drm_update_vblank_count] updating vblank count on crtc 1: current=385876589, diff=1, hw=622 hw_last=621
[  538.918991] [drm:evergreen_irq_process] IH: D2 vblank
[  538.935035] [drm:radeon_get_vblank_counter_kms] crtc 0: dist from vblank start 3
[  538.935040] [drm:radeon_get_vblank_counter_kms] crtc 0: dist from vblank start 3
[  538.935041] [drm:drm_update_vblank_count] updating vblank count on crtc 0: current=16808100, diff=1, hw=30885 hw_last=30884
[  538.935042] [drm:evergreen_irq_process] IH: D1 vblank
[  538.939702] [drm:drm_wait_vblank] waiting on vblank count 385876590, crtc 1
[  538.939704] [drm:drm_wait_vblank] returning 385876590 to client
[  538.939709] [drm:drm_wait_vblank] waiting on vblank count 385876590, crtc 1
[  538.939710] [drm:drm_wait_vblank] returning 385876590 to client
!!!538.939715] [drm:drm_queue_vblank_event] event on vblank count 385876591, current 385876590, crtc 1
[  538.944452] [drm:drm_wait_vblank] waiting on vblank count 16808101, crtc 0
[  538.944453] [drm:drm_wait_vblank] returning 16808101 to client
[  538.944458] [drm:drm_wait_vblank] waiting on vblank count 16808101, crtc 0
[  538.944460] [drm:drm_wait_vblank] returning 16808101 to client
[  538.944465] [drm:drm_queue_vblank_event] event on vblank count 16808102, current 16808101, crtc 0
[  538.948210] [drm:drm_wait_vblank] waiting on vblank count 16808101, crtc 0
[  538.948212] [drm:drm_wait_vblank] returning 16808101 to client
[  538.948222] [drm:drm_wait_vblank] waiting on vblank count 16808101, crtc 0
[  538.948224] [drm:drm_wait_vblank] returning 16808101 to client
[  538.949589] [drm:drm_wait_vblank] waiting on vblank count 16808101, crtc 0
[  538.949591] [drm:drm_wait_vblank] returning 16808101 to client
[  538.951238] [drm:radeon_get_vblank_counter_kms] crtc 1: dist from vblank start 6
[  538.951245] [drm:radeon_get_vblank_counter_kms] crtc 1: dist from vblank start 7
!!!538.951246] [drm:drm_update_vblank_count] updating vblank count on crtc 1: current=385876590, diff=16776597, hw=3 hw_last=622
[  538.951247] [drm:evergreen_irq_process] IH: D2 vblank
[  538.951746] [drm:radeon_get_vblank_counter_kms] crtc 0: dist from vblank start 4
[  538.951752] [drm:radeon_get_vblank_counter_kms] crtc 0: dist from vblank start 4
[  538.951753] [drm:drm_update_vblank_count] updating vblank count on crtc 0: current=16808101, diff=1, hw=30886 hw_last=30885
[  538.951754] [drm:drm_handle_vblank_events] vblank event on 16808102, current 16808102
[  538.951756] [drm:evergreen_irq_process] IH: D1 vblank
[  538.964570] [drm:radeon_get_vblank_counter_kms] crtc 1: dist from vblank start 7
[  538.964581] [drm:radeon_get_vblank_counter_kms] crtc 1: dist from vblank start -1058
[  538.964583] [drm:drm_update_vblank_count] updating vblank count on crtc 1: current=402653187, diff=1, hw=4 hw_last=3
Could it be that the underflow caused some signed logic to misbehave and fail to detect that we passed 385876591?
Later we have another such big skip (but this time nothing waits for it I guess):
[ 541.337813] [drm:drm_update_vblank_count] updating vblank count on crtc 1: current=402653363, diff=16777040, hw=3 hw_last=179
Maybe also check XOrg.0.log for (WW) warnings related to flip.
No such warnings there.
thanks, -mario
Thanks, Vlastimil
On 01/18/2016 11:49 AM, Vlastimil Babka wrote:
Thanks. So the problem is that AMD's hardware frame counters reset to zero during a modeset. The old DRM code dealt with such drivers by keeping vblank irqs enabled during modesets and incrementing the vblank count by one on each vblank irq; I think that's what drm_vblank_pre_modeset() and drm_vblank_post_modeset() were meant for.
The new code in drm_update_vblank_count() breaks this: the reset of the counter to zero is treated as a counter wraparound, so our software vblank counter jumps forward by up to 2^24 counts in response (in the case of AMD's 24-bit hw counters). The vblank event handling code in drm_handle_vblank_events() and other places then detects the counter being more than 2^23 counts ahead of queued vblank events and, as part of its own wraparound handling for the 32-bit software counter, doesn't deliver these queued events for a long time -> no vblank swap trigger event -> no swap -> client hangs waiting for swap completion.
I think I remember seeing the ksplash progress screen occasionally blank halfway through login; I guess that's when kwin triggers a modeset in parallel with ksplash doing its OpenGL animations. So depending on the hw vblank count at login time, ksplash would or wouldn't hang; apparently I got "lucky" with my counts at login.
-mario
On 21.01.2016 05:32, Mario Kleiner wrote:
So the problem is that AMDs hardware frame counters reset to zero during a modeset. The old DRM code dealt with drivers doing that by keeping vblank irqs enabled during modesets and incrementing vblank count by one during each vblank irq, i think that's what drm_vblank_pre_modeset() and drm_vblank_post_modeset() were meant for.
Right, looks like there's been a regression breaking this. I suspect the problem is that vblank->last isn't getting updated from drm_vblank_post_modeset. Not sure which change broke that though, or how to fix it. Ville?
BTW, I'm seeing a similar issue with drm_vblank_on/off as well, which exposed the bug fixed by 209e4dbc ("drm/vblank: Use u32 consistently for vblank counters"). I've been meaning to track that down since then; one of these days hopefully, but if anybody has any ideas offhand...
On 01/21/2016 04:43 AM, Michel Dänzer wrote:
On 21.01.2016 05:32, Mario Kleiner wrote:
So the problem is that AMDs hardware frame counters reset to zero during a modeset. The old DRM code dealt with drivers doing that by keeping vblank irqs enabled during modesets and incrementing vblank count by one during each vblank irq, i think that's what drm_vblank_pre_modeset() and drm_vblank_post_modeset() were meant for.
Right, looks like there's been a regression breaking this. I suspect the problem is that vblank->last isn't getting updated from drm_vblank_post_modeset. Not sure which change broke that though, or how to fix it. Ville?
The whole logic has changed and the software counter updates are now driven all the time by the hw counter.
BTW, I'm seeing a similar issue with drm_vblank_on/off as well, which exposed the bug fixed by 209e4dbc ("drm/vblank: Use u32 consistently for vblank counters"). I've been meaning to track that down since then; one of these days hopefully, but if anybody has any ideas offhand...
I spent the last few hours reading through the drm and radeon code and I think what should work is to replace the drm_vblank_pre/post_modeset calls in radeon/amdgpu with drm_vblank_off/on calls. These are apparently meant for drivers whose hw counters reset during a modeset, and seem to reinitialize things properly and release clients' queued vblank events to avoid blocking. Not tested so far, I've just looked at the code.
Once drm_vblank_off is called, drm_vblank_get will no-op and return an error, so clients can't enable vblank irqs during the modeset; the pageflip and waitvblank ioctls would fail while a modeset happens. Hopefully userspace handles this correctly everywhere.
It would also cause radeon's power management not to sync its actions to vblank if invoked during a modeset, but that seems to be handled by a 200 msec timeout and would hopefully only cause visual glitches - or invisible ones while the crtc is blanked during the modeset?
There could be another tiny race with the new "vblank counter bumping" logic from commit 5b5561b ("drm/radeon: Fixup hw vblank counters/ts ..."): drm_update_vblank_count() could be called multiple times in quick succession within the "radeon_crtc->lb_vblank_lead_lines" scanlines before the start of the real vblank while, at the same time, a modeset sets radeon_crtc->lb_vblank_lead_lines to a smaller value due to a change in horizontal resolution. That needs a modeset to a higher horizontal resolution to happen exactly when the scanout is in exactly the right 5 or so scanlines, with some client calling drm_vblank_get() to enable vblank irqs at the same time, but it would cause the same hang if it happened. Not likely to happen often, but still not nice; also, Murphy's law... If we could switch from drm_vblank_pre/post_modeset to drm_vblank_off/on we could remove that race as well, by forbidding any vblank irq related activity during a modeset.
I'll hack up a patch for demonstration now.
On 21.01.2016 14:31, Mario Kleiner wrote:
On 01/21/2016 04:43 AM, Michel Dänzer wrote:
On 21.01.2016 05:32, Mario Kleiner wrote:
So the problem is that AMDs hardware frame counters reset to zero during a modeset. The old DRM code dealt with drivers doing that by keeping vblank irqs enabled during modesets and incrementing vblank count by one during each vblank irq, i think that's what drm_vblank_pre_modeset() and drm_vblank_post_modeset() were meant for.
Right, looks like there's been a regression breaking this. I suspect the problem is that vblank->last isn't getting updated from drm_vblank_post_modeset. Not sure which change broke that though, or how to fix it. Ville?
The whole logic has changed and the software counter updates are now driven all the time by the hw counter.
BTW, I'm seeing a similar issue with drm_vblank_on/off as well, which exposed the bug fixed by 209e4dbc ("drm/vblank: Use u32 consistently for vblank counters"). I've been meaning to track that down since then; one of these days hopefully, but if anybody has any ideas offhand...
I spent the last few hours reading through the drm and radeon code and i think what should probably work is to replace the drm_vblank_pre/post_modeset calls in radeon/amdgpu by drm_vblank_off/on calls. These are apparently meant for drivers whose hw counters reset during modeset, [...]
... just like drm_vblank_pre/post_modeset. That those were broken is a regression which needs to be fixed anyway. I don't think switching to drm_vblank_on/off is suitable for stable trees.
Looking at Vlastimil's original post again, I'd say the most likely culprit is 4dfd6486 ("drm: Use vblank timestamps to guesstimate how many vblanks were missed").
Once drm_vblank_off is called, drm_vblank_get will no-op and return an error, so clients can't enable vblank irqs during the modeset - pageflip ioctl and waitvblank ioctl would fail while a modeset happens - hopefully userspace handles this correctly everywhere.
We've fixed xf86-video-ati for this.
I'll hack up a patch for demonstration now.
You're a bit late to that party. :)
http://lists.freedesktop.org/archives/dri-devel/2015-May/083614.html http://lists.freedesktop.org/archives/dri-devel/2015-July/086451.html
On 21.01.2016 15:38, Michel Dänzer wrote:
On 21.01.2016 14:31, Mario Kleiner wrote:
I spent the last few hours reading through the drm and radeon code and i think what should probably work is to replace the drm_vblank_pre/post_modeset calls in radeon/amdgpu by drm_vblank_off/on calls. These are apparently meant for drivers whose hw counters reset during modeset, [...]
... just like drm_vblank_pre/post_modeset. That those were broken is a regression which needs to be fixed anyway. I don't think switching to drm_vblank_on/off is suitable for stable trees.
Even more so since, as I mentioned, there is (and has been since at least about half a year ago) a counter jumping bug with drm_vblank_on/off as well.
On 01/21/2016 07:38 AM, Michel Dänzer wrote:
On 21.01.2016 14:31, Mario Kleiner wrote:
I spent the last few hours reading through the drm and radeon code and i think what should probably work is to replace the drm_vblank_pre/post_modeset calls in radeon/amdgpu by drm_vblank_off/on calls. These are apparently meant for drivers whose hw counters reset during modeset, [...]
... just like drm_vblank_pre/post_modeset. That those were broken is a regression which needs to be fixed anyway. I don't think switching to drm_vblank_on/off is suitable for stable trees.
Looking at Vlastimil's original post again, I'd say the most likely culprit is 4dfd6486 ("drm: Use vblank timestamps to guesstimate how many vblanks were missed").
Yes, I think reverting that one alone would likely fix it, by reverting to the old vblank update logic.
Once drm_vblank_off is called, drm_vblank_get will no-op and return an error, so clients can't enable vblank irqs during the modeset - pageflip ioctl and waitvblank ioctl would fail while a modeset happens - hopefully userspace handles this correctly everywhere.
We've fixed xf86-video-ati for this.
I'll hack up a patch for demonstration now.
You're a bit late to that party. :)
http://lists.freedesktop.org/archives/dri-devel/2015-May/083614.html http://lists.freedesktop.org/archives/dri-devel/2015-July/086451.html
Oops. Just sent out my little (so far untested) creations. Yes, they are essentially the same as Daniel's patches. The only addition is to also fix that other potential small race I describe, by slightly moving the xxx_pm_compute_clocks() calls around, and a fix for the drm_vblank_get/put imbalance in radeon_pm if vblank_on/off were used.
-mario
dri-devel@lists.freedesktop.org