https://bugs.freedesktop.org/show_bug.cgi?id=68235
Priority: medium
Bug ID: 68235
Assignee: dri-devel@lists.freedesktop.org
Summary: Display freezes after login with kernel 3.11.0-rc5 on Cayman with dpm=1
Severity: normal
Classification: Unclassified
OS: All
Reporter: alexandre.f.demers@gmail.com
Hardware: Other
Status: NEW
Version: XOrg CVS
Component: DRM/Radeon
Product: DRI
I was testing kernel 3.11.0-rc5 and ended up with my display freezing after login (2 tests: one this morning, one tonight). It always freezes when dpm=1, but it doesn't if disabled.
The result on my screen looks like the screenshot posted in bug 66963 (https://bugs.freedesktop.org/attachment.cgi?id=83470).
So I connected through ssh and found some errors in dmesg just after the display froze. I'll be attaching my errors.log file in a moment. You'll see VM is also enabled. I could try without it.
Also, it doesn't freeze with commit 69e0b57, which I've been using and testing regularly for the last week and a half. So I could bisect if we don't have enough info.
--- Comment #1 from Alexandre Demers alexandre.f.demers@gmail.com --- Created attachment 84186 --> https://bugs.freedesktop.org/attachment.cgi?id=84186&action=edit errors when freeze happens
Errors logged from my last two attempts at booting and logging in with kernel 3.11.0-rc5 when dpm=1 with RADEON_va=1
--- Comment #2 from Alexandre Demers alexandre.f.demers@gmail.com --- I began bisecting tonight. rc2 already had this bug. More news to come before the weekend.
--- Comment #3 from Alexandre Demers alexandre.f.demers@gmail.com --- Kernel 3.11.0-rc1 was experiencing a bug, but not the one seen in rc2 and beyond. I'll dig into the "fix" that brought us to the state seen since rc2. If nothing can be found, I'll go up the drm-next branch that was included in rc1.
--- Comment #4 from Alexandre Demers alexandre.f.demers@gmail.com --- After bisect in one direction, I've ended up with the following commit:
f90555cbe629e14c6af1dcec1933a3833ecd321f is the first bad commit
commit f90555cbe629e14c6af1dcec1933a3833ecd321f
Author: Alex Deucher alexander.deucher@amd.com
Date: Wed Jul 17 16:34:12 2013 -0400

    drm/radeon/dpm/atom: fix broken gcc harder

    See bugs:
    https://bugs.freedesktop.org/show_bug.cgi?id=66932
    https://bugs.freedesktop.org/show_bug.cgi?id=66972
    https://bugs.freedesktop.org/show_bug.cgi?id=66945

    Signed-off-by: Alex Deucher alexander.deucher@amd.com

:040000 040000 c32ad9a80c5356236e935eeb5198683727b9d00d eb5aa1083eb33e7b9aebebdb310dda0399152e87 M drivers
Now, I must say this commit actually fixes a visual problem after commit 69e0b57 (which is a good commit over here without any known problem). So, I'll dig in the other direction to find which commit broke the known good state.
--- Comment #5 from Alex Deucher agd5f@yahoo.com --- You might try this branch in case gcc is having problems with the variable sized arrays used in the driver: http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.12-wip-gcc-fixes
--- Comment #6 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #5)
Ok, I'll try it tonight.
About the bisection I'm doing in the other direction (to find what broke the display), I should also be able to narrow it down tonight.
--- Comment #7 from Alexandre Demers alexandre.f.demers@gmail.com --- Hi Alex. I'm about to test your suggestion. Meanwhile, I identified the original commit that broke the driver before it was fixed by f90555cbe629e14c6af1dcec1933a3833ecd321f (although that fix ends with the display hanging, even though I can still connect through ssh).
So the first bad commit was:
7ad8d0687bb5030c3328bc7229a3183ce179ab25 is the first bad commit
commit 7ad8d0687bb5030c3328bc7229a3183ce179ab25
Author: Alex Deucher alexander.deucher@amd.com
Date: Mon Jul 1 16:07:18 2013 -0400

    drm/radeon/dpm: re-enable state transitions for Cayman

    Was disabled due to stability issues on certain boards caused by the a bug in the parsing of the atom mc reg tables. That's fixed now so re-enable.

    Signed-off-by: Alex Deucher alexander.deucher@amd.com

:040000 040000 de8dfc2a15d5114e81636811d7e3b39c15fc515b d0e1ee828f10456d39e2ab30cc6598203e50fa6e M drivers
Heading for your suggestion right away with http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.12-wip-gcc-fixes.
--- Comment #8 from Alexandre Demers alexandre.f.demers@gmail.com --- Tested with http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-3.12-wip-gcc-fixes and it does exactly the same thing: it boots fine and shows the login screen. I can even log in if it doesn't hang right away. Then it will hang at some point (either at the login screen or after loading the desktop). It generally displays grey vertical bars.
--- Comment #9 from Alexandre Demers alexandre.f.demers@gmail.com --- Still the same with kernel 3.11.0. Tried with VM=0, aspm=0, disconnected my UPS (just in case it was something with a "battery" state or something similar), tried Gnome 3 and XFCE, all the same. The only things that work for now are setting dpm=0 or forcing ret=1 in ni_dpm_set_power_state when checking what ni_restrict_performance_levels_before_switch returned.
However, I don't know if the problem is with ni_dpm_set_power_state or with something executed after, so I'll play in there.
--- Comment #10 from Alexandre Demers alexandre.f.demers@gmail.com --- If I force ret=1 just after ni_restrict_performance_levels_before_switch(), ni_dpm_set_power_state() doesn't go any further and there is no hang. So it seems the problem is not with ni_restrict_performance_levels_before_switch() itself but with a combination of some sort.
--- Comment #11 from Alexandre Demers alexandre.f.demers@gmail.com --- So, after getting out at different points from ni_dpm_set_power_state(), it seems I can go down to ni_power_control_set_level() without problem. However, if I move to the next call which is ret = ni_dpm_force_performance_level(rdev, RADEON_DPM_FORCED_LEVEL_AUTO), it hangs.
Could it be because we are setting something wrong in auto performance level? I'll be attaching my vbios just in case.
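The narrowing described in comments #10 and #11 (returning early from ni_dpm_set_power_state() at successive points until the hang appears) can be illustrated outside the kernel. What follows is a minimal user-space sketch in C, not the driver's code: `struct step`, `run_steps_limited()`, and the dummy `step_ok()`/`step_fail()` callbacks are hypothetical names introduced only for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in for one stage of a multi-step state switch. */
typedef int (*step_fn)(void *ctx);

struct step {
    const char *name; /* label to print when reporting where we stopped */
    step_fn fn;       /* returns 0 on success, non-zero on failure */
};

/*
 * Run the steps in order, but bail out once `limit` steps have run.
 * Returns the number of steps executed, or -(index + 1) for the first
 * step whose callback reported a failure.
 */
static int run_steps_limited(const struct step *steps, size_t count,
                             size_t limit, void *ctx)
{
    size_t i;

    for (i = 0; i < count && i < limit; i++) {
        if (steps[i].fn(ctx) != 0)
            return -(int)(i + 1);
    }
    return (int)i;
}

/* Dummy steps for illustration: one counts its calls, one always fails. */
static int step_ok(void *ctx)
{
    int *calls = ctx;

    (*calls)++;
    return 0;
}

static int step_fail(void *ctx)
{
    (void)ctx;
    return -1;
}
```

Raising `limit` by one per boot and noting the first value that triggers the hang points at the offending call, which is how this thread arrives at ni_dpm_force_performance_level(rdev, RADEON_DPM_FORCED_LEVEL_AUTO).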
--- Comment #12 from Alexandre Demers alexandre.f.demers@gmail.com --- Created attachment 85157 --> https://bugs.freedesktop.org/attachment.cgi?id=85157&action=edit Cayman 6950 XFX vbios
--- Comment #13 from Alexandre Demers alexandre.f.demers@gmail.com --- Is there anything else I can do to give a better idea of what is happening and why it crashes?
If this can be of any value, my 6950 is of the following model: XFX HD-695X-ZNDC (1GB DDR5, 830MHz Core Clock and 5200MHz Memory Clock)
--- Comment #14 from Alex Deucher agd5f@yahoo.com --- Created attachment 85578 --> https://bugs.freedesktop.org/attachment.cgi?id=85578&action=edit disable various dpm features
I would suggest disabling various dpm features and see if you can narrow down which, if any, help. This patch disables just about everything.
ni_dpm_force_performance_level(rdev, RADEON_DPM_FORCED_LEVEL_AUTO) is what actually sets the dynamic performance switching into motion. Prior to that the hw is locked into the low performance level. It sounds like there is some bad parameter that is causing a lock up when the smc enables state switching.
Separate from the patch can you also try changing the ni_dpm_force_performance_level(rdev, RADEON_DPM_FORCED_LEVEL_AUTO) call in ni_dpm_set_power_state() to low (RADEON_DPM_FORCED_LEVEL_LOW) or high (RADEON_DPM_FORCED_LEVEL_HIGH) rather than auto? See if you still get a lock up.
--- Comment #15 from Alex Deucher agd5f@yahoo.com --- Another thing worth checking, what is the value of module_index passed to radeon_atom_init_mc_reg_table() in ni_initialize_mc_reg_table() in ni_dpm.c on your system?
--- Comment #16 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #14)
I'll try it later today.
--- Comment #17 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #15)
How can I get it? Should I print it in dmesg?
--- Comment #18 from Alex Deucher agd5f@yahoo.com --- (In reply to comment #17)
yes, that would be great.
--- Comment #19 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #16)
So far I've only had time to try forcing RADEON_DPM_FORCED_LEVEL_LOW and RADEON_DPM_FORCED_LEVEL_HIGH. The first one works fine, the second triggers the problem.
I'm about to play with the suggested patch.
--- Comment #20 from Alexandre Demers alexandre.f.demers@gmail.com --- Ok, if I apply the whole suggested patch but the following, it hangs:
@@ -4152,14 +4152,14 @@ int ni_dpm_init(struct radeon_device *rdev)
         }
         ni_pi->mclk_rtt_mode_threshold = eg_pi->mclk_edc_wr_enable_threshold;

-        pi->voltage_control =
-                radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDC, 0);
+        pi->voltage_control = false;
+//              radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDC, 0);

-        pi->mvdd_control =
-                radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_MVDDC, 0);
+        pi->mvdd_control = false;
+//              radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_MVDDC, 0);

-        eg_pi->vddci_control =
-                radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDCI, 0);
+        eg_pi->vddci_control = false;
+//              radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDCI, 0);

         rv770_get_engine_memory_ss(rdev);
I'll try to play with that a bit and I'll come back. I also still have to give you the module_index.
--- Comment #21 from Alexandre Demers alexandre.f.demers@gmail.com --- Adding printk(KERN_DEBUG "DEBUG: about to pass the following value of module_index to radeon_atom_init_mc_reg_table(): %d", module_index); just before calling radeon_atom_init_mc_reg_table() prints a module_index of 2.
--- Comment #22 from Alex Deucher agd5f@yahoo.com --- (In reply to comment #20)
So does just applying this portion of the patch by itself fix the hang?
--- Comment #23 from Alex Deucher agd5f@yahoo.com --- (In reply to comment #21)
Ok, that looks good.
--- Comment #24 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #22)
Applying just this returns an error when booting: ni_upload_sw_state failed, but obviously the system doesn't hang after that (though it can't change its performance state)
--- Comment #25 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #22)
The only way I avoid "ni_upload_sw_state failed" is by leaving pi->voltage_control = radeon_atom_is_voltage_gpio(rdev, SET_VOLTAGE_TYPE_ASIC_VDDC, 0); in place. However, I then inevitably end up with a hang either at login or when my session is loading (although switching to a terminal before it hangs prevents any hang for as long as I stay in the terminal).
If I patch that part of code, I always have the "ni_upload_sw_state failed" error, thus not hanging but preventing any dpm.
I can patch everything else or nothing at all (I tried different combinations) and they don't seem to change a thing about the hang.
--- Comment #26 from Alex Deucher agd5f@yahoo.com --- Can you attach your dmesg with dpm enabled?
--- Comment #27 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #26)
> Can you attach your dmesg with dpm enabled?
Do you mean with the patch applied (total and/or problematic part left alone)?
--- Comment #28 from Alex Deucher agd5f@yahoo.com --- (In reply to comment #27)
Doesn't matter. I just want to see the basic driver output and power state list.
--- Comment #29 from Alexandre Demers alexandre.f.demers@gmail.com --- Created attachment 85798 --> https://bugs.freedesktop.org/attachment.cgi?id=85798&action=edit dpm=1 with partial patch applied on 3.11.0
dmesg output when dpm=1 with partial patch applied (deactivation of pretty much everything but one to pass ni_upload_sw_state)
--- Comment #30 from Alexandre Demers alexandre.f.demers@gmail.com --- If there were any fixes pushed in kernel 3.12-rc1, none changed anything.
--- Comment #31 from Alex Deucher agd5f@yahoo.com --- Created attachment 85989 --> https://bugs.freedesktop.org/attachment.cgi?id=85989&action=edit testing patch
Try this patch independent from any other patches. It forces the engine and memory clocks of all performance levels within a power state to the lowest level. If it works, then try and comment out either the sclk part or the mclk part and see if either helps. That should help us narrow down whether it's a mclk problem or an sclk problem.
--- Comment #32 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #31)
Running with the patch works fine over a vanilla kernel 3.12-rc1. The following also works fine:
//      if (pl->sclk > 25000)
//              pl->sclk = 25000;
        if (pl->mclk > 15000)
                pl->mclk = 15000;
Which means sclk is working properly.
However, the opposite results in a blank screen before I can even get at the login screen. It seems mclk is the problematic part.
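For readers following along, the experiment from comments #31 and #32 can be sketched as a stand-alone C function. This is only an illustration under stated assumptions, not the attached testing patch: `struct perf_level` is a simplified stand-in for the driver's performance-level data, `clamp_levels()` and the `CLAMP_*` flags are hypothetical names, and the clock values are assumed to be in 10 kHz units (inferred from this thread, where 130000 is discussed alongside 1300MHz). The caps 25000 and 15000 are the ones quoted above.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one performance level; not the driver's struct. */
struct perf_level {
    uint32_t sclk; /* engine clock, assumed 10 kHz units */
    uint32_t mclk; /* memory clock, assumed 10 kHz units */
};

#define CLAMP_SCLK 0x1u /* cap the engine clock */
#define CLAMP_MCLK 0x2u /* cap the memory clock */

/*
 * Cap the selected clocks of every performance level in a power state.
 * `which` chooses whether sclk, mclk, or both are capped, mirroring the
 * "comment out one half" experiment.
 */
static void clamp_levels(struct perf_level *levels, size_t count,
                         unsigned int which,
                         uint32_t sclk_cap, uint32_t mclk_cap)
{
    for (size_t i = 0; i < count; i++) {
        if ((which & CLAMP_SCLK) && levels[i].sclk > sclk_cap)
            levels[i].sclk = sclk_cap;
        if ((which & CLAMP_MCLK) && levels[i].mclk > mclk_cap)
            levels[i].mclk = mclk_cap;
    }
}
```

Passing `CLAMP_MCLK` alone corresponds to the variant that ran fine in comment #32, while `CLAMP_SCLK` alone corresponds to the variant that produced a blank screen.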
--- Comment #33 from Alex Deucher agd5f@yahoo.com --- Created attachment 86111 --> https://bugs.freedesktop.org/attachment.cgi?id=86111&action=edit testing patch - force mclk to high
Try this patch by itself. This patch will force the mclk to the highest for all performance levels. If it works, the issue is probably related to the changing of mclks, if not, then we are probably programming one of the mclk parameters wrong.
Alex Deucher agd5f@yahoo.com changed:
What             |Removed |Added
----------------------------------------------------------------------------
Attachment #86111|0       |1
is obsolete      |        |
--- Comment #34 from Alex Deucher agd5f@yahoo.com --- Created attachment 86112 --> https://bugs.freedesktop.org/attachment.cgi?id=86112&action=edit testing patch - force mclk to high
Sorry, had some garbage in my tree. Use this one instead.
--- Comment #35 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #34)
Tested and the screen ended up blank or frozen somewhere near when Xorg and gdm are being launched (tried twice). Before that, the console was being displayed OK.
--- Comment #36 from Alexandre Demers alexandre.f.demers@gmail.com --- A test of my own:
diff --git a/drivers/gpu/drm/radeon/ni_dpm.c b/drivers/gpu/drm/radeon/ni_dpm.c
index f7b625c..c1875d2 100644
--- a/drivers/gpu/drm/radeon/ni_dpm.c
+++ b/drivers/gpu/drm/radeon/ni_dpm.c
@@ -3952,10 +3952,14 @@ static void ni_parse_pplib_clock_info(struct radeon_device *rdev,
         pl->mclk = le16_to_cpu(clock_info->evergreen.usMemoryClockLow);
         pl->mclk |= clock_info->evergreen.ucMemoryClockHigh << 16;

+        pl->mclk = 100000;
+
         pl->vddc = le16_to_cpu(clock_info->evergreen.usVDDC);
         pl->vddci = le16_to_cpu(clock_info->evergreen.usVDDCI);
         pl->flags = le32_to_cpu(clock_info->evergreen.ulFlags);

+        pl->vddci = 1150;
+
         /* patch up vddc if necessary */
         if (pl->vddc == 0xff01) {
                 if (radeon_atom_get_max_vddc(rdev, 0, 0, &vddc) == 0)
This works. I haven't pushed higher yet.
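The context lines of the diff above show how the memory clock is assembled from the power table: a 16-bit low part plus an 8-bit high part shifted left by 16. A small stand-alone C sketch of that arithmetic follows. `combine_mclk()` and `clk_to_mhz()` are hypothetical helper names, the inputs are assumed to be in host byte order already (the driver uses le16_to_cpu() for that), and the 10 kHz unit is an inference from this thread, where 130000 is discussed alongside 1300MHz.

```c
#include <stdint.h>

/*
 * Combine the 16-bit low part and the 8-bit high part of a clock value,
 * as in: mclk = low; mclk |= high << 16;
 * Both inputs are assumed to be in host byte order.
 */
static uint32_t combine_mclk(uint16_t low, uint8_t high)
{
    return (uint32_t)low | ((uint32_t)high << 16);
}

/* Convert a clock in assumed 10 kHz units to whole MHz. */
static uint32_t clk_to_mhz(uint32_t clk_10khz)
{
    return clk_10khz / 100;
}
```

For example, the forced value 100000 is 0x186A0, so it would be stored as a low part of 0x86A0 and a high part of 0x01, and it converts to 1000 MHz under the unit assumption above.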
--- Comment #37 from Alexandre Demers alexandre.f.demers@gmail.com --- Went to pl->mclk = 115000, runs fine.
--- Comment #38 from Alexandre Demers alexandre.f.demers@gmail.com --- Running with mclk at 120000.
I booted into Windows and launched GPU-Z. We should be able to reach 1300MHz.
I've read that some Cayman cards were made to use a VDDCI between 1.15 and 1.16. I'm pretty sure I can reach stability at 130000 by raising VDDCI a bit.
--- Comment #39 from Alexandre Demers alexandre.f.demers@gmail.com --- Running with mclk at 125000
--- Comment #40 from Alexandre Demers alexandre.f.demers@gmail.com --- Should I continue to see what value I can reach?
--- Comment #41 from Alex Deucher agd5f@yahoo.com --- Created attachment 86147 --> https://bugs.freedesktop.org/attachment.cgi?id=86147&action=edit mclk debugging pll debugging output
Can you attach the dmesg output with this patch applied? I want to make sure the mclk parameters are being properly calculated for the 130000 mclk.
--- Comment #42 from Alexandre Demers alexandre.f.demers@gmail.com --- (In reply to comment #41)
I'll try it at home later today.
--- Comment #43 from Alexandre Demers alexandre.f.demers@gmail.com --- Created attachment 86168 --> https://bugs.freedesktop.org/attachment.cgi?id=86168&action=edit dmesg with 86147
--- Comment #44 from Alex Deucher agd5f@yahoo.com --- Created attachment 86296 --> https://bugs.freedesktop.org/attachment.cgi?id=86296&action=edit patch 1/2
This patch set works around the issue by limiting the sclk and mclk to the highest levels listed in the clk/voltage dependency tables. I'll need to dig a bit more internally to try and figure out how to handle these clks properly.
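As a rough illustration of the workaround described here, the sketch below caps a requested clock at the highest clock listed in a clock/voltage dependency table. It is stand-alone C and not the attached patches: `struct clk_volt_entry`, `max_table_clk()`, and `limit_clk()` are hypothetical names, and leaving the clock untouched when the table is empty is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one clock/voltage dependency table entry. */
struct clk_volt_entry {
    uint32_t clk; /* clock, assumed 10 kHz units */
    uint16_t v;   /* voltage needed for that clock */
};

/* Return the highest clock listed in the table, or 0 if it is empty. */
static uint32_t max_table_clk(const struct clk_volt_entry *table, size_t count)
{
    uint32_t max = 0;

    for (size_t i = 0; i < count; i++) {
        if (table[i].clk > max)
            max = table[i].clk;
    }
    return max;
}

/*
 * Limit a requested clock to the table maximum. With an empty table
 * there is nothing to compare against, so the request is returned as is.
 */
static uint32_t limit_clk(uint32_t requested,
                          const struct clk_volt_entry *table, size_t count)
{
    uint32_t max = max_table_clk(table, count);

    if (max != 0 && requested > max)
        return max;
    return requested;
}
```

With a table whose highest entry is 125000, a request for 130000 comes back as 125000, which is consistent with the memory clock ceiling reported later in this thread.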
--- Comment #45 from Alex Deucher agd5f@yahoo.com --- Created attachment 86297 --> https://bugs.freedesktop.org/attachment.cgi?id=86297&action=edit patch 2/2
Apply these two patches independent of any others.
--- Comment #46 from Alexandre Demers alexandre.f.demers@gmail.com --- It seems to allow the system to work properly. No crash with the patches on 3.11.0 (but another problem with 3.12-rc1, probably a new bug). I added a printk to show what the max values are. Here is what I get:
[ 3.088984] : Hitting max values... max_sclk_vddc->80000, max_mclk_vddci->125000, max_mclk_vddc->125000
So, as it is, I'm unable to run at top speed (mem) if I understand correctly, right?
--- Comment #47 from Alex Deucher agd5f@yahoo.com --- (In reply to comment #46)
Right, it will limit you to the fastest clock in the voltage dependency tables until I sort out how I'm supposed to properly handle faster clocks.
Alexandre Demers alexandre.f.demers@gmail.com changed:
What      |Removed |Added
----------------------------------------------------------------------------
Status    |NEW     |RESOLVED
Resolution|---     |FIXED
--- Comment #48 from Alexandre Demers alexandre.f.demers@gmail.com --- OK, then with the last two patches on top of kernel 3.11.0, it works fine and I'm closing this bug. Should I open a new "bug" for the part about the faster clock and vddci?
--- Comment #49 from Alexandre Demers alexandre.f.demers@gmail.com --- Also, the bug I saw when testing patches with kernel 3.12-rc1 just happened with 3.11.0. The screen turns white and everything is frozen. I can't connect through ssh (without the patches, when the screen hung, I was able to connect through ssh).
I can't find anything in the logs that could help identify what is going on. I wasn't doing anything special, and I can start a game under Steam, where the GPU's fan speeds up (a sign that the card is now running faster), without any problem. The computer can just be sitting idle and then freeze (with a white screen).
I'm tempted to open a different bug. What do you think, Alex?
--- Comment #50 from Alex Deucher agd5f@yahoo.com --- Go ahead and open new bugs for those issues.
Alexandre Demers alexandre.f.demers@gmail.com changed:
What    |Removed |Added
----------------------------------------------------------------------------
See Also|        |https://bugs.freedesktop.org/show_bug.cgi?id=69721
Alexandre Demers alexandre.f.demers@gmail.com changed:
What    |Removed |Added
----------------------------------------------------------------------------
See Also|        |https://bugs.freedesktop.org/show_bug.cgi?id=69723