[Bug 65761] New: HD 7970M Hybrid - hangs and errors and rmmod causes crash - dri-devel - freedesktop.org experimental mailing list

List overview All Threads
Download

newer

[Bug 65761] New: HD 7970M Hybrid - hangs and errors and rmmod causes crash

older

[Bug 73338] Fan speed in idle at...

WARNING: at...

bugzilla-daemon＠bugzilla.kernel.org

25 Nov 2013 25 Nov '13

4:49 p.m.

https://bugzilla.kernel.org/show_bug.cgi?id=65761

Bug ID: 65761 Summary: HD 7970M Hybrid - hangs and errors and rmmod causes crash Product: Drivers Version: 2.5 Kernel Version: 3.13-rc1 Hardware: x86-64 OS: Linux Tree: Mainline Status: NEW Severity: normal Priority: P1 Component: Video(DRI - non Intel) Assignee: drivers_video-dri@kernel-bugs.osdl.org Reporter: haagch.christoph@googlemail.com Regression: No

Created attachment 115961 --> https://bugzilla.kernel.org/attachment.cgi?id=115961&action=edit dmesg with multiple issues

00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09) 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wimbledon XT [Radeon HD 7970M] muxless, so there's no display outputs on the radeon gpu.

With 3.13 and dpm and runtime power management enabled by default this should be getting usable. I did not start X this time. (when starting X on the intel gpu quickly after booting (presumably before radeon is initialized) the display output hangs, which is yet another problem).

Dmesg seems to contain several issues.

1. While booting it already produces these before even "dpm initialized" is logged:

[ 20.120137] WARNING: CPU: 0 PID: 53 at drivers/gpu/drm/drm_mm.c:578 drm_mm_takedown+0x2e/0x30 [drm]() [ 20.120138] Memory manager not clean during takedown.

2. These are just periodically displayed in the tty:

[ 79.179928] [drm:r600_ib_test] *ERROR* radeon: fence wait failed (-35). [ 79.181378] [drm:radeon_ib_ring_tests] *ERROR* radeon: failed testing IB on GFX ring (-35). [ 79.182790] [drm:radeon_resume_kms] *ERROR* ib ring test failed (-35).

3. After some time in a tty, I did "rmmod radeon" and got some other issues ending with

[ 192.663059] Fixing recursive fault but reboot is needed!

and then, after a little while a lot of other errors happen, but they are not captured here in the dmesg.

-- You are receiving this mail because: You are watching the assignee of the bug.

Show replies by date

bugzilla-daemon＠bugzilla.kernel.org

25 Nov 25 Nov

5:15 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

Mike Lothian mike@fireburn.co.uk changed:

What |Removed |Added ---------------------------------------------------------------------------- CC| |mike@fireburn.co.uk

--- Comment #1 from Mike Lothian mike@fireburn.co.uk --- I believe this is related to https://bugs.freedesktop.org/show_bug.cgi?id=71930

I think the memory Warnings have been fixed by either:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drive...

As a workaround you can boot with radeon.runpm=0 but that'll leave your discreet card powered up

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

26 Nov 26 Nov

8:03 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

Alex Deucher alexdeucher@gmail.com changed:

--- Comment #2 from Alex Deucher alexdeucher@gmail.com ---

...

From the bisect in https://bugs.freedesktop.org/show_bug.cgi?id=71930 it looks

like commit bbd34fcdd1b201e996235731a7c98fd5197d9e51 caused a regression in vgaswitcheroo and hence runpm. Adding Rafael. Also bug 65811 looks like a duplicate of this one.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

30 Nov 30 Nov

12:12 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #3 from Christoph Haag haagch.christoph@googlemail.com --- Created attachment 116871 --> https://bugzilla.kernel.org/attachment.cgi?id=116871&action=edit dmesg from 3.13-rc2 with rmmod radeon in line 2095

Having enough time to test everything would be nice.

But in the meantime I have only tested 3.13-rc2. With defaults the issues are still there (I have not yet checked whether mentioned commits made it into rc2).

Now I have some more messsages in dmesg after rmmod'ing radeon. This time it didn't hang after all that, but rebooting didn't work for some reason. Maybe it's just me but I'd like to have proper unloading to have the option to switch between radeon and fglrx without rebooting. :)

Anyway with runpm=0 I don't get any errors so dpm does really work fine. On an unrelated note: WOW, has radeonsi improved in performance! Much better than fglrx with "official" hybrid support in some applications at least.

Unfortunately using runpm=0 and using vgaswitcheroo manually is not a viable workaround because of https://bugzilla.kernel.org/show_bug.cgi?id=51381

But with runpm=0 there still is a similar (?) problem when rmmod'ing radeon:

[ 1897.087151] [drm] radeon: finishing device. [ 1897.518615] ------------[ cut here ]------------ [ 1897.518634] WARNING: CPU: 4 PID: 2539 at drivers/gpu/drm/drm_mm.c:578 drm_mm_takedown+0x2e/0x30 [drm]() [ 1897.518635] Memory manager not clean during takedown. [ 1897.518636] Modules linked in: bnep bluetooth iTCO_wdt iTCO_vendor_support arc4 iwldvm snd_hda_codec_hdmi mac80211 snd_hda_codec_realtek joydev x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crct10dif_common crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_intel microcode snd_hda_codec iwlwifi snd_hwdep psmouse cfg80211 snd_pcm serio_raw pcspkr snd_page_alloc radeon(-) rtsx_pci_ms r8169 snd_timer memstick rfkill lpc_ich mii snd i2c_i801 soundcore ttm wmi thermal mei_me mei shpchp processor battery ac evdev nfs lockd sunrpc fscache fuse ext4 crc16 mbcache jbd2 sr_mod cdrom sd_mod hid_generic usbhid hid rtsx_pci_sdmmc mmc_core ahci i915 libahci crc32c_intel i2c_algo_bit libata intel_agp firewire_ohci intel_gtt [ 1897.518659] firewire_core crc_itu_t drm_kms_helper ehci_pci ehci_hcd xhci_hcd scsi_mod rtsx_pci drm usbcore i2c_core usb_common video button [ 1897.518665] CPU: 4 PID: 2539 Comm: rmmod Not tainted 3.13.0-1-mainline #1 [ 1897.518666] Hardware name: CLEVO P170EM/P170EM, BIOS 4.6.5 08/22/2012 [ 1897.518667] 0000000000000009 ffff8807d4b47c20 ffffffff814f5570 ffff8807d4b47c68 [ 1897.518669] ffff8807d4b47c58 ffffffff81061bad ffff8807fef616c0 ffff8807fef61768 [ 1897.518671] ffff8808029a47b0 0000000000000000 0000000001ce5090 ffff8807d4b47cb8 [ 1897.518673] Call Trace: [ 1897.518677] [<ffffffff814f5570>] dump_stack+0x4d/0x6f [ 1897.518681] [<ffffffff81061bad>] warn_slowpath_common+0x7d/0xa0 [ 1897.518683] [<ffffffff81061c1c>] warn_slowpath_fmt+0x4c/0x50 [ 1897.518688] [<ffffffffa05644b8>] ? ttm_bo_man_takedown+0x48/0x70 [ttm] [ 1897.518693] [<ffffffffa00c2f6e>] drm_mm_takedown+0x2e/0x30 [drm] [ 1897.518696] [<ffffffffa05644a8>] ttm_bo_man_takedown+0x38/0x70 [ttm] [ 1897.518699] [<ffffffffa055ebf9>] ttm_bo_clean_mm+0x49/0x80 [ttm] [ 1897.518709] [<ffffffffa05eebcd>] radeon_ttm_fini+0xbd/0x190 [radeon] [ 1897.518716] [<ffffffffa05ef662>] radeon_bo_fini+0x12/0x20 [radeon] [ 1897.518727] [<ffffffffa06570c1>] si_fini+0xc1/0x100 [radeon] [ 1897.518733] [<ffffffffa05d404e>] radeon_device_fini+0x3e/0x120 [radeon] [ 1897.518739] [<ffffffffa05d5f7e>] radeon_driver_unload_kms+0x4e/0x70 [radeon] [ 1897.518744] [<ffffffffa00bed4c>] drm_dev_unregister+0x2c/0xe0 [drm] [ 1897.518749] [<ffffffffa00bee8b>] drm_put_dev+0x3b/0x70 [drm] [ 1897.518754] [<ffffffffa05d229d>] radeon_pci_remove+0x1d/0x20 [radeon] [ 1897.518756] [<ffffffff812c66cb>] pci_device_remove+0x3b/0xb0 [ 1897.518759] [<ffffffff8138caaf>] __device_release_driver+0x7f/0xf0 [ 1897.518761] [<ffffffff8138d438>] driver_detach+0xb8/0xc0 [ 1897.518763] [<ffffffff8138c6e5>] bus_remove_driver+0x55/0xd0 [ 1897.518765] [<ffffffff8138daac>] driver_unregister+0x2c/0x50 [ 1897.518767] [<ffffffff812c60a9>] pci_unregister_driver+0x29/0x90 [ 1897.518772] [<ffffffffa00c10e8>] drm_pci_exit+0x98/0xa0 [drm] [ 1897.518778] [<ffffffffa06a35af>] radeon_exit+0x17/0x1e [radeon] [ 1897.518780] [<ffffffff810d6682>] SyS_delete_module+0x172/0x240 [ 1897.518783] [<ffffffff81014e5c>] ? do_notify_resume+0x8c/0xa0 [ 1897.518785] [<ffffffff815039ed>] system_call_fastpath+0x1a/0x1f [ 1897.518786] ---[ end trace ba1fe37dd4719714 ]--- [ 1897.518790] [TTM] Finalizing pool allocator [ 1897.518793] [TTM] Finalizing DMA pool allocator [ 1897.518797] ------------[ cut here ]------------ [ 1897.518799] WARNING: CPU: 4 PID: 2539 at drivers/gpu/drm/ttm/ttm_page_alloc_dma.c:534 ttm_dma_free_pool+0x12b/0x130 [ttm]() [ 1897.518800] Modules linked in: bnep bluetooth iTCO_wdt iTCO_vendor_support arc4 iwldvm snd_hda_codec_hdmi mac80211 snd_hda_codec_realtek joydev x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crct10dif_common crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_intel microcode snd_hda_codec iwlwifi snd_hwdep psmouse cfg80211 snd_pcm serio_raw pcspkr snd_page_alloc radeon(-) rtsx_pci_ms r8169 snd_timer memstick rfkill lpc_ich mii snd i2c_i801 soundcore ttm wmi thermal mei_me mei shpchp processor battery ac evdev nfs lockd sunrpc fscache fuse ext4 crc16 mbcache jbd2 sr_mod cdrom sd_mod hid_generic usbhid hid rtsx_pci_sdmmc mmc_core ahci i915 libahci crc32c_intel i2c_algo_bit libata intel_agp firewire_ohci intel_gtt [ 1897.518829] firewire_core crc_itu_t drm_kms_helper ehci_pci ehci_hcd xhci_hcd scsi_mod rtsx_pci drm usbcore i2c_core usb_common video button [ 1897.518843] CPU: 4 PID: 2539 Comm: rmmod Tainted: G W 3.13.0-1-mainline #1 [ 1897.518844] Hardware name: CLEVO P170EM/P170EM, BIOS 4.6.5 08/22/2012 [ 1897.518844] 0000000000000009 ffff8807d4b47c18 ffffffff814f5570 0000000000000000 [ 1897.518846] ffff8807d4b47c50 ffffffff81061bad ffff880801a31e40 ffff8808032a2290 [ 1897.518848] 0000000000000008 0000000000000000 0000000001ce5090 ffff8807d4b47c60 [ 1897.518850] Call Trace: [ 1897.518852] [<ffffffff814f5570>] dump_stack+0x4d/0x6f [ 1897.518854] [<ffffffff81061bad>] warn_slowpath_common+0x7d/0xa0 [ 1897.518856] [<ffffffff81061c8a>] warn_slowpath_null+0x1a/0x20 [ 1897.518861] [<ffffffffa0564c9b>] ttm_dma_free_pool+0x12b/0x130 [ttm] [ 1897.518863] [<ffffffffa0565dee>] ttm_dma_page_alloc_fini+0x8e/0x104 [ttm] [ 1897.518867] [<ffffffffa055b549>] ttm_mem_global_release+0x19/0x90 [ttm] [ 1897.518875] [<ffffffffa05ed892>] radeon_ttm_mem_global_release+0x12/0x20 [radeon] [ 1897.518879] [<ffffffffa00d06c3>] drm_global_item_unref+0x53/0x80 [drm] [ 1897.518887] [<ffffffffa05eebfe>] radeon_ttm_fini+0xee/0x190 [radeon] [ 1897.518894] [<ffffffffa05ef662>] radeon_bo_fini+0x12/0x20 [radeon] [ 1897.518904] [<ffffffffa06570c1>] si_fini+0xc1/0x100 [radeon] [ 1897.518910] [<ffffffffa05d404e>] radeon_device_fini+0x3e/0x120 [radeon] [ 1897.518917] [<ffffffffa05d5f7e>] radeon_driver_unload_kms+0x4e/0x70 [radeon] [ 1897.518922] [<ffffffffa00bed4c>] drm_dev_unregister+0x2c/0xe0 [drm] [ 1897.518926] [<ffffffffa00bee8b>] drm_put_dev+0x3b/0x70 [drm] [ 1897.518932] [<ffffffffa05d229d>] radeon_pci_remove+0x1d/0x20 [radeon] [ 1897.518934] [<ffffffff812c66cb>] pci_device_remove+0x3b/0xb0 [ 1897.518936] [<ffffffff8138caaf>] __device_release_driver+0x7f/0xf0 [ 1897.518938] [<ffffffff8138d438>] driver_detach+0xb8/0xc0 [ 1897.518940] [<ffffffff8138c6e5>] bus_remove_driver+0x55/0xd0 [ 1897.518942] [<ffffffff8138daac>] driver_unregister+0x2c/0x50 [ 1897.518945] [<ffffffff812c60a9>] pci_unregister_driver+0x29/0x90 [ 1897.518951] [<ffffffffa00c10e8>] drm_pci_exit+0x98/0xa0 [drm] [ 1897.518956] [<ffffffffa06a35af>] radeon_exit+0x17/0x1e [radeon] [ 1897.518958] [<ffffffff810d6682>] SyS_delete_module+0x172/0x240 [ 1897.518960] [<ffffffff81014e5c>] ? do_notify_resume+0x8c/0xa0 [ 1897.518962] [<ffffffff815039ed>] system_call_fastpath+0x1a/0x1f [ 1897.518963] ---[ end trace ba1fe37dd4719715 ]--- [ 1897.519005] [TTM] Zone kernel: Used memory at exit: 13 kiB [ 1897.519007] [TTM] Zone dma32: Used memory at exit: 9 kiB [ 1897.519009] [drm] radeon: ttm finalized [ 1897.519012] vga_switcheroo: disabled [ 1897.519212] [drm] Module unloaded

Fortunately all the other problems are not happening.

Another problem that also persists is starting X early (kdm with systemd). The display just hangs with the _ cursor in the upper left, no error messages anywhere to be found. Killing X leaves the display hanging there and only starting another X (presumably reinitializing the display output) releases the hanging and goes back to normal. This only happens with radeon enabled with and without runpm but it might rather be a problem in X or maybe even the intel driver. Anyway, not starting X directly at boot but waiting a few seconds before starting it works.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

6 Dec 6 Dec

10:21 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #4 from Christoph Haag haagch.christoph@googlemail.com --- Created attachment 117751 --> https://bugzilla.kernel.org/attachment.cgi?id=117751&action=edit kernel panic on reboot with runpm=1

3.13-rc3, same problems.

Also, here is a crappy picture of a kernel panic that happens when I try to reboot when runpm is enabled.

When I have more time next week I'll figure out how to setup netconsole properly and get a full log.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

10:34 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #5 from Alex Deucher alexdeucher@gmail.com --- Does runpm work on my old runpm branch prior to merging with 3.13?

http://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-3.12-radeon-powero...

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

7 Dec 7 Dec

1:06 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #6 from Christoph Haag haagch.christoph@googlemail.com --- Created attachment 117761 --> https://bugzilla.kernel.org/attachment.cgi?id=117761&action=edit dmesg with drm-fixes-3.12-radeon-poweroff branch

Hm, I think it kind of works with drm-fixes-3.12-radeon-poweroff.

There seems to be no error in dmesg and X works fine.

I have monitored /sys/kernel/debug/vgaswitcheroo/switch a while.

Unfortunately most of the time it says 0:IGD:+:Pwr:0000:00:02.0 1:DIS: :DynPwr:0000:01:00.0

I have seen it say 0:IGD:+:Pwr:0000:00:02.0 1:DIS: :DynOff:0000:01:00.0 But I don't know really know how I got it to switch and whether it was something I did at all.

Then, unprovoked, it switched to on ("DynPwr") again. (There's nothing running on the radeon GPU, only X with xcompmgr on intel and radeon is configured as provideroffloadsink)

I don't know how to coax it to power the gpu off. But at least it CAN work already.

Maybe this is useful, maybe not: With the radeon in idle I have this: $ cat /sys/kernel/debug/dri/1/radeon_pm_info uvd vclk: 0 dclk: 0 power level 0 sclk: 30000 mclk: 15000 vddc: 825 vddci: 850 pcie gen: 3

with glxgears I have this: $ cat /sys/kernel/debug/dri/1/radeon_pm_info uvd vclk: 0 dclk: 0 power level 2 sclk: 85000 mclk: 120000 vddc: 1050 vddci: 975 pcie gen: 3

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

1:17 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #7 from Christoph Haag haagch.christoph@googlemail.com --- Created attachment 117771 --> https://bugzilla.kernel.org/attachment.cgi?id=117771&action=edit Same as "117761: dmesg with drm-fixes-3.12-radeon-poweroff branch", but with rmmod radeon and modprobe radeon

Okay, I exited X and from a tty did rmmod radeon. It produced the same error in dmesg than it does on 3.13, see attached file...

Then with modprobe radeon I got the GPU lockup CP stall for more than 10000msec, but interestingly only once and then it worked.

The radeon gpu was at DynOff then.

As soon as I started X, the gpu was powered on to DynPwr again. Not even configured as provideroffloadsink or so, just the X startup and presumably hardware detection powered it up and it does not go back to DynOff.

But again, knowing that runpm itself works without errors is encouraging.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

24 Dec 24 Dec

3:45 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #8 from Christoph Haag haagch.christoph@googlemail.com --- Created attachment 119501 --> https://bugzilla.kernel.org/attachment.cgi?id=119501&action=edit rc5 with starting X (~line 1811) and trying to render something

rc5 with these patches now: http://lists.freedesktop.org/archives/dri-devel/2013-December/050902.html

I'm not sure what I can do more without getting into trying to debug kernel code of which I know nothing...

Not really a change with rc5, but I think when starting X the lockups go away (not really sure). When trying to render something with DRI_PRIME=1 however, it locks up a bit harder and can't recover apparently because when trying to start another opengl program, it'll just segfault.

rmmod radeon still creates problems that lock up the cpus and prevent proper rebooting etc.

With runpm=0 it's not really good either at the moment. For example the Distance alpha causes a lockup too and after that lockup glxgears says

$ DRI_PRIME=1 glxgears -info radeon: Failed to allocate virtual address for buffer: radeon: size : 4352 bytes radeon: alignment : 4096 bytes radeon: domains : 4 radeon: va : 0x0000000000800000 radeon: Failed to allocate virtual address for buffer: radeon: size : 4352 bytes radeon: alignment : 4096 bytes radeon: domains : 4 radeon: va : 0x0000000000800000 [1] 2915 segmentation fault (core dumped)

I'm cramming way too much in here, but I just want to say that the HD 7970M with the 8970M are the top mobile consumer GPUs AMD currently has and it would be really cool if the developers could buy that hardware (hey, it's christmas!) and fix this mess directly. From what I have seen so far it is really close to working amazingly well. If there's something specific I could test, I'll gladly try. Thanks

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

7:43 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #9 from Alex Deucher alexdeucher@gmail.com --- This looks like a pci hotplug problem, not a radeon issue. See: https://bugs.freedesktop.org/show_bug.cgi?id=70687

"I seem to have discovered the root of the issue.

I've just built 3.13-rc5 kernel which has the dynamic powering of the discrete gpu and all hell broke loose.

I've narrowed the error down to the pci hotplug driver. My machine loads shpchp pci hotplug driver from what I can see in lsmod output. But the trick is, that there is another pci hotplug driver, acpi pci hotplug one, which seems to break all hell loose here. Disabling it seems to fix everything for me, at least on kernel 3.13.

# CONFIG_HOTPLUG_PCI_ACPI is not set

This kernel config option is the culprit for this, and that also can be seen from my backtrace:

[ 22.731998] [<ffffffff81343cb1>] ? acpiphp_check_bridge+0x72/0x88

So the trick behind this is that acpi pci hotplug driver conflicts with shpchp one that my machine uses. And since it is a builtin driver, and can't be built as module it is always loaded. The other possibility is that this machine doesn't support acpi hotplug, but does support shpc pci hotplug. We need a kernel workarround so that acpi pci hotplug is disabled and out of the way when shpc pci hotplug is enabled."

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

7:45 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #10 from Alex Deucher alexdeucher@gmail.com --- See also: https://bugzilla.kernel.org/show_bug.cgi?id=61891

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

9:40 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #11 from Christoph Haag haagch.christoph@googlemail.com --- Created attachment 119551 --> https://bugzilla.kernel.org/attachment.cgi?id=119551&action=edit dmesg with acpiphp.disable=1

A quick search also showed this: https://bugzilla.kernel.org/show_bug.cgi?id=67461

Good news: With acpiphp.disable=1 there are no errors with booting and using X. In a tty the gpu is in DynOff state.

Not so good news: When starting X it switches to DynPwr and after several minutes I don't think it ever powered down again. But when killing X it goes back to DynOff.

Still bad news: rmmod radeon.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

6 Jan 6 Jan

11:57 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #12 from Christoph Haag haagch.christoph@googlemail.com --- Okay, with 3.13 rc7 I can confirm that the acpiphp problem is fixed. Now with everything in the default configuration you can boot and use X without any error.

rmmod radeon still causes nasty crashes.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

9:52 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

Jack q@cyphernaut.org changed:

What |Removed |Added ---------------------------------------------------------------------------- CC| |q@cyphernaut.org

--- Comment #13 from Jack q@cyphernaut.org --- Additionally vgaswitcheroo/switch remains at DynPwr whenever X is running, even when nothing is using DRI_PRIME=1. This might be an issue specifically with this card (I have a 7970M as well), as I've heard reports of users with earlier generation cards (6xxxM) have seen vgaswitcheroo report DynOff when DRI_PRIME=1 is not set by any running application.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

19 Jan 19 Jan

11:22 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #14 from Jack q@cyphernaut.org --- Still having issues in rc8, unfortunately.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

21 Jan 21 Jan

12:15 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

Christoph Haag haagch.christoph@googlemail.com changed:

--- Comment #15 from Christoph Haag haagch.christoph@googlemail.com --- Created attachment 122811 --> https://bugzilla.kernel.org/attachment.cgi?id=122811&action=edit erratic dpm with window manager actions

Okay, so we got 3.13 without crashes with default settings, so far so good, thanks so much. :)

The dynpm power management is wonky. It doesn't really fit in here, but I already dumped a lot of stuff in here anyway and this report has been linked to in relation to me mentioning that the power management isn't working well yet, so here is another (not so) small point.

Here is a screenshot from the tool radeon-profile. I believe over the whole period shown there I have not used the radeon GPU to render anything, but it is configured as offload slave. I also hope that this tool did not cause this behaviour, but from the fan noise I think it's happening without it too.

So there are some random "short" spikes that die off after a while of doing nothing in X. However there are these longer spikes (about 5-7 seconds) when doing some window management actions. An example of that is minimizing/unminizing a window or simply changing the focus from one window to another window via the mouse. I have tested this with kwin with compositing and openbox without compositing and I think it's pretty clear that this actually causes it. Changing tabs in chromium also seems to cause it.

The very quick spikes happened when I additionally ran glxgears with kwin (still everything on the Intel GPU). They died down but appeared sometimes again after doing something with the windows. With openbox on the other hand I don't see such quick spikes with glxgears.

Another funny effect in openbox is that a single click on the "gears" in glxgears doesn't have an effect, but a double click produces one of the longer spikes. If I single click in chromium here in the bugtracker on the bakground or in the textarea it produces a spike too. But simply typing in the textarea does not.

I don't really know what to make of those observations, but I would guess that the radeon gpu doesn't enter the dynoff state while X is running because of something related to this. A short google search didn't show anything new about radeonsi + dynoff, especially nobody who claims it's working, so maybe glamor is worth looking at?

The intel gpu uses SNA and radeon obviously glamor. If the Xorg.0.log or so would be helpful, just ask.

Maybe someone with a pre radeonsi gpu could test whether they have the same issue when using glamor over exa?

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

12:23 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #16 from Alex Deucher alexdeucher@gmail.com --- I'm not familiar with radeon-profile, but if it does some sort of manual polling of the card, it might be waking up the card every time it polls or it might just be reading garbage because the card is powered down. Does it help if you don't run radeon-profile?

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

12:55 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #17 from Christoph Haag haagch.christoph@googlemail.com --- It reads /sys/kernel/debug/dri/0/radeon_pm_info once a second and lm_sensors I think.

Powered down in terms of vgaswitcheroo/dynoff, I don't think so. When X is running I don't read dynoff for the radeon gpu from /sys/kernel/debug/vgaswitcheroo/switch but when X is not running, it powers down.

I checked with radeon-profile because it's actually the fan that is sometimes annoying to me. The fan only powers off for seconds before starting to run again, mostly only up to "level 1" for a while. But sometimes it runs at high speed for seconds or even minutes for seemingly no reason and then powers down again.

When monitoring with sudo sh -c 'while true; do cat /sys/kernel/debug/dri/0/radeon_pm_info; sleep 0.5; done' I see the same thing. When clicking on an unfocussed window and focussing it with that, it goes from

power level 0 sclk: 30000 mclk: 15000 vddc: 825 vddci: 850 pcie gen: 3 uvd vclk: 0 dclk: 0

power level 2 sclk: 85000 mclk: 120000 vddc: 1050 vddci: 975 pcie gen: 3 uvd vclk: 0 dclk: 0

stays there for ~6 seconds and then goes back to power level 0.

If it was garbage it would be pretty consistent garbage with the gpu temperature rising pretty much exactly on every of those spike... If constant polling from /sys/kernel/debug/dri/0/radeon_pm_info or so is causing it, then the long periods of no spikes when actually doing nothing in X confuse me.

Currently it's just this situation that the fan keeps going on and off all the time and I am pretty sure it is on most of the time when I'm doing something, like clicking things and stuff and when I'm exiting X and see dynoff in /sys/kernel/debug/vgaswitcheroo/switch then the fan actually stays off (I think. Going to a tty while X is running in the background does not result in dynoff)

If there is a better way to find out why the fan doesn't shut off, I'd gladly investigate.

The code is here, by the way: https://github.com/marazmista/radeon-profile

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

9:58 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #18 from Jack q@cyphernaut.org ---

...

If there is a better way to find out why the fan doesn't shut off

I'm pretty sure the reason the fans don't shut off is because they're reacting to the dGPU usage. In my incredibly un-scientific observation, I've noticed that the laptop gets warmer to the touch in Linux than it does in Windows. I'll be doing some actual testing on this later tonight, but it seems to me that the actual issue isn't the fans -- it's the fact that the dGPU is revving up and down on a pretty constant basis, and that's driving the fans to spin up and down with the dGPU.

One thing I noticed is that if I don't have a compositor running it only goes up to power level 1, instead of two, when doing things in the OS, ie, browsing in Chrome, etc. I still don't even think it should be anything but 0 unless using DRI_PRIME, but might the compositor be interacting with the dGPU through PRIME, accidentally? I say this without having any real technical knowledge of how PRIME works.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

22 Jan 22 Jan

1:53 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

Hohahiu rakothedin@gmail.com changed:

What |Removed |Added ---------------------------------------------------------------------------- CC| |rakothedin@gmail.com

--- Comment #19 from Hohahiu rakothedin@gmail.com --- I have similar trouble with vgaswitcheroo. My GPUs are Intel hd4000 + Amd mobility radeon 7750M: /sbin/lspci | grep VGA 00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09) 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Chelsea LP [Radeon HD 7730M] (rev ff)

If I don't set anything during the boot my dGPU is powered down successfully: cat /sys/kernel/debug/vgaswitcheroo/switch 0:IGD:+:Pwr:0000:00:02.0 1:DIS: :DynOff:0000:01:00.0

But in the meantime: xrandr --listproviders Providers: number : 1 Provider 0: id: 0x4a cap: 0xb, Source Output, Sink Output, Sink Offload crtcs: 4 outputs: 7 associated providers: 0 name:Intel

So I cannot access my discrete GPU.

If after the boot I use xrandr --setprovideroffloadsink radeon Intel then dGPU never switches off.

Truly speaking runpm never worked for me even with Alex's early patches.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

1:53 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #20 from Hohahiu rakothedin@gmail.com --- Created attachment 122961 --> https://bugzilla.kernel.org/attachment.cgi?id=122961&action=edit dmesg

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

1:53 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #21 from Hohahiu rakothedin@gmail.com --- Created attachment 122971 --> https://bugzilla.kernel.org/attachment.cgi?id=122971&action=edit Xorg.0.log

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

24 Jan 24 Jan

2:20 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #22 from Hohahiu rakothedin@gmail.com --- What does the order of GPUs in /sys/kernel/debug/vgaswitcheroo/switch mean? Sometimes it is like this (dGPU is powered off): 0:IGD:+:Pwr:0000:00:02.0 1:DIS: :DynOff:0000:01:00.0 In other case it is like that (and dGPU is obviously on): 0:DIS: :PwrDyn:0000:01:00.0 1:IGD:+:Pwr:0000:00:02.0

Can it affect runpm?

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

31 Jan 31 Jan

12:10 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #23 from Christoph Haag haagch.christoph@googlemail.com --- Created attachment 123891 --> https://bugzilla.kernel.org/attachment.cgi?id=123891&action=edit quitting steam after some time

So maybe radeon-profile shows garbage, but in the steam client I have found one of the worse offenders. It doesn't happen always, but often, especially after quitting a game I think, but maybe it's just after some time.

When "it" happens, the fan runs one "level" higher than usual and after I noticed it I started radeon-profile and it showed what you see in the attached screenshot.

As soon as I quit steam, the temperature (read by lm_sensors I think) drops quickly, the voltages and frequencies (read from debugfs I think) are (mostly) stopping to being at the top all the time and only exhibit the spikes from my last comment, but most importantly the fan quickly drops to the usual behavior of switching between the lowest level and off instead of running on the "second lowest level" all the time like when steam was running.

I have had a quick look at the xf86-video-ati but I'm not sure what I'm looking for: Is there one single function or so I can set a break point on that would definitely tell me whether the radeon gpu is ordered to wake up/render something? I'd be happy to try it on the kernel side too, but maybe there's a chance to get a backtrace on the X side to see where it is coming from - all given that it actually happens like I naively imagine. But I don't see anything else: According to LIBGL_DEBUG=verbose steam doesn't open the radeonsi driver, but yet sometimes manages to keep the fan of the radeon gpu running on higher rpm. I would be happy to also test patches against whatever that could add debug logging to point out what the radeon gpu is doing and why.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

2:14 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #24 from Alex Deucher alexdeucher@gmail.com --- The kernel runtime pm infrastructure is used. See radeon_drv.c in the kernel. The driver registers a struct dev_pm_ops structure with the pm core. The pm core will then suspend and resume the device on demand.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

8 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #25 from Christoph Haag haagch.christoph@googlemail.com --- Created attachment 124041 --> https://bugzilla.kernel.org/attachment.cgi?id=124041&action=edit just some printk to easily see what's going on

Well, I don't really know how it is supposed to work. I figured the easiest thing I could do right now was adding some printk() to the code to have an overview of what is happening in dmesg -H | grep radeonpm.

After booting up with no X I see this:

[ +0,000016] radeonpm: runtime_idle [ +4,000483] radeonpm: runtime_suspend

After starting X this is added:

[Jan31 19:04] radeonpm: runtime_resume [ +1,126688] radeonpm: runtime_idle [ +0,000203] radeonpm: runtime_idle [ +0,000400] radeonpm: runtime_idle [ +0,000232] radeonpm: runtime_idle

But then it never does anything again.

After this, I did the usual xrandr --setprovideroffloadsink radeon Intel and started and exited DRI_PRIME=1 glxgears and stuff, but nothing with radeonpm was printed to the log again.

After a while I tried exiting X and sure enough, another line:

[ +2,368895] radeonpm: runtime_suspend

And after starting X for the second time:

[ +19,119650] radeonpm: runtime_resume [ +1,131037] radeonpm: runtime_idle

And then... nothing with radeonpm again. I have read a bit in the Documentation/power/runtime_pm.txt and maybe I did not catch it but I haven't really seen when you call pm_runtime_put_autosuspend whether it is supposed to eventually call back radeon_pmops_runtime_suspend but I would think that's what it does, right? So that just never happens until X is quit.

After boot with no X running I see this with for i in /sys/class/drm/card0/device/power/*; do echo "$i: $(cat $i)"; done

/sys/class/drm/card0/device/power/async: enabled /sys/class/drm/card0/device/power/autosuspend_delay_ms: 5000 /sys/class/drm/card0/device/power/control: auto /sys/class/drm/card0/device/power/runtime_active_kids: 0 /sys/class/drm/card0/device/power/runtime_active_time: 13260 /sys/class/drm/card0/device/power/runtime_enabled: enabled /sys/class/drm/card0/device/power/runtime_status: suspended /sys/class/drm/card0/device/power/runtime_suspended_time: 218426 /sys/class/drm/card0/device/power/runtime_usage: 0 [snipped some wakeup* stuff]

The GPU is off, fan is not running, etc.

After starting X I see this:

/sys/class/drm/card0/device/power/async: enabled /sys/class/drm/card0/device/power/autosuspend_delay_ms: 5000 /sys/class/drm/card0/device/power/control: auto /sys/class/drm/card0/device/power/runtime_active_kids: 0 /sys/class/drm/card0/device/power/runtime_active_time: 185230 /sys/class/drm/card0/device/power/runtime_enabled: enabled /sys/class/drm/card0/device/power/runtime_status: active /sys/class/drm/card0/device/power/runtime_suspended_time: 226856 /sys/class/drm/card0/device/power/runtime_usage: 0

runtime_status is always active in X...

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

8:46 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #26 from Alex Deucher alexdeucher@gmail.com --- Can you see how far it's getting in runtime_idle()? Does it get all the way to the end or does it think a crtc is active?

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

9:24 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #27 from Christoph Haag haagch.christoph@googlemail.com --- It does get to the end of radeon_pmops_runtime_idle every time (checked right before the return) and crtc->enabled is never true.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

11 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #28 from Alex Deucher alexdeucher@gmail.com --- Created attachment 124051 --> https://bugzilla.kernel.org/attachment.cgi?id=124051&action=edit return IRQ_NONE if we don't have any interrupts

Does the attached patch help? Dave suggested it might be interrupts that causing the problem.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

1 Feb 1 Feb

12:08 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #29 from Christoph Haag haagch.christoph@googlemail.com --- No, it does not seem to help.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

2 Feb 2 Feb

11:58 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #30 from Christoph Haag haagch.christoph@googlemail.com --- Created attachment 124221 --> https://bugzilla.kernel.org/attachment.cgi?id=124221&action=edit sysprof with 2 xterm windows and kwin

I still kind of want to try the opposite direction, to find out what X does that keeps the gpu awake so I'm trying to get a good dynamic call graph. The first thing that kind of worked is sysprof...

So I started recording this after starting X and two xterms and kwin on the intel gpu and just clicked a bit between the two because it seems this triggers whatever happens. The radeon gpu was not even configured as offload slave, just default X configuration.

If I saw that correctly none of the applications called functions from the radeon driver which is probably good, but in X you can see this (more detail is available in the file):

[/usr/bin/X] 0,00% 65,99% In file [heap] 0,00% 29,73% ioctl 0,18% 16,17% - - kernel - - 0,00% 15,99% system_call_fastpath 0,00% 15,99% sys_ioctl 0,04% 15,99% do_vfs_ioctl 0,00% 15,63% radeon_drm_ioctl 0,11% 15,60% drm_ioctl 0,43% 14,63% __pm_runtime_suspend 0,00% 0,57% __pm_runtime_resume 0,04% 0,14% copy_user_enhanced_fast_string 0,07% 0,07% irq_exit 0,00% 0,04% _copy_from_user 0,04% 0,04% drm_ioctl 0,04% 0,04% fget_light 0,25% 0,25% radeon_drm_ioctl 0,04% 0,04% fput 0,04% 0,04%

I have only this GPU so I don't have any comparison to one where runpm works correctly, whether it should be this active... Maybe it's interesting that stuff like __pm_runtime_suspend and __pm_runtime_resume is called?

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

6 Feb 6 Feb

1:48 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #31 from Christoph Haag haagch.christoph@googlemail.com --- Created attachment 124781 --> https://bugzilla.kernel.org/attachment.cgi?id=124781&action=edit callgrind output from X

Not much news: I tried to run X with valgrind --tool=callgrind to try to find out about what it does that results in the radeon gpu clocking higher and going to higher voltage when clicking around with windows like minimizing/restoring with kwin & plasma.

Unfortunately it does not happen when X runs in valgrind, the radeon gpu always keeps at the lowest frequency and voltage level. But the runpm status is still "active" all the time with or without valgrind.

With the exact same xorg and xf86* builds the clocking higher and raising the voltages happens when it runs not in valgrind...

I still post the callgrind log, even if it did not clock higher, because in the Callgraph Blockhandler() calls intel sna stuff but it also calls a lot of radeon stuff.

Whether this is something that is caused by valgrind or something that should happen with a radeon gpu that is not configured to be used for anything, I don't know.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

7 Feb 7 Feb

12:49 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #32 from Hohahiu rakothedin@gmail.com --- I have the same messages in dmesg: http://pastebin.com/vAsRRVW8

It seems like radeon driver tries to set clock and it couldn't do it. Therefore it tries again and again and GPU does not turn off.

Christoph, does your dmesg have the same messages?

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

2:52 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #33 from Christoph Haag haagch.christoph@googlemail.com --- Created attachment 125111 --> https://bugzilla.kernel.org/attachment.cgi?id=125111&action=edit possibly call chain that calls radeon stuff

@ Hohahiu

Do you mean those? [ 8077.648324] [drm:si_dpm_set_power_state] *ERROR* si_set_sw_state failed

No, I don't see those at all.

By the way, I'm using linux 3.14-rc1 now with this patch: https://bugzilla.kernel.org/attachment.cgi?id=124621&action=diff

But I have an awful lot of this block in dmesg, I guess from every time starting X:

[Feb 6 13:39] [drm] Disabling audio 0 support [ +0,000004] [drm] Disabling audio 1 support [ +0,000002] [drm] Disabling audio 2 support [ +0,000001] [drm] Disabling audio 3 support [ +0,000001] [drm] Disabling audio 4 support [ +0,000002] [drm] Disabling audio 5 support [ +1,257992] [drm] probing gen 2 caps for device 8086:151 = 261ad01/e [ +0,000005] [drm] PCIE gen 3 link speeds already enabled [ +0,004127] [drm] PCIE GART of 1024M enabled (table at 0x0000000000478000). [ +0,000102] radeon 0000:01:00.0: WB enabled [ +0,000004] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff8807fd9f8c00 [ +0,000003] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff8807fd9f8c04 [ +0,000002] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff8807fd9f8c08 [ +0,000002] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff8807fd9f8c0c [ +0,000002] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff8807fd9f8c10 [ +0,000390] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90008b35a18 [ +0,144839] [drm] ring test on 0 succeeded in 2 usecs [ +0,000004] [drm] ring test on 1 succeeded in 1 usecs [ +0,000003] [drm] ring test on 2 succeeded in 1 usecs [ +0,000059] [drm] ring test on 3 succeeded in 2 usecs [ +0,000007] [drm] ring test on 4 succeeded in 1 usecs [ +0,187260] [drm] ring test on 5 succeeded in 2 usecs [ +0,000004] [drm] UVD initialized successfully. [ +0,001745] [drm] Enabling audio 0 support [ +0,000001] [drm] Enabling audio 1 support [ +0,000000] [drm] Enabling audio 2 support [ +0,000001] [drm] Enabling audio 3 support [ +0,000001] [drm] Enabling audio 4 support [ +0,000001] [drm] Enabling audio 5 support [ +0,000035] [drm] ib test on ring 0 succeeded in 0 usecs [ +0,000029] [drm] ib test on ring 1 succeeded in 0 usecs [ +0,000029] [drm] ib test on ring 2 succeeded in 0 usecs [ +0,000019] [drm] ib test on ring 3 succeeded in 0 usecs [ +0,000018] [drm] ib test on ring 4 succeeded in 0 usecs [ +0,158619] [drm] ib test on ring 5 succeeded

But I wouldn't really thing that it's a problem. Just a bit bloat in the log.

Anyway, I have stared a bit longer at the callgraph and what I have attached seems suspicious I think. There's a whole lot of calls throughout the radeon driver but the kcachegrind gui is quite limited and the graph is very convoluted at this point so it's not clear to me whether this is during normal operation. I'm still kind of searching for a tool that could create a call graph and annotate it over time... Where this is coming from is in dix/dixutils.c line 719 ("(*(cbr->proc)) (pcbl, cbr->data, call_data);") at least but to my eyes it looks like it could do anything and has a whole lot of callers that valgrind caught:

WriteToClient <cycle 19> (Xorg: io.c, ...) FlushAllOutput (Xorg: io.c, ...) XaceHook <cycle 19> (Xorg: xace.c, ...) XaceHookPropertyAccess (Xorg: xace.c, ...) XaceHookDispatch (Xorg: xace.c, ...) CloseDownConnection (Xorg: connection.c, ...) DeleteClientFromAnySelections (Xorg: selection.c, ...) CloseDownClient (Xorg: dispatch.c, ...) ProcSetSelectionOwner (Xorg: selection.c, ...) SendConnSetup (Xorg: dispatch.c, ...) NextAvailableClient (Xorg: dispatch.c, ...)

But maybe I'm just going in a totally wrong direction...

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

10:39 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #34 from Hohahiu rakothedin@gmail.com --- (In reply to Christoph Haag from comment #33)

...

Created attachment 125111 [details] possibly call chain that calls radeon stuff

@ Hohahiu

Do you mean those? [ 8077.648324] [drm:si_dpm_set_power_state] *ERROR* si_set_sw_state failed

No, I don't see those at all.

By the way, I'm using linux 3.14-rc1 now with this patch: https://bugzilla.kernel.org/attachment.cgi?id=124621&action=diff

But I have an awful lot of this block in dmesg, I guess from every time starting X:

[Feb 6 13:39] [drm] Disabling audio 0 support [ +0,000004] [drm] Disabling audio 1 support [ +0,000002] [drm] Disabling audio 2 support [ +0,000001] [drm] Disabling audio 3 support [ +0,000001] [drm] Disabling audio 4 support [ +0,000002] [drm] Disabling audio 5 support [ +1,257992] [drm] probing gen 2 caps for device 8086:151 = 261ad01/e [ +0,000005] [drm] PCIE gen 3 link speeds already enabled [ +0,004127] [drm] PCIE GART of 1024M enabled (table at 0x0000000000478000). [ +0,000102] radeon 0000:01:00.0: WB enabled [ +0,000004] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff8807fd9f8c00 [ +0,000003] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff8807fd9f8c04 [ +0,000002] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff8807fd9f8c08 [ +0,000002] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff8807fd9f8c0c [ +0,000002] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff8807fd9f8c10 [ +0,000390] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90008b35a18 [ +0,144839] [drm] ring test on 0 succeeded in 2 usecs [ +0,000004] [drm] ring test on 1 succeeded in 1 usecs [ +0,000003] [drm] ring test on 2 succeeded in 1 usecs [ +0,000059] [drm] ring test on 3 succeeded in 2 usecs [ +0,000007] [drm] ring test on 4 succeeded in 1 usecs [ +0,187260] [drm] ring test on 5 succeeded in 2 usecs [ +0,000004] [drm] UVD initialized successfully. [ +0,001745] [drm] Enabling audio 0 support [ +0,000001] [drm] Enabling audio 1 support [ +0,000000] [drm] Enabling audio 2 support [ +0,000001] [drm] Enabling audio 3 support [ +0,000001] [drm] Enabling audio 4 support [ +0,000001] [drm] Enabling audio 5 support [ +0,000035] [drm] ib test on ring 0 succeeded in 0 usecs [ +0,000029] [drm] ib test on ring 1 succeeded in 0 usecs [ +0,000029] [drm] ib test on ring 2 succeeded in 0 usecs [ +0,000019] [drm] ib test on ring 3 succeeded in 0 usecs [ +0,000018] [drm] ib test on ring 4 succeeded in 0 usecs [ +0,158619] [drm] ib test on ring 5 succeeded

But I wouldn't really thing that it's a problem. Just a bit bloat in the log.

Anyway, I have stared a bit longer at the callgraph and what I have attached seems suspicious I think. There's a whole lot of calls throughout the radeon driver but the kcachegrind gui is quite limited and the graph is very convoluted at this point so it's not clear to me whether this is during normal operation. I'm still kind of searching for a tool that could create a call graph and annotate it over time... Where this is coming from is in dix/dixutils.c line 719 ("(*(cbr->proc)) (pcbl, cbr->data, call_data);") at least but to my eyes it looks like it could do anything and has a whole lot of callers that valgrind caught:

WriteToClient <cycle 19> (Xorg: io.c, ...) FlushAllOutput (Xorg: io.c, ...) XaceHook <cycle 19> (Xorg: xace.c, ...) XaceHookPropertyAccess (Xorg: xace.c, ...) XaceHookDispatch (Xorg: xace.c, ...) CloseDownConnection (Xorg: connection.c, ...) DeleteClientFromAnySelections (Xorg: selection.c, ...) CloseDownClient (Xorg: dispatch.c, ...) ProcSetSelectionOwner (Xorg: selection.c, ...) SendConnSetup (Xorg: dispatch.c, ...) NextAvailableClient (Xorg: dispatch.c, ...)

But maybe I'm just going in a totally wrong direction...

I meant these repeated messages not the error.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

10:44 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #35 from Alex Deucher alexdeucher@gmail.com --- (In reply to Christoph Haag from comment #33)

...

But I have an awful lot of this block in dmesg, I guess from every time starting X:

Those are the messages from the driver each time the card is powered off or on.

...

[Feb 6 13:39] [drm] Disabling audio 0 support [ +0,000004] [drm] Disabling audio 1 support [ +0,000002] [drm] Disabling audio 2 support [ +0,000001] [drm] Disabling audio 3 support [ +0,000001] [drm] Disabling audio 4 support [ +0,000002] [drm] Disabling audio 5 support

Card is powered off.

...

[ +1,257992] [drm] probing gen 2 caps for device 8086:151 = 261ad01/e [ +0,000005] [drm] PCIE gen 3 link speeds already enabled [ +0,004127] [drm] PCIE GART of 1024M enabled (table at 0x0000000000478000). [ +0,000102] radeon 0000:01:00.0: WB enabled [ +0,000004] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff8807fd9f8c00 [ +0,000003] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff8807fd9f8c04 [ +0,000002] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff8807fd9f8c08 [ +0,000002] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff8807fd9f8c0c [ +0,000002] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff8807fd9f8c10 [ +0,000390] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90008b35a18 [ +0,144839] [drm] ring test on 0 succeeded in 2 usecs [ +0,000004] [drm] ring test on 1 succeeded in 1 usecs [ +0,000003] [drm] ring test on 2 succeeded in 1 usecs [ +0,000059] [drm] ring test on 3 succeeded in 2 usecs [ +0,000007] [drm] ring test on 4 succeeded in 1 usecs [ +0,187260] [drm] ring test on 5 succeeded in 2 usecs [ +0,000004] [drm] UVD initialized successfully. [ +0,001745] [drm] Enabling audio 0 support [ +0,000001] [drm] Enabling audio 1 support [ +0,000000] [drm] Enabling audio 2 support [ +0,000001] [drm] Enabling audio 3 support [ +0,000001] [drm] Enabling audio 4 support [ +0,000001] [drm] Enabling audio 5 support [ +0,000035] [drm] ib test on ring 0 succeeded in 0 usecs [ +0,000029] [drm] ib test on ring 1 succeeded in 0 usecs [ +0,000029] [drm] ib test on ring 2 succeeded in 0 usecs [ +0,000019] [drm] ib test on ring 3 succeeded in 0 usecs [ +0,000018] [drm] ib test on ring 4 succeeded in 0 usecs [ +0,158619] [drm] ib test on ring 5 succeeded

Card is powered back on.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

10:47 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #36 from Hohahiu rakothedin@gmail.com --- (In reply to Alex Deucher from comment #35)

...

(In reply to Christoph Haag from comment #33)

...
But I have an awful lot of this block in dmesg, I guess from every time starting X:

Those are the messages from the driver each time the card is powered off or on.

...
[Feb 6 13:39] [drm] Disabling audio 0 support [ +0,000004] [drm] Disabling audio 1 support [ +0,000002] [drm] Disabling audio 2 support [ +0,000001] [drm] Disabling audio 3 support [ +0,000001] [drm] Disabling audio 4 support [ +0,000002] [drm] Disabling audio 5 support

Card is powered off.

...
[ +1,257992] [drm] probing gen 2 caps for device 8086:151 = 261ad01/e [ +0,000005] [drm] PCIE gen 3 link speeds already enabled [ +0,004127] [drm] PCIE GART of 1024M enabled (table at 0x0000000000478000). [ +0,000102] radeon 0000:01:00.0: WB enabled [ +0,000004] radeon 0000:01:00.0: fence driver on ring 0 use gpu addr 0x0000000080000c00 and cpu addr 0xffff8807fd9f8c00 [ +0,000003] radeon 0000:01:00.0: fence driver on ring 1 use gpu addr 0x0000000080000c04 and cpu addr 0xffff8807fd9f8c04 [ +0,000002] radeon 0000:01:00.0: fence driver on ring 2 use gpu addr 0x0000000080000c08 and cpu addr 0xffff8807fd9f8c08 [ +0,000002] radeon 0000:01:00.0: fence driver on ring 3 use gpu addr 0x0000000080000c0c and cpu addr 0xffff8807fd9f8c0c [ +0,000002] radeon 0000:01:00.0: fence driver on ring 4 use gpu addr 0x0000000080000c10 and cpu addr 0xffff8807fd9f8c10 [ +0,000390] radeon 0000:01:00.0: fence driver on ring 5 use gpu addr 0x0000000000075a18 and cpu addr 0xffffc90008b35a18 [ +0,144839] [drm] ring test on 0 succeeded in 2 usecs [ +0,000004] [drm] ring test on 1 succeeded in 1 usecs [ +0,000003] [drm] ring test on 2 succeeded in 1 usecs [ +0,000059] [drm] ring test on 3 succeeded in 2 usecs [ +0,000007] [drm] ring test on 4 succeeded in 1 usecs [ +0,187260] [drm] ring test on 5 succeeded in 2 usecs [ +0,000004] [drm] UVD initialized successfully. [ +0,001745] [drm] Enabling audio 0 support [ +0,000001] [drm] Enabling audio 1 support [ +0,000000] [drm] Enabling audio 2 support [ +0,000001] [drm] Enabling audio 3 support [ +0,000001] [drm] Enabling audio 4 support [ +0,000001] [drm] Enabling audio 5 support [ +0,000035] [drm] ib test on ring 0 succeeded in 0 usecs [ +0,000029] [drm] ib test on ring 1 succeeded in 0 usecs [ +0,000029] [drm] ib test on ring 2 succeeded in 0 usecs [ +0,000019] [drm] ib test on ring 3 succeeded in 0 usecs [ +0,000018] [drm] ib test on ring 4 succeeded in 0 usecs [ +0,158619] [drm] ib test on ring 5 succeeded

Card is powered back on.

Thanks for explanation. In my case sometimes the card is powered on and off a lot of times without any explicit use.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

11 Feb 11 Feb

1:58 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

Joshua M. Thompson joshua.thompson@gmail.com changed:

What |Removed |Added ---------------------------------------------------------------------------- CC| |joshua.thompson@gmail.com

--- Comment #37 from Joshua M. Thompson joshua.thompson@gmail.com --- I'd just like to chime in, since I've been having similar runpm problems with both 3.13 and 3.14 kernels on my laptop with built-in Intel HD4000 and a Radeon 7730M. The dGPU is constantly suspending and resuming every few minutes, even though nothing is using it (although xrandr offload has been configured). This is accompanied by the burst of kernel messages mentioned above, as well as several seconds where the mouse cursor moves but X is otherwise unresponsive. If I let the laptop sit idle for a while and come back it'll be frozen with a blank screen, presumably because it either froze up blanking the screen or shortly after.

Running with radeon.runpm=0 eliminates the freezes and the lockups, but it keeps the laptop running a lot hotter than I'd like.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

7:30 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

Jani Nikula jani.nikula@intel.com changed:

What |Removed |Added ---------------------------------------------------------------------------- CC| |michael@mijobe.de

--- Comment #38 from Jani Nikula jani.nikula@intel.com --- *** Bug 69291 has been marked as a duplicate of this bug. ***

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

12 Feb 12 Feb

7:55 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #39 from Michel Dänzer michel@daenzer.net --- (In reply to Christoph Haag from comment #33)

...

I'm still kind of searching for a tool that could create a call graph and annotate it over time...

I suspect profiling is just not the right tool for this.

While the Radeon card is supposed to be off, can you attach gdb to the X server process and set a breakpoint on radeon_cs_emit, and if that triggers, attach the output of bt full?

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

9:41 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #40 from Christoph Haag haagch.christoph@googlemail.com --- Created attachment 125701 --> https://bugzilla.kernel.org/attachment.cgi?id=125701&action=edit radeon_cs_emit did not trigger, but ProcGetInputFocus causes some activity

(In reply to Michel Dänzer from comment #39)

...

(In reply to Christoph Haag from comment #33)

...
I'm still kind of searching for a tool that could create a call graph and annotate it over time...

I suspect profiling is just not the right tool for this.

Yes, what I actually wanted was a tool that traces every single function call and outputs the call chain for every single function call but there doesn't seem to be much in that regard except gdb scripting or long abandoned tools.

...

While the Radeon card is supposed to be off, can you attach gdb to the X server process and set a breakpoint on radeon_cs_emit, and if that triggers, attach the output of bt full?

It does not trigger. On the other hand it also does not trigger when running a 3d application like xonotic on the radeon gpu. So here's another maybe useless gdb log (radeon_cs_emit was Breakpoint 1)...

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

13 Feb 13 Feb

2:53 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #41 from Michel Dänzer michel@daenzer.net --- Created attachment 125821 --> https://bugzilla.kernel.org/attachment.cgi?id=125821&action=edit radeonsi: Short-circuit flushes with no preceding draw calls

Does this Mesa patch help?

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

9 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #42 from Christoph Haag haagch.christoph@googlemail.com --- 0:DIS: :DynOff:0000:01:00.0 1:IGD:+:Pwr:0000:00:02.0

Nice. It now powers off properly in X.

I haven't extensively tried but with the latest mesa git and this patch this bug seems to have vanished too: https://bugs.freedesktop.org/show_bug.cgi?id=69101 https://bugs.kde.org/show_bug.cgi?id=324864

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

12:12 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #43 from Christoph Haag haagch.christoph@googlemail.com --- But it silently breaks opencl.

Before the patch:

opencl-example (git)-[master] % sudo ./hello_world ./hello_world: /usr/lib/libOpenCL.so.1: no version information available (required by ./hello_world) There are 1 platforms. There are 1 GPU devices. clCreateContext() succeeded. clCreateCommandQueue() succeeded. clCreateProgramWithSource() suceeded. clBuildProgram() suceeded. clCreateKernel() suceeded. clCreateBuffer() succeeded. clSetKernelArg() succeeded. clEnqueueNDRangeKernel() suceeded. clFinish() succeeded. clEnqueueReadBuffer() suceeded. pi = 3.141590

After the patch:

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

14 Feb 14 Feb

2:59 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

Michel Dänzer michel@daenzer.net changed:

--- Comment #44 from Michel Dänzer michel@daenzer.net --- Created attachment 126011 --> https://bugzilla.kernel.org/attachment.cgi?id=126011&action=edit r600g,radeonsi: Consolidate logic for short-circuiting flushes

This patch doesn't break OpenCL for me.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

15 Feb 15 Feb

5:33 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #45 from Christoph Haag haagch.christoph@googlemail.com --- Yes, this is working pretty good, haven't noticed anything bad with it.

Even suspending works great, manual vgaswitcheroo was always a bit buggy after waking up, but this is working great.

Maybe a minor thing: When the GPU is off lm_sensors displays radeon-pci-0100 Adapter: PCI adapter temp1: +511.0°C Should maybe be 0 or -1 or the last detected temperature or something, 511°C could throw some tools off.

Anyway, in comment #42 I was not right, the glx compositing black window issue was not fixed, but it certainly had an influence on it. Might be worth checking it out.

It's starting to work better than fglrx now, so, very nice.

Now the only issue I have is unloading the radeon kernel module. But fortunately you don't really have to do that during normal usage so it's not a priority for anyone I guess.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

17 Feb 17 Feb

5:04 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #46 from Alex Deucher alexdeucher@gmail.com --- Created attachment 126451 --> https://bugzilla.kernel.org/attachment.cgi?id=126451&action=edit report 0 for temp when dGPU is powered off

(In reply to Christoph Haag from comment #45)

...

Yes, this is working pretty good, haven't noticed anything bad with it.

Even suspending works great, manual vgaswitcheroo was always a bit buggy after waking up, but this is working great.

Maybe a minor thing: When the GPU is off lm_sensors displays radeon-pci-0100 Adapter: PCI adapter temp1: +511.0°C Should maybe be 0 or -1 or the last detected temperature or something, 511°C could throw some tools off.

Attached patch should report 0.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

19 Feb 19 Feb

12:45 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #47 from Hohahiu rakothedin@gmail.com --- The problem is fixed now in kernel 3.14-rc3. Truly speaking the laptop now is very cold compared to what it was before. So thank you very much for your work, Alex and Michel! However whenever my dGPU is powered on there are following error messages: [ 32.333940] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!! [ 33.357599] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!! [ 34.381217] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!! [ 35.404882] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!! [ 36.428483] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!! [ 37.452137] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!! [ 38.475777] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!! [ 39.499447] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!! [ 40.523078] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!! [ 41.546747] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!! [ 41.566826] [drm:uvd_v1_0_start] *ERROR* UVD not responding, giving up!!! [ 41.566879] [drm:si_startup] *ERROR* radeon: failed initializing UVD (-1).

Is it related to this issue or do you want me to open new bug report?

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

12:46 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #48 from Hohahiu rakothedin@gmail.com --- Created attachment 126711 --> https://bugzilla.kernel.org/attachment.cgi?id=126711&action=edit dmesg

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

1:26 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #49 from Alex Deucher alexdeucher@gmail.com --- (In reply to Hohahiu from comment #47)

...

[ 41.546747] [drm:uvd_v1_0_start] *ERROR* UVD not responding, trying to reset the VCPU!!! [ 41.566826] [drm:uvd_v1_0_start] *ERROR* UVD not responding, giving up!!! [ 41.566879] [drm:si_startup] *ERROR* radeon: failed initializing UVD (-1).

Is it related to this issue or do you want me to open new bug report?

Please open a different bug for this.

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

2:45 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #50 from Joshua M. Thompson joshua.thompson@gmail.com --- On Fedora Rawhide kernel 3.14.0-0.rc3.git0.2.fc21.x86_64 still doesn't fix this (maybe it doesn't have the necessary patch yet?)

With radeon.runpm=1 I get the "UVD not responding" messages, the machine locks up frequently for 5-10 seconds to reinit the radeon, and the radeon performance is horrible (~400 fps in glxgears).

With radeon.runpm=0 I don't get "UVD not responding", the lockups are gone, and glxgears is at 3650 fps (still less than the 6-7k fps I got under 3.12 though.)

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

3 Mar 3 Mar

12:19 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

saadnaji89@gmail.com changed:

What |Removed |Added ---------------------------------------------------------------------------- CC| |saadnaji89@gmail.com

--- Comment #51 from saadnaji89@gmail.com --- Hello,

I am runnig Manjaro (Arch based distro 64 bit) with Kernel 3.13.5 as right now and I am having the same error except that I am geting repetitive blocks in the kernel log like this ever certain amount of time:

03/02/14 06:59:12 PM [drm] ring test on 0 succeeded in 2 usecs 03/02/14 06:59:12 PM [drm] ring test on 1 succeeded in 1 usecs 03/02/14 06:59:12 PM [drm] ring test on 2 succeeded in 1 usecs 03/02/14 06:59:12 PM [drm] ring test on 3 succeeded in 2 usecs 03/02/14 06:59:12 PM [drm] ring test on 4 succeeded in 1 usecs 03/02/14 06:59:12 PM [drm] ring test on 5 succeeded in 2 usecs 03/02/14 06:59:12 PM [drm] UVD initialized successfully. 03/02/14 06:59:12 PM [drm] Enabling audio 0 support 03/02/14 06:59:12 PM [drm] Enabling audio 1 support 03/02/14 06:59:12 PM [drm] Enabling audio 2 support 03/02/14 06:59:12 PM [drm] Enabling audio 3 support 03/02/14 06:59:12 PM [drm] Enabling audio 4 support 03/02/14 06:59:12 PM [drm] Enabling audio 5 support 03/02/14 06:59:12 PM [drm] ib test on ring 0 succeeded in 0 usecs 03/02/14 06:59:12 PM [drm] ib test on ring 1 succeeded in 0 usecs 03/02/14 06:59:12 PM [drm] ib test on ring 2 succeeded in 0 usecs 03/02/14 06:59:12 PM [drm] ib test on ring 3 succeeded in 0 usecs 03/02/14 06:59:12 PM [drm] ib test on ring 4 succeeded in 0 usecs 03/02/14 06:59:12 PM [drm] ib test on ring 5 succeeded 03/02/14 06:59:48 PM [drm] Disabling audio 0 support 03/02/14 06:59:48 PM [drm] Disabling audio 1 support 03/02/14 06:59:48 PM [drm] Disabling audio 2 support 03/02/14 06:59:48 PM [drm] Disabling audio 3 support 03/02/14 06:59:48 PM [drm] Disabling audio 4 support 03/02/14 06:59:48 PM [drm] Disabling audio 5 support

and this is causing freeze for my laptop for 2-3 seconds, as well as causing the temperature of the gpu to raise up by 10 C . This bug was not present in prior 3.13 kernel

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

12:20 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #52 from saadnaji89@gmail.com --- Created attachment 127801 --> https://bugzilla.kernel.org/attachment.cgi?id=127801&action=edit kenrel log

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

8:22 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #53 from Michel Dänzer michel@daenzer.net --- (In reply to saadnaji89 from comment #51)

...

I am runnig Manjaro (Arch based distro 64 bit) with Kernel 3.13.5 as right now and I am having the same error except that I am geting repetitive blocks in the kernel log like this ever certain amount of time:

See comment #35.

...

and this is causing freeze for my laptop for 2-3 seconds, as well as causing the temperature of the gpu to raise up by 10 C .

You probably need my Mesa patch to prevent the GPU from powering up needlessly: https://bugzilla.kernel.org/attachment.cgi?id=126011

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

11:16 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #54 from saadnaji89@gmail.com --- (In reply to Michel Dänzer from comment #53)

...

(In reply to saadnaji89 from comment #51)

...
I am runnig Manjaro (Arch based distro 64 bit) with Kernel 3.13.5 as right now and I am having the same error except that I am geting repetitive blocks in the kernel log like this ever certain amount of time:

See comment #35.

...
and this is causing freeze for my laptop for 2-3 seconds, as well as causing the temperature of the gpu to raise up by 10 C .

You probably need my Mesa patch to prevent the GPU from powering up needlessly: https://bugzilla.kernel.org/attachment.cgi?id=126011

Thanks you.

Are we going to see this patch applied to fix the problem in future kernel veriosn ?. I don't know whether you work for AMD or just someone is contributing to fix the problem

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

11:25 p.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #55 from Christoph Haag haagch.christoph@googlemail.com --- (In reply to saadnaji89 from comment #54)

...

Are we going to see this patch applied to fix the problem in future kernel veriosn ?. I don't know whether you work for AMD or just someone is contributing to fix the problem

I wondered whether to reply, but now...

He works for AMD. It's even on Wikipedia.

This patch is not to the kernel, it is to mesa.

It has been in mesa for 14 days: http://cgit.freedesktop.org/mesa/mesa/commit/?id=cf0172d46ab940a691da6516057...

(Is it necessary to keep talking about already fixed stuff?)

I'd say the only reason this is not closed yet is because of the radeon module unloading problems. Thinking of it, I haven't tried in a while.. Should I try to get some more information for it or is the delay just until someone has time to look at it?

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

1 Apr 1 Apr

9 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

Mike Cloaked mike.cloaked@gmail.com changed:

What |Removed |Added ---------------------------------------------------------------------------- CC| |mike.cloaked@gmail.com

--- Comment #56 from Mike Cloaked mike.cloaked@gmail.com --- It is possible that this report is directly related:

https://bugzilla.kernel.org/show_bug.cgi?id=73291

So I guess at this time the bug is still awaiting the mesa fix to test further in arch linux at least?

-- You are receiving this mail because: You are watching the assignee of the bug.

bugzilla-daemon＠bugzilla.kernel.org

4 Jan 4 Jan

1:09 a.m.

New subject: [Bug 65761] HD 7970M Hybrid - hangs and errors and rmmod causes crash

https://bugzilla.kernel.org/show_bug.cgi?id=65761

--- Comment #57 from Christoph Haag haagch.christoph@googlemail.com --- Long time, no update from me.

Using 3.19-rc2:

rmmod radeon works fine modprobe radeon after rmmod stalls once:

[ 53.031311] radeon 0000:01:00.0: ring 5 stalled for more than 10000msec [ 53.032230] radeon 0000:01:00.0: GPU lockup (current fence id 0x0000000000000000 last fence id 0x0000000000000002 on ring 5) [ 53.032322] [drm:uvd_v1_0_ib_test [radeon]] *ERROR* radeon: fence wait failed (-35). [ 53.033262] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on ring 5 (-35). [ 53.037997] [drm] Radeon Display Connectors [ 53.041287] radeon 0000:01:00.0: No connectors reported connected with modes [ 53.041289] [drm] Cannot find any crtc or sizes - going 1024x768 [ 53.041834] [drm] fb mappable at 0xE0479000 [ 53.041835] [drm] vram apper at 0xE0000000 [ 53.041835] [drm] size 3145728 [ 53.041836] [drm] fb depth is 24 [ 53.041837] [drm] pitch is 4096 [ 53.041941] radeon 0000:01:00.0: fb1: radeondrmfb frame buffer device [ 53.042347] [drm] Initialized radeon 2.40.0 20080528 for 0000:01:00.0 on minor 1

but it recovers and then works.

So rmmod radeon and modprobe radeon do not cause any critical errors anymore, it only is a bit annoying that modprobe radeon takes 10 seconds.

-- You are receiving this mail because: You are watching the assignee of the bug.

3760

Age (days ago)

4165

Last active (days ago)

dri-devel@lists.freedesktop.org

57 comments

1 participants

tags (0)

participants (1)

bugzilla-daemon＠bugzilla.kernel.org