On Wed, Nov 26, 2014 at 7:38 AM, Andy Lutomirski luto@amacapital.net wrote:
On Tue, Nov 25, 2014 at 10:42 PM, Michel Dänzer michel@daenzer.net wrote:
On 20.11.2014 09:58, Andy Lutomirski wrote:
On Wed, Nov 19, 2014 at 4:07 PM, Andy Lutomirski luto@amacapital.net wrote:
On Tue, Nov 18, 2014 at 11:19 PM, Michel Dänzer michel@daenzer.net wrote:
On 19.11.2014 09:21, Andy Lutomirski wrote:
On Mon, Nov 17, 2014 at 1:51 AM, Michel Dänzer michel@daenzer.net wrote: > > > On 15.11.2014 07:21, Andy Lutomirski wrote: >> >> >> >> On recent kernels (3.16 through 3.18-rc4, perhaps), doing anything >> graphics intensive seems to cause my system to become unusable for >> tens of seconds. Pointing Firefox at Google Maps is a big offender >> -- >> it can take several minutes for me to move my mouse far enough to >> close the tab and get my computer back. >> >> On bootup, I get this warning: >> [drm:btc_dpm_set_power_state] *ERROR* >> rv770_restrict_performance_levels_before_switch failed >> >> Setting radeon.dpm=0 seems to work around this problem at the cost of >> giving my rather slow graphics. >> >> Are there known issues here? > > > > > Can you bisect the kernel, or at least isolate which kernel version > first > introduced the problem?
With whatever userspace I'm running, I'm seeing it 3.13, 3.14, 3.15, 3.16, and 3.18-rc4+. I haven't tried other versions.
With radeon.dpm=0, I can still trigger short stalls (around one second), but I seem unable to trigger long stalls easily. (I say easily because, just as I was typing this email, my system stalled for about a minute.)
I can only think of two things offhand that could cause such extremely long stalls: Swap thrashing or IRQ storms.
With a setup where you can easily trigger long stalls, can you try getting a CPU profile for a stall with sysprof or perf?
Got one with perf:
16.82% Xorg libc-2.18.so [.] __memcpy_sse2_unaligned 9.20% swapper [kernel.kallsyms] [k] intel_idle 1.00% Xorg [kernel.kallsyms] [k] evergreen_irq_set 0.83% firefox libxul.so [.] 0x0000000001d93281 0.69% firefox libxul.so [.] 0x0000000001d932ad 0.62% firefox [kernel.kallsyms] [k] copy_user_generic_string 0.55% swapper [kernel.kallsyms] [k] evergreen_irq_ack 0.54% firefox libpthread-2.18.so [.] pthread_mutex_lock 0.52% firefox libpthread-2.18.so [.] pthread_mutex_unlock 0.45% Xorg [kernel.kallsyms] [k] drm_mm_insert_node_in_range_generic 0.41% Xorg [kernel.kallsyms] [k] lock_release 0.40% Xorg [kernel.kallsyms] [k] lock_acquire 0.35% firefox firefox [.] 0x000000000001245d 0.33% Xorg [kernel.kallsyms] [k] __module_address 0.31% firefox [kernel.kallsyms] [k] clear_page_c 0.29% Xorg [kernel.kallsyms] [k] copy_user_generic_string 0.28% firefox firefox [.] 0x0000000000013159
and:
Samples: 11K of event 'irq:irq_handler_entry', Event count (approx.): 11802 87.43% swapper [kernel.kallsyms] [k] handle_irq_event_percpu 7.52% firefox [kernel.kallsyms] [k] handle_irq_event_percpu 1.84% irq/36-ahci [kernel.kallsyms] [k] handle_irq_event_percpu 1.14% Xorg [kernel.kallsyms] [k] handle_irq_event_percpu 0.75% kworker/5:0 [kernel.kallsyms] [k] handle_irq_event_percpu 0.32% gnome-shell [kernel.kallsyms] [k] handle_irq_event_percpu 0.25% kworker/5:1H [kernel.kallsyms] [k] handle_irq_event_percpu 0.25% Media D~ode #10 [kernel.kallsyms] [k] handle_irq_event_percpu 0.19% ImageDe~er #330 [kernel.kallsyms] [k] handle_irq_event_percpu 0.07% pulseaudio [kernel.kallsyms] [k] handle_irq_event_percpu
The cycles were with -e cycles:pp, so I think that iret would have shown up if there were enough IRQs to cause the problem.
I'll build a kernel with latencytop.
I just caught call_rwsem_down_write_failed for 5379 ms in khugepaged (holy crap) and radeon_fence_default_wait for 489.2ms in Xorg.
Turning off THP gets rid of the khugepaged thing. The 489.2ms is radeon_fence_default_wait is amazingly reproducible -- I've seen that exact number three times now.
Sounds like the long stalls were THP, but the shorter ones might be radeon?
Can you get some call graphs for the profile or from latencytop? Make sure at least the kernel is built with frame pointers (CONFIG_FRAME_POINTER=y), preferably also userspace (-fno-omit-frame-pointer).
Will try next week. I'm out of town.
The relevant line from latencytop seems to be:
154 20441402 489139 radeon_fence_default_wait [radeon] fence_wait_timeout ttm_bo_wait [ttm] ttm_bo_move_accel_cleanup [ttm] radeon_move_blit.isra.12 [radeon] radeon_bo_move [radeon] ttm_bo_handle_move_mem [ttm] ttm_bo_evict [ttm] ttm_mem_evict_first [ttm] ttm_bo_mem_space [ttm] ttm_bo_validate [ttm] radeon_bo_fault_reserve_notify [radeon]
--Andy
--Andy
-- Earthling Michel Dänzer | http://www.amd.com Libre software enthusiast | Mesa and X developer
-- Andy Lutomirski AMA Capital Management, LLC