On Tue, Apr 23, 2013 at 10:15 AM, Michel Dänzer michel@daenzer.net wrote:
On Tue, 2013-04-23 at 10:08 -0700, Andy Lutomirski wrote:
On Mon, Apr 22, 2013 at 10:55 PM, Michel Dänzer michel@daenzer.net wrote:
On Mon, 2013-04-22 at 16:19 -0700, Andy Lutomirski wrote:
I'm not convinced there's an actual hang. 40 seconds is a long time, and I've only ever seen this when clicking something, and when this happens, the screen goes blank immediately (not after a 40 second delay).
Hmm, now that you mention this, I notice in your original report it claims that the CP stalled for 'more than 5102593msec', which is clearly bogus. Looks like something's wrong with the lockup detection. Did this start after a kernel update or something like that?
It's recent. It may have been when F18 switched from 3.7 to 3.8.
Can you reproduce it with an upstream kernel? Can you bisect? I realize it'll probably take a long time, but unless someone has an idea which change might have introduced the problem...
Yuck. I can try, but it takes days to reproduce this, so it will take forever (and may end up with a wrong answer if I get lucky and don't crash).
I think there are bugs in the lockup detection and in the lockup recovery. Firefox, in particular, is *really* slow afterwards. Are interrupts possibly getting dropped or misconfigured during the reset?
Let's not get ahead of ourselves and focus on the lockup detection issue for now.
I don't understand the r600_gpu_check_soft_reset code, but could this be the sequence of events that triggers it?
1. radeon_ring_is_lockup is called just as the very last command on the ring completes, so last_rptr gets set to the rptr.
2. Nothing happens for a while (i.e. > lockup_timeout). rptr doesn't change.
3. A very slightly slow operation starts.
4. radeon_ring_is_lockup gets called before that command completes.
radeon_ring_test_lockup will not detect a jiffies wrap-around (because there wasn't one), rptr will equal last_rptr (because there hasn't been any progress since last time), and the elapsed time will be really long, because the function hasn't been called for a long time. So a lockup gets detected, even though nothing's wrong.
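To make the suspected sequence concrete, here is a minimal sketch in plain C. It is a simplified model, not the driver's code: struct ring_model, model_ring_is_lockup, the millisecond timestamps (instead of jiffies) and the 10000 ms timeout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative timeout; an assumption, not read from the driver. */
#define MODEL_LOCKUP_TIMEOUT_MS 10000u

/* Hypothetical, simplified stand-in for the ring's lockup tracking state. */
struct ring_model {
    uint32_t rptr;      /* current read pointer */
    uint32_t last_rptr; /* read pointer recorded at the last update */
    uint64_t last_ms;   /* time of the last update, in milliseconds */
};

/* Reports a lockup when rptr has not moved since the last update and
 * at least the timeout has elapsed since that update. */
bool model_ring_is_lockup(struct ring_model *r, uint64_t now_ms)
{
    if (r->rptr != r->last_rptr) {
        /* Progress was made: refresh the tracking information. */
        r->last_rptr = r->rptr;
        r->last_ms = now_ms;
        return false;
    }
    return (now_ms - r->last_ms) >= MODEL_LOCKUP_TIMEOUT_MS;
}

/* Replays steps 1-4 and returns what the final check reports. */
bool false_positive_demo(void)
{
    struct ring_model r = { .rptr = 100, .last_rptr = 0, .last_ms = 0 };

    /* Step 1: checked just as the last command completes. */
    bool first = model_ring_is_lockup(&r, 1000);
    assert(!first);
    (void)first;

    /* Step 2: the ring sits idle for 60 s; rptr does not change. */
    /* Step 3: a command starts at t = 61000 ms and is still running. */
    /* Step 4: checked 5 ms later, before that command completes. */
    return model_ring_is_lockup(&r, 61005);
}
```

In this model the final check reports a lockup even though the only pending command has been running for 5 ms, because the elapsed time is measured from the last update rather than from when the command was submitted.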
There's a comment above radeon_ring_test_lockup that says:
 * A possible false positivie is if we get call after while and last_cp_rptr ==
 * the current CP rptr, even if it's unlikely it might happen. To avoid this
 * if the elapsed time since last call is bigger than 2 second than we return
 * false and update the tracking information. Due to this the caller must call
 * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported
 * the fencing code should be cautious about that.
but the corresponding code doesn't appear to exist anywhere.
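For reference, a guard along the lines that comment describes might look roughly like the sketch below. This is only my reading of the comment, not code from the driver: struct guarded_ring, guarded_ring_is_lockup, the last_call_ms field and both constants are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values; assumptions, not read from the driver. */
#define GUARD_LOCKUP_TIMEOUT_MS 10000u
#define GUARD_STALE_CALL_MS      2000u

/* Hypothetical tracking state with an extra "time of last call" field. */
struct guarded_ring {
    uint32_t rptr;             /* current read pointer */
    uint32_t last_rptr;        /* read pointer at the last update */
    uint64_t last_progress_ms; /* time of the last update */
    uint64_t last_call_ms;     /* time this check was last called */
};

bool guarded_ring_is_lockup(struct guarded_ring *r, uint64_t now_ms)
{
    uint64_t since_call = now_ms - r->last_call_ms;

    r->last_call_ms = now_ms;
    /* Either the ring made progress, or nobody has checked for more
     * than 2 s: refresh the tracking information and report no lockup. */
    if (r->rptr != r->last_rptr || since_call > GUARD_STALE_CALL_MS) {
        r->last_rptr = r->rptr;
        r->last_progress_ms = now_ms;
        return false;
    }
    return (now_ms - r->last_progress_ms) >= GUARD_LOCKUP_TIMEOUT_MS;
}

/* Same idle-gap scenario as steps 1-4: check, idle 60 s, check again. */
bool guarded_idle_gap_reports_lockup(void)
{
    struct guarded_ring r = { .rptr = 100 };

    guarded_ring_is_lockup(&r, 1000);
    return guarded_ring_is_lockup(&r, 61005);
}

/* After the idle gap, poll once per second with rptr stuck and count
 * how many polls it takes before a lockup is reported (0 if never). */
int guarded_polls_until_lockup(void)
{
    struct guarded_ring r = { .rptr = 100 };

    guarded_ring_is_lockup(&r, 1000);
    guarded_ring_is_lockup(&r, 61005);
    for (int i = 1; i <= 20; i++) {
        if (guarded_ring_is_lockup(&r, 61005 + (uint64_t)i * 1000))
            return i;
    }
    return 0;
}
```

With a guard like this, the first check after a long idle period only resets the tracking state, and a genuinely stuck ring is still reported once the caller has polled repeatedly (less than 2 s apart) for the full timeout. That matches the comment's requirement that the caller must call the function several times in quick succession.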
Also, and unrelatedly, I revoke my comment about gmail issues being fixed with hyperz off. Gmail still draws incorrectly. This may or may not have anything to do with the radeon driver.
--Andy