Hi guys,
Last week I switched from my good old 3.4.63 to 3.14-rc1 and noticed nasty display corruption when using nouveau. It seems that parts of the screen that are changing appear for a fraction of a second in random places. I've recorded this behavior: http://www.youtube.com/watch?v=IEq7JzGVzj0
My hardware is an old motherboard with an integrated 00:05.0 VGA compatible controller [0300]: NVIDIA Corporation C51G [GeForce 6100] [10de:0242] (rev a2). Since my CPU is an ancient AMD Sempron(tm) Processor 2800+, it took me a few days to track this issue down.
Here is a summary of the various kernels:
1) 3.4.63: no display problems. Works great.
2) commit 928c2f0c006bf7f381f58af2b2786d2a858ae311 (drm/fb-helper: don't sleep for screen unblank when an oops is in progress): scrollbars have a pink line. I didn't track down which commit introduced this pink corruption. No other problems.
3) commit c21eb21cb50d58e7cbdcb8b9e7ff68b85cfa5095 (Revert "drm: mark context support as a legacy subsystem"): this fixes the pink lines on scrollbars and introduces the nasty display corruption. It's the commit right after the previous one, which means it is the first bad commit for the corruption I recorded and uploaded to YouTube.
4) 3.14-rc1: no change in behavior since c21eb21cb50d58e7cbdcb8b9e7ff68b85cfa5095. No pink lines, but the display corruption still happens.
As you can see, this is a bit complex. A regression with pink lines on scrollbars was introduced at some point and later fixed. Unfortunately, the same commit that fixed the pink lines also introduced a much nastier corruption during screen refresh.
Is there any more info I can provide to help fix this?
I'm a bit afraid that the commit I tracked down, c21eb21cb50d58e7cbdcb8b9e7ff68b85cfa5095, only exposed some hidden issue. Could bisecting the pink-line regression help?
On Sun, Feb 9, 2014 at 5:08 PM, Rafał Miłecki zajec5@gmail.com wrote:
Hi guys,
Last week I switched from my good old 3.4.63 to 3.14-rc1 and noticed nasty display corruption when using nouveau. It seems that parts of the screen that are changing appear for a fraction of a second in random places. I've recorded this behavior: http://www.youtube.com/watch?v=IEq7JzGVzj0
My hardware is an old motherboard with an integrated 00:05.0 VGA compatible controller [0300]: NVIDIA Corporation C51G [GeForce 6100] [10de:0242] (rev a2). Since my CPU is an ancient AMD Sempron(tm) Processor 2800+, it took me a few days to track this issue down.
Here is a summary of the various kernels:
- 3.4.63
No display problems. Works great.
- commit 928c2f0c006bf7f381f58af2b2786d2a858ae311
drm/fb-helper: don't sleep for screen unblank when an oops is in progress
Scrollbars have a pink line. I didn't track down which commit introduced this pink corruption. No other problems.
- commit c21eb21cb50d58e7cbdcb8b9e7ff68b85cfa5095
Revert "drm: mark context support as a legacy subsystem"
This fixes the pink lines on scrollbars and introduces the nasty display corruption. It's the commit right after the previous one, which means it is the first bad commit for the corruption I recorded and uploaded to YouTube.
- 3.14-rc1
No change in behavior since c21eb21cb50d58e7cbdcb8b9e7ff68b85cfa5095. No pink lines, but the display corruption still happens.
Can you boot with nouveau.config=NvMSI=0 ? If that helps, there are some patches on the nouveau/dri-devel lists (search for "nv4c") that may help you.
-ilia
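To confirm that the option suggested above actually reached the running kernel, the boot command line can be inspected. This is a minimal sketch, not part of the original exchange: the helper name `nouveau_config` is hypothetical, and it only reads `/proc/cmdline` (or a file passed as an argument) and prints any `nouveau.config=` setting it finds.

```shell
# Hypothetical helper: print the nouveau.config=... option from a kernel
# command line. Reads /proc/cmdline unless another file is given.
nouveau_config() {
    cmdline_file="${1:-/proc/cmdline}"
    # -o prints only the matching part, e.g. "nouveau.config=NvMSI=0"
    grep -o 'nouveau\.config=[^ ]*' "$cmdline_file"
}
```

Running `nouveau_config` after a reboot prints nothing (and returns non-zero) if the parameter was not passed.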
2014-02-09 23:12 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
Can you boot with nouveau.config=NvMSI=0 ? If that helps, there are some patches on the nouveau/dri-devel lists (search for "nv4c") that may help you.
Unfortunately this config parameter doesn't help :(
On Mon, Feb 10, 2014 at 10:12 AM, Rafał Miłecki zajec5@gmail.com wrote:
2014-02-09 23:12 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
Can you boot with nouveau.config=NvMSI=0 ? If that helps, there are some patches on the nouveau/dri-devel lists (search for "nv4c") that may help you.
Unfortunately this config parameter doesn't help :(
Too bad. It may still be worthwhile applying the patches and seeing what happens... it seems like some registers got switched around on the nv4x IGP's:
http://lists.freedesktop.org/archives/nouveau/2014-February/016032.html
http://lists.freedesktop.org/archives/nouveau/2014-February/016033.html
http://lists.freedesktop.org/archives/nouveau/2014-February/016034.html
BTW, youtube says "this video is unavailable".
Is there anything in dmesg when the display corruptions happen?
There was also an issue with libdrm_nouveau for pre-nv50 chips, when compiled with gcc-4.8 some time back... fixed in... 2.4.48 or so?
Lastly, it may be worth trying 3.11.x and 3.12.x to get a better handle on when problems happened. The commits you cite are in the middle of releases, and may have various badness associated with them (e.g. 3.12-rc had a later-disabled MSI implementation, back in 3.13... probably some other stuff).
-ilia
2014-02-10 20:06 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
On Mon, Feb 10, 2014 at 10:12 AM, Rafał Miłecki zajec5@gmail.com wrote:
2014-02-09 23:12 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
Can you boot with nouveau.config=NvMSI=0 ? If that helps, there are some patches on the nouveau/dri-devel lists (search for "nv4c") that may help you.
Unfortunately this config parameter doesn't help :(
Too bad. It may still be worthwhile applying the patches and seeing what happens... it seems like some registers got switched around on the nv4x IGP's:
http://lists.freedesktop.org/archives/nouveau/2014-February/016032.html
http://lists.freedesktop.org/archives/nouveau/2014-February/016033.html
http://lists.freedesktop.org/archives/nouveau/2014-February/016034.html
I've applied all 3 patches, compiled, tried... didn't help. I've also tried nouveau.config=NvMSI=0 on top of your patches, didn't help.
BTW, youtube says "this video is unavailable".
Ohh, Google/YouTube really doesn't like ppl removing G+ account... http://files.zajec.net/20140208-nouveau.mp4
Is there anything in dmesg when the display corruptions happen?
No.
There was also an issue with libdrm_nouveau for pre-nv50 chips, when compiled with gcc-4.8 some time back... fixed in... 2.4.48 or so?
I use openSUSE 12.2 (x86_64) which provides gcc 4.7.1 and libdrm_nouveau1-2.4.33-2.3.2.x86_64. I assume libdrm_nouveau was compiled using that 4.7.1.
Lastly, it may be worth trying 3.11.x and 3.12.x to get a better handle on when problems happened. The commits you cite are in the middle of releases, and may have various badness associated with them (e.g. 3.12-rc had a later-disabled MSI implementation, back in 3.13... probably some other stuff).
I'll provide results tomorrow.
On Mon, Feb 10, 2014 at 3:05 PM, Rafał Miłecki zajec5@gmail.com wrote:
2014-02-10 20:06 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
There was also an issue with libdrm_nouveau for pre-nv50 chips, when compiled with gcc-4.8 some time back... fixed in... 2.4.48 or so?
I use openSUSE 12.2 (x86_64) which provides gcc 4.7.1 and libdrm_nouveau1-2.4.33-2.3.2.x86_64. I assume libdrm_nouveau was compiled using that 4.7.1.
Hmmm... the nouveau drm rewrite went into 2.4.34... I guess you're using pretty old userspace in general, since everything depends on the post-rewrite libdrm_nouveau. Of course it definitely sounds like a kernel issue, but I can't help but wonder if this is a non-issue with later userspace.
So there are basically 2 things left to do, in order of time-consuming-ness:
(a) try a live{cd,usb} (e.g. arch, or something else that has recent software), and see if the issue is still present there.
(b) bisect. you can (almost) definitely restrict the bisect to drivers/gpu/drm/nouveau. if you have additional computational power, i would recommend looking into distcc for speeding up the compiles. it may be interesting to also try 3.6.x since 3.7 received a pretty big rewrite. but a git bisect is a lot more direct in figuring these things out :)
After I watched your video, it definitely brought back memories of another bug or perhaps email on this list a while back (definitely within the past year), but unfortunately I can't quite place it :(
-ilia
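The path-limited bisect suggested in (b) could look roughly like the sketch below. The wrapper name `bisect_nouveau` is hypothetical. It assumes it is run from the top of a Linux kernel git checkout and takes a known-good and a known-bad revision; git then only stops on commits that touch drivers/gpu/drm/nouveau.

```shell
# Hypothetical wrapper: start a bisect limited to the nouveau driver.
# Usage: bisect_nouveau <good-rev> <bad-rev>   (from a kernel git checkout)
bisect_nouveau() {
    good="$1"
    bad="$2"
    # "git bisect start" takes the bad revision first, then the good one;
    # the path after "--" restricts which commits are considered.
    git bisect start "$bad" "$good" -- drivers/gpu/drm/nouveau
}

# After building and booting each candidate, record the verdict with
#   git bisect good     (no corruption)
#   git bisect bad      (corruption visible)
# and return to the original branch at the end with
#   git bisect reset
```

Restricting by path can miss a culprit outside the driver directory (for example in drm core or ttm), which is presumably why the advice says "(almost) definitely".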
2014-02-11 11:41 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
(b) bisect. you can (almost) definitely restrict the bisect to drivers/gpu/drm/nouveau. if you have additional computational power, i would recommend looking into distcc for speeding up the compiles. it may be interesting to also try 3.6.x since 3.7 received a pretty big rewrite. but a git bisect is a lot more direct in figuring these things out :)
I've already bisected the commit that turned the pink line issue into general screen corruption. As a reminder, it was:
commit c21eb21cb50d58e7cbdcb8b9e7ff68b85cfa5095
Author: Dave Airlie airlied@redhat.com
Date: Fri Sep 20 08:32:59 2013 +1000
Revert "drm: mark context support as a legacy subsystem"
Would you like me to bisect the commit that introduced the pink line issue?
On Tue, Feb 11, 2014 at 6:09 AM, Rafał Miłecki zajec5@gmail.com wrote:
2014-02-11 11:41 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
(b) bisect. you can (almost) definitely restrict the bisect to drivers/gpu/drm/nouveau. if you have additional computational power, i would recommend looking into distcc for speeding up the compiles. it may be interesting to also try 3.6.x since 3.7 received a pretty big rewrite. but a git bisect is a lot more direct in figuring these things out :)
I've already bisected the commit that turned the pink line issue into general screen corruption. As a reminder, it was:
commit c21eb21cb50d58e7cbdcb8b9e7ff68b85cfa5095
Author: Dave Airlie airlied@redhat.com
Date: Fri Sep 20 08:32:59 2013 +1000
Revert "drm: mark context support as a legacy subsystem"
Right, and this commit was reverting another commit -- 7c510133d93dd6f15ca040733ba7b2891ed61fd1. I bet that if you check out to right before this commit, you'll get the "general" corruption again (it went into 3.12-rc1, the revert went into 3.12-rc2 -- I assume that when you tested 3.11.x, the corruption was the same as in 3.12.x, which contains both this commit and its revert).
Would you like me to bisect the commit that introduced the pink line issue?
I would like you to bisect whatever corruption you see when you tested 3.11.x -- is that the pink lines, or the flickering textures?
-ilia
2014-02-11 11:41 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
On Mon, Feb 10, 2014 at 3:05 PM, Rafał Miłecki zajec5@gmail.com wrote:
2014-02-10 20:06 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
There was also an issue with libdrm_nouveau for pre-nv50 chips, when compiled with gcc-4.8 some time back... fixed in... 2.4.48 or so?
I use openSUSE 12.2 (x86_64) which provides gcc 4.7.1 and libdrm_nouveau1-2.4.33-2.3.2.x86_64. I assume libdrm_nouveau was compiled using that 4.7.1.
Hmmm... the nouveau drm rewrite went into 2.4.34... I guess you're using pretty old userspace in general, since everything depends on the post-rewrite libdrm_nouveau. Of course it definitely sounds like a kernel issue, but I can't help but wonder if this is a non-issue with later userspace.
So there are basically 2 things left to do, in order of time-consuming-ness:
(a) try a live{cd,usb} (e.g. arch, or something else that has recent software), and see if the issue is still present there.
I've tried Fedora 20 booted from USB. It suffers from the same issue. It's based on kernel 3.11.10, but I'm sure it has more up to date userspace.
(b) bisect. you can (almost) definitely restrict the bisect to drivers/gpu/drm/nouveau. if you have additional computational power, i would recommend looking into distcc for speeding up the compiles. it may be interesting to also try 3.6.x since 3.7 received a pretty big rewrite. but a git bisect is a lot more direct in figuring these things out :)
Bisecting nouveau between 3.10 and 3.11 is a real pain.
Ben introduced a boot regression with:
commit dceef5d87cc01358cc1434416f3272e2ddc3d97a
Author: Ben Skeggs bskeggs@redhat.com
Date: Mon Mar 4 13:01:21 2013 +1000
drm/nouveau/fb: initialise vram controller as pfb sub-object
I first had to bisect the fix for that regression, which turned out to be:
commit 6284bf41b97fb36ed96b664a3c23b6dc3661f5f9
Author: Ilia Mirkin imirkin@alum.mit.edu
Date: Fri Aug 9 17:25:54 2013 -0400
drm/nouveau/fb: fix null derefs in nv49 and nv4e init
Unfortunately, another init regression was introduced in the meantime with:
commit 0108bc808107b97e101b15af9705729626be6447
Author: Maarten Lankhorst maarten.lankhorst@canonical.com
Date: Sun Jul 7 10:40:19 2013 +0200
drm/nouveau: do not allow negative sizes for now
And I had to find the fix for that, which was:
commit 35095f7529bb6abdfc956e7a41ca6957520b70a7
Author: Maarten Lankhorst maarten.lankhorst@canonical.com
Date: Sat Jul 27 10:17:12 2013 +0200
drm/nouveau: fix size check for cards without vm
Then I was finally able to test every commit between 3.10 and 3.11 without skipping 90% of them.
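The thread does not say how the two fixes were carried along while bisecting. One common approach, sketched here as an assumption, is to apply them temporarily on top of each bisect candidate and discard them again before recording the verdict. The helper names `apply_fixes` and `drop_fixes` are hypothetical; the two hashes are the fix commits named above.

```shell
# Fix commits named above (null derefs in nv49/nv4e init, size check for
# cards without vm).
FIXES="6284bf41b97fb36ed96b664a3c23b6dc3661f5f9 35095f7529bb6abdfc956e7a41ca6957520b70a7"

# Hypothetical helper: apply the known fixes to the current bisect
# candidate without creating commits. Stops at the first failure; a fix
# that conflicts or is already present needs manual handling.
apply_fixes() {
    for fix in $FIXES; do
        git cherry-pick --no-commit "$fix" || return 1
    done
}

# Hypothetical helper: throw the temporary changes away again, so that
# "git bisect good" / "git bisect bad" can move on from a clean tree.
drop_fixes() {
    git reset --hard
}
```

A typical round would then be `apply_fixes`, build, boot and test, `drop_fixes`, and finally `git bisect good` or `git bisect bad`.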
2014-02-11 11:41 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
(b) bisect. you can (almost) definitely restrict the bisect to drivers/gpu/drm/nouveau. if you have additional computational power, i would recommend looking into distcc for speeding up the compiles. it may be interesting to also try 3.6.x since 3.7 received a pretty big rewrite. but a git bisect is a lot more direct in figuring these things out :)
After I watched your video, it definitely brought back memories of another bug or perhaps email on this list a while back (definitely within the past year), but unfortunately I can't quite place it :(
I've finally bisected between 3.10 and 3.11:
78ae0ad403daf11cf63da86923d2b5dbeda3af8f is the first bad commit
commit 78ae0ad403daf11cf63da86923d2b5dbeda3af8f
Author: Ben Skeggs bskeggs@redhat.com
Date: Wed Aug 21 11:30:36 2013 +1000
drm/nv04/disp: fix framebuffer pin refcounting
I've booted that commit and the one before it a few times. Every time I booted 78ae0ad I got the corruption. Every time I booted 6ff8c76 (the earlier commit), it was OK.
Ben: any idea why this commit caused a regression on my hardware? From the commit message I assumed it was supposed to affect only ancient nv04 hardware. Did it perhaps accidentally touch the code path for my nv4e?
On Sun, Feb 16, 2014 at 10:17 AM, Rafał Miłecki zajec5@gmail.com wrote:
2014-02-11 11:41 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
(b) bisect. you can (almost) definitely restrict the bisect to drivers/gpu/drm/nouveau. if you have additional computational power, i would recommend looking into distcc for speeding up the compiles. it may be interesting to also try 3.6.x since 3.7 received a pretty big rewrite. but a git bisect is a lot more direct in figuring these things out :)
After I watched your video, it definitely brought back memories of another bug or perhaps email on this list a while back (definitely within the past year), but unfortunately I can't quite place it :(
I've finally bisected between 3.10 and 3.11:
78ae0ad403daf11cf63da86923d2b5dbeda3af8f is the first bad commit commit 78ae0ad403daf11cf63da86923d2b5dbeda3af8f Author: Ben Skeggs bskeggs@redhat.com Date: Wed Aug 21 11:30:36 2013 +1000
drm/nv04/disp: fix framebuffer pin refcounting
I've booted that commit and one commit older few times. Every time I booted 78ae0ad I got corruption. Every time I booted 6ff8c76 (it's the earlier commit), it was OK.
But I bet if you restart X, you get a backtrace, right?
Ben: any idea why this commit caused regression for my hardware? From the commit message I assume it was supposed to affect some ancient nv04 hardware only. Did it accidentally touch my nv4e path code maybe?
All pre-nv50 hardware (including your nv4e) use this path.
-ilia
2014-02-16 19:55 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
But I bet if you restart X, you get a backtrace, right?
That's right.
78ae0ad: Corruptions
6ff8c76: WARNING: at drivers/gpu/drm/nouveau/nouveau_bo.c:151 nouveau_bo_del_ttm+0x80/0x90 [nouveau]() (after quitting X with "init 3")
On Sun, Feb 16, 2014 at 2:15 PM, Rafał Miłecki zajec5@gmail.com wrote:
2014-02-16 19:55 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
But I bet if you restart X, you get a backtrace, right?
That's right.
78ae0ad: Corruptions
6ff8c76: WARNING: at drivers/gpu/drm/nouveau/nouveau_bo.c:151 nouveau_bo_del_ttm+0x80/0x90 [nouveau]() (after quitting X with "init 3")
OK, as expected. And those backtraces are the fallout from a boatload of ttm changes that went into 3.10. So 3.9 should be safe for you :)
So that these findings don't get lost/forgotten, mind filing a bug with your various findings as per http://nouveau.freedesktop.org/wiki/Bugs/ ? Unfortunately I have never observed your particular issue on any of my pre-nv50 cards (nv05/18/34/42/44), so there must be some special component. Perhaps it's the IGP-ness. Although others with IGP's haven't complained about this.
-ilia
2014-02-10 20:06 GMT+01:00 Ilia Mirkin imirkin@alum.mit.edu:
Lastly, it may be worth trying 3.11.x and 3.12.x to get a better handle on when problems happened. The commits you cite are in the middle of releases, and may have various badness associated with them (e.g. 3.12-rc had a later-disabled MSI implementation, back in 3.13... probably some other stuff).
I've tried Linux 3.11.0, 3.11.10 and 3.12.0. All of them suffer from this corruption.
dri-devel@lists.freedesktop.org