On Tue, Nov 09, 2010 at 11:32:57AM +0100, Michel Dänzer wrote:
On Tue, 2010-11-09 at 11:07 +0100, Thomas Hellstrom wrote:
On 11/09/2010 10:53 AM, Thomas Hellstrom wrote:
On 11/09/2010 10:29 AM, Markus Trippelsdorf wrote:
OK I've found the buggy commit by bisection:
e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
commit e376573f7267390f4e1bdc552564b6fb913bce76
Author: Michel Dänzer <daenzer@vmware.com>
Date: Thu Jul 8 12:43:28 2010 +1000
drm/radeon: fall back to GTT if bo creation/validation in VRAM fails.
This fixes a problem where on low VRAM cards we'd run out of
space for validation.
[airlied: Tested on my M7, Thinkpad T42, compiz works with no
problems.]
Signed-off-by: Michel Dänzer <daenzer@vmware.com>
Cc: stable@kernel.org
Signed-off-by: Dave Airlie <airlied@redhat.com>
Please note that this is an old commit from 2.6.36-rc. When I revert it the kernel no longer crashes. Instead I see the following in my dmesg:
Hmm, so this sounds like something in the Radeon eviction error path is causing corruption. I had a similar problem with vmwgfx, when I tried to unref a BO _after_ ttm_bo_init() failed. ttm_bo_init() is really supposed to call unref itself for various reasons, so calling unref() or kfree() after a failed ttm_bo_init() will cause corruption.
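To make the ownership rule concrete, here is a minimal userspace sketch of an init function that drops the object itself when it fails. It is not TTM code: struct toy_bo, toy_bo_init(), toy_bo_unref(), toy_bo_create() and the simulate_failure flag are all made-up names for illustration, and the destroy counter exists only so the behaviour can be checked.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer object; not the real TTM structure. */
struct toy_bo {
    int refcount;
    void *backing;
};

/* Counts how many objects have been released (illustration only). */
static int toy_bo_destroy_count;

/* Drop one reference; free the object when the count reaches zero. */
static void toy_bo_unref(struct toy_bo *bo)
{
    assert(bo != NULL && bo->refcount > 0);
    if (--bo->refcount == 0) {
        free(bo->backing);      /* free(NULL) is a no-op */
        free(bo);
        toy_bo_destroy_count++;
    }
}

/*
 * Initialise bo. On failure the object is released right here, so the
 * caller must treat the pointer as dead: no unref, no free, no reuse.
 * simulate_failure is an assumption of this sketch; it stands in for a
 * failed placement.
 */
static int toy_bo_init(struct toy_bo *bo, size_t size, int simulate_failure)
{
    bo->refcount = 1;
    bo->backing = NULL;

    if (!simulate_failure)
        bo->backing = malloc(size);
    if (bo->backing == NULL) {
        toy_bo_unref(bo);       /* init cleans up after itself */
        return -ENOMEM;
    }
    return 0;
}

/* A well-behaved caller: it only forgets the pointer when init fails. */
static struct toy_bo *toy_bo_create(size_t size, int simulate_failure)
{
    struct toy_bo *bo = malloc(sizeof(*bo));

    if (bo == NULL)
        return NULL;
    if (toy_bo_init(bo, size, simulate_failure) != 0)
        return NULL;            /* bo is already gone, do not free it */
    return bo;
}
```

With this contract, adding an extra toy_bo_unref(bo) or free(bo) to the failure branch of toy_bo_create() would release the same memory twice, which is the kind of corruption described above.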
In any case, the error below also suggests something is a bit fragile in the Radeon driver:
First, an accelerated eviction may fail, as in the message below, but then there must always be a backup plan, such as unaccelerated eviction to system memory. On BO creation there are a number of placement strategies, but if all else fails, it should be possible to initially place the BO in system memory.
Second, if BO validation fails during a command submission due to insufficient VRAM / TT, the driver should retry the complete validation cycle after first blocking all other validators and then evicting everything not pinned, to avoid failures due to fragmentation.
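The two points above can be sketched roughly as follows. This is a toy userspace model, not driver code: the domains are plain byte counters (so fragmentation itself is not modelled), there is no locking, and every name (toy_mm, toy_move(), toy_create(), toy_evict_unpinned(), toy_validate_list()) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum toy_domain { TOY_VRAM, TOY_GTT, TOY_SYSTEM, TOY_NUM_DOMAINS };

struct toy_bo {
    size_t size;
    int pinned;             /* pinned objects are never evicted */
    int placed;             /* nonzero once the object has a home */
    enum toy_domain where;
};

#define TOY_MAX_BOS 8

struct toy_mm {
    size_t free_bytes[TOY_NUM_DOMAINS];
    struct toy_bo *bos[TOY_MAX_BOS];
    size_t nbos;
};

/* Put bo into domain d, releasing its old home; -1 if it does not fit. */
static int toy_move(struct toy_mm *mm, struct toy_bo *bo, enum toy_domain d)
{
    if (bo->placed && bo->where == d)
        return 0;
    if (mm->free_bytes[d] < bo->size)
        return -1;
    if (bo->placed) {
        mm->free_bytes[bo->where] += bo->size;
    } else {
        assert(mm->nbos < TOY_MAX_BOS);
        mm->bos[mm->nbos++] = bo;
        bo->placed = 1;
    }
    mm->free_bytes[d] -= bo->size;
    bo->where = d;
    return 0;
}

/* Creation: walk the preferred placements, with system memory last. */
static int toy_create(struct toy_mm *mm, struct toy_bo *bo)
{
    static const enum toy_domain order[] = { TOY_VRAM, TOY_GTT, TOY_SYSTEM };

    for (size_t i = 0; i < sizeof(order) / sizeof(order[0]); i++)
        if (toy_move(mm, bo, order[i]) == 0)
            return 0;
    return -1;
}

/* Backup plan: push every unpinned object in domain d out to system. */
static int toy_evict_unpinned(struct toy_mm *mm, enum toy_domain d)
{
    for (size_t i = 0; i < mm->nbos; i++) {
        struct toy_bo *bo = mm->bos[i];

        if (bo->where != d || bo->pinned)
            continue;
        if (toy_move(mm, bo, TOY_SYSTEM) != 0)
            return -1;
    }
    return 0;
}

/* Validation: all objects on the list must end up in VRAM. */
static int toy_validate_list(struct toy_mm *mm, struct toy_bo **list, size_t n)
{
    for (int attempt = 0; attempt < 2; attempt++) {
        size_t i;

        for (i = 0; i < n; i++)
            if (toy_move(mm, list[i], TOY_VRAM) != 0)
                break;
        if (i == n)
            return 0;
        /* A real driver would block all other validators before this. */
        if (attempt == 0 && toy_evict_unpinned(mm, TOY_VRAM) != 0)
            return -1;
    }
    return -1;
}
```

The point of the second pass is that it redoes the whole list instead of only the object that failed, since the eviction may also have moved objects that were validated earlier in the same cycle.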
/Thomas
Indeed, it seems like the commit you mention just retries ttm_bo_init() after it previously failed. At that point the bo has been destroyed, so that is probably what's causing the BUG you are seeing.
Admittedly, the fact that ttm_bo_init() calls unref on failure is not properly documented in the function description. The reason for doing so is to have a single path for freeing all BO resources already allocated at the point of failure.
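As an illustration only (a userspace sketch, not the driver patch), this is roughly what a single release path and a safe retry look like. The names toy_bo_destroy(), toy_bo_init(), toy_bo_create_retry() and the fail_step knob are invented for the example; the important part is that every retry starts from a freshly allocated object, because the previous one no longer exists after a failed init.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object with two separately allocated resources. */
struct toy_bo {
    void *pages;    /* NULL until allocated */
    void *mapping;  /* NULL until allocated */
};

/* Counts calls to the release path (illustration only). */
static int toy_destroy_calls;

/* The only release path; it copes with partially built objects. */
static void toy_bo_destroy(struct toy_bo *bo)
{
    free(bo->mapping);  /* free(NULL) is a no-op */
    free(bo->pages);
    free(bo);
    toy_destroy_calls++;
}

/*
 * Build the object step by step. Whatever step fails, cleanup goes
 * through toy_bo_destroy(), so bo is invalid after an error return.
 * fail_step (sketch-only knob): 0 = succeed, 1 = fail at pages,
 * 2 = fail at mapping.
 */
static int toy_bo_init(struct toy_bo *bo, int fail_step)
{
    bo->pages = NULL;
    bo->mapping = NULL;

    bo->pages = (fail_step == 1) ? NULL : malloc(32);
    if (bo->pages == NULL)
        goto err;
    bo->mapping = (fail_step == 2) ? NULL : malloc(32);
    if (bo->mapping == NULL)
        goto err;
    return 0;
err:
    toy_bo_destroy(bo);
    return -1;
}

/* Retry with a new object per attempt; never reuse a failed one. */
static struct toy_bo *toy_bo_create_retry(const int *fail_steps,
                                          size_t attempts)
{
    for (size_t i = 0; i < attempts; i++) {
        struct toy_bo *bo = calloc(1, sizeof(*bo));

        if (bo == NULL)
            return NULL;
        if (toy_bo_init(bo, fail_steps[i]) == 0)
            return bo;
        /* bo was destroyed by toy_bo_init(); loop allocates a new one */
    }
    return NULL;
}
```

Calling toy_bo_init() a second time on the same pointer after it returned an error would write to freed memory, which is the pattern the retry in the commit above appears to follow.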
Does the patch below fix the problem?
Yes, indeed. I was just about to send the same patch to the list.
Thanks.