Re: Radeon RS780 - BUG: unable to handle kernel NULL pointer dereference

9 Nov 2010

On Tue, 2010-11-09 at 11:07 +0100, Thomas Hellstrom wrote:
...
On 11/09/2010 10:53 AM, Thomas Hellstrom wrote:
...
On 11/09/2010 10:29 AM, Markus Trippelsdorf wrote:
...
OK I've found the buggy commit by bisection:
e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
commit e376573f7267390f4e1bdc552564b6fb913bce76
Author: Michel Dänzer <daenzer@vmware.com>
Date:   Thu Jul 8 12:43:28 2010 +1000
 drm/radeon: fall back to GTT if bo creation/validation in VRAM fails.
 This fixes a problem where on low VRAM cards we'd run out of space for validation.
 [airlied: Tested on my M7, Thinkpad T42, compiz works with no problems.]
 Signed-off-by: Michel Dänzer <daenzer@vmware.com>
 Cc: stable@kernel.org
 Signed-off-by: Dave Airlie <airlied@redhat.com>


Please note that this is an old commit from 2.6.36-rc. When I revert it, the
kernel no longer crashes. Instead I see the following in my dmesg:
Hmm, so this sounds like something in the Radeon eviction error path 
is causing corruption.
I had a similar problem with vmwgfx, when I tried to unref a BO 
_after_ ttm_bo_init() failed.
ttm_bo_init() is really supposed to call unref itself for various 
reasons, so calling unref() or kfree() after a failed ttm_bo_init() 
will cause corruption.
In any case, the error below also suggests something is a bit fragile 
in the Radeon driver:
First, an accelerated eviction may fail, like in the message below, 
but then there must always be a backup plan, like unaccelerated 
eviction to system. On BO creation, there are a number of placement 
strategies, but if all else fails, it should be possible to initially 
place the BO in system memory.
Second, if BO validation fails during a command submission due to 
insufficient VRAM / TT, the driver should retry the complete 
validation cycle after first blocking all other validators and then 
evicting everything not pinned, to avoid failures due to fragmentation.
/Thomas
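The first point above (always have a last-resort placement) can be sketched roughly as follows. This is only an illustration under stated assumptions: `demo_pools`, `demo_try_place()` and `demo_place_with_fallback()` are hypothetical names invented for the example, not TTM or Radeon functions, and the "pools" are plain byte counters rather than real memory managers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical placement targets, ordered from most to least preferred. */
enum demo_placement { DEMO_VRAM, DEMO_TT, DEMO_SYSTEM };

/* Hypothetical bookkeeping: free bytes per pool, indexed by placement. */
struct demo_pools {
	size_t free_bytes[3];
};

/* Try a single placement; fail with -ENOMEM if the pool is too small. */
static int demo_try_place(struct demo_pools *p, enum demo_placement where,
			  size_t size)
{
	if (p->free_bytes[where] < size)
		return -ENOMEM;
	p->free_bytes[where] -= size;
	return 0;
}

/*
 * Walk the placement list in order of preference. System memory is the
 * last entry, so creation only fails if even the fallback is exhausted.
 */
static int demo_place_with_fallback(struct demo_pools *p, size_t size,
				    enum demo_placement *placed)
{
	static const enum demo_placement order[] = {
		DEMO_VRAM, DEMO_TT, DEMO_SYSTEM
	};
	size_t i;

	assert(p != NULL && placed != NULL);

	for (i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
		if (demo_try_place(p, order[i], size) == 0) {
			*placed = order[i];
			return 0;
		}
	}
	return -ENOMEM;
}
```

The second point (block other validators, evict everything unpinned, then retry the whole validation cycle) is not modelled here; it would wrap a loop like this one rather than replace it.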
Indeed, it seems like the commit you mention just retries ttm_bo_init() 
after it previously failed. At that point the bo has been destroyed, so 
that is probably what's causing the BUG you are seeing.
Admittedly, ttm_bo_init() calling unref on failure is not properly 
documented in the function description. The reason for doing so is to 
have a single path for freeing all BO resources already allocated at the 
point of failure.
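The ownership rule described above can be illustrated with a small user-space sketch. It assumes a hypothetical `demo_bo_init()` that, like the behaviour described for ttm_bo_init(), frees the object itself on failure; `demo_bo`, `demo_bo_create()` and the `DEMO_DOMAIN_*` constants are invented for the example and are not the real Radeon/TTM API. The point is only that a retry has to allocate a fresh object, because the old pointer is already dead.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer object; not the real struct radeon_bo. */
struct demo_bo {
	int domain;
};

enum { DEMO_DOMAIN_VRAM = 1, DEMO_DOMAIN_GTT = 2 };

/*
 * Hypothetical init with a "consumes the object on failure" contract:
 * on error it frees bo itself, so the caller must not free or reuse it.
 * VRAM placement always fails here, simulating a card that is out of VRAM.
 */
static int demo_bo_init(struct demo_bo *bo, int domain)
{
	if (domain == DEMO_DOMAIN_VRAM) {
		free(bo);	/* the single cleanup path lives in init */
		return -ENOMEM;
	}
	bo->domain = domain;
	return 0;
}

/* Allocate inside the retry loop so every attempt starts from scratch. */
static int demo_bo_create(int domain, struct demo_bo **out)
{
	struct demo_bo *bo;
	int r;

	assert(out != NULL);
	*out = NULL;

	for (;;) {
		bo = calloc(1, sizeof(*bo));
		if (bo == NULL)
			return -ENOMEM;

		r = demo_bo_init(bo, domain);
		if (r == 0) {
			*out = bo;
			return 0;
		}
		/* bo was already freed by demo_bo_init(); do not touch it. */
		if (domain != DEMO_DOMAIN_VRAM)
			return r;
		domain = DEMO_DOMAIN_GTT;	/* fall back, then retry */
	}
}
```

Had the allocation been placed before the loop, the second call to `demo_bo_init()` would operate on freed memory, which is the kind of corruption discussed in this thread.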
Does the patch below fix the problem?
commit e224472eedbda391ddb6d8b88f26e82e1c3b036b
Author: Michel Dänzer <daenzer@vmware.com>
Date:   Tue Nov 9 11:30:41 2010 +0100

    drm/radeon/kms: Fix retrying ttm_bo_init() after it failed once.

    If ttm_bo_init() returns failure, it already destroyed the BO, so we need to
    retry from scratch.

    Signed-off-by: Michel Dänzer <daenzer@vmware.com>
    Cc: stable@kernel.org

diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
index 1b9004e..bbe92d5 100644
--- a/drivers/gpu/drm/radeon/radeon_object.c
+++ b/drivers/gpu/drm/radeon/radeon_object.c
@@ -102,6 +102,8 @@ int radeon_bo_create(struct radeon_device *rdev, struct drm_gem_object *gobj,
    	type = ttm_bo_type_device;
    }
    *bo_ptr = NULL;
+
+retry:
    bo = kzalloc(sizeof(struct radeon_bo), GFP_KERNEL);
    if (bo == NULL)
    	return -ENOMEM;
@@ -109,8 +111,6 @@ int radeon_bo_create(struct radeon_device *rdev, struct drm_gem_object *gobj,
    bo->gobj = gobj;
    bo->surface_reg = -1;
    INIT_LIST_HEAD(&bo->list);
-
-retry:
    radeon_ttm_placement_from_domain(bo, domain);
    /* Kernel allocation are uninterruptible */
    mutex_lock(&rdev->vram_mutex);
-- 
Earthling Michel Dänzer           |                http://www.vmware.com
Libre software enthusiast         |          Debian, X and DRI developer

    
