[PATCH] drm/nouveau: avoid null deref on bad arguments to nouveau_vma_getmap

List overview All Threads
Download

newer

older

[PATCH 00/18] dri3+gallium patch...

[Bug 67961] New: radeon drm:...

Ilia Mirkin

22 Aug 2013 22 Aug '13

12:10 a.m.

The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+ ---

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

drivers/gpu/drm/nouveau/nouveau_bo.c | 6 ++++++ 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c index cdc3282..191145d 100644 --- a/drivers/gpu/drm/nouveau/nouveau_bo.c +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c @@ -963,6 +963,12 @@ nouveau_vma_getmap(struct nouveau_channel *chan, struct nouveau_bo *nvbo, struct nouveau_mem *node = mem->mm_node; int ret;

+ /* If we ever get here for a non-vram mem node that doesn't + * have pages, we will end up doing a null deref in + * nouveau_vm_map_sg. */ + if (WARN_ON(mem->mem_type != TTM_PL_VRAM && !node->pages)) + return -EINVAL; + ret = nouveau_vm_get(nv_client(chan->cli)->vm, mem->num_pages << PAGE_SHIFT, node->page_shift, NV_MEM_ACCESS_RW, vma);

-- 1.8.1.5

Show replies by date

Ben Skeggs

22 Aug 22 Aug

6:41 a.m.

On Thu, Aug 22, 2013 at 10:10 AM, Ilia Mirkin imirkin@alum.mit.edu wrote:

...

The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

drivers/gpu/drm/nouveau/nouveau_bo.c | 6 ++++++ 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c index cdc3282..191145d 100644 --- a/drivers/gpu/drm/nouveau/nouveau_bo.c +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c @@ -963,6 +963,12 @@ nouveau_vma_getmap(struct nouveau_channel *chan, struct nouveau_bo *nvbo, struct nouveau_mem *node = mem->mm_node; int ret;
  /* If we ever get here for a non-vram mem node that doesn't
   * have pages, we will end up doing a null deref in
   * nouveau_vm_map_sg. */
  if (WARN_ON(mem->mem_type != TTM_PL_VRAM && !node->pages))
          return -EINVAL;

My guess here is that this is a mapping that requires the use of map_sg_table() (see nouveau_bo_move_ntfy() for the condition).

I'm not entirely sure this should even be happening to be honest. I guess TTM is trying to move a shared buffer from GART to VRAM for some reason (userspace probably asked for it?).. And well, this really shouldn't be allowed.. The other device won't be able to touch it then.

If you can confirm this is indeed what's happening, we should find out why and fix it (and have the kernel completely reject such attempts).

Ben.

...

  ret = nouveau_vm_get(nv_client(chan->cli)->vm, mem->num_pages <<
                       PAGE_SHIFT, node->page_shift,
                       NV_MEM_ACCESS_RW, vma);
-- 1.8.1.5

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Pasi Kärkkäinen

6:54 a.m.

On Thu, Aug 22, 2013 at 04:41:06PM +1000, Ben Skeggs wrote:

...

On Thu, Aug 22, 2013 at 10:10 AM, Ilia Mirkin imirkin@alum.mit.edu wrote:

...
The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

drivers/gpu/drm/nouveau/nouveau_bo.c | 6 ++++++ 1 file changed, 6 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c index cdc3282..191145d 100644 --- a/drivers/gpu/drm/nouveau/nouveau_bo.c +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c @@ -963,6 +963,12 @@ nouveau_vma_getmap(struct nouveau_channel *chan, struct nouveau_bo *nvbo, struct nouveau_mem *node = mem->mm_node; int ret;
  /* If we ever get here for a non-vram mem node that doesn't
   * have pages, we will end up doing a null deref in
   * nouveau_vm_map_sg. */
  if (WARN_ON(mem->mem_type != TTM_PL_VRAM && !node->pages))
          return -EINVAL;
My guess here is that this is a mapping that requires the use of map_sg_table() (see nouveau_bo_move_ntfy() for the condition).

I'm not entirely sure this should even be happening to be honest. I guess TTM is trying to move a shared buffer from GART to VRAM for some reason (userspace probably asked for it?).. And well, this really shouldn't be allowed.. The other device won't be able to touch it then.

If you can confirm this is indeed what's happening, we should find out why and fix it (and have the kernel completely reject such attempts).

Yes it does happen. I've been experiencing the kernel crash with Linux 3.8.x, 3.9.x and 3.10.x.

I'm able to reproduce the crash when having Optimus enabled in BIOS on my Lenovo T430 laptop with Intel IGD + Nvidia GF108 GPU, booting to Xorg desktop, and when I try to enable external DVI monitor connected to nouveau card, the kernel crashes hard..

crash traceback and WARN_ON() tracebacks with this patch applied available in the bugzilla entry: https://bugs.freedesktop.org/show_bug.cgi?id=64774

-- Pasi

...

Ben.

...
  ret = nouveau_vm_get(nv_client(chan->cli)->vm, mem->num_pages <<
                       PAGE_SHIFT, node->page_shift,
                       NV_MEM_ACCESS_RW, vma);
-- 1.8.1.5

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Maarten Lankhorst

7:12 a.m.

Op 22-08-13 02:10, Ilia Mirkin schreef:

...

The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c --- a/drivers/gpu/drm/nouveau/nouveau_display.c +++ b/drivers/gpu/drm/nouveau/nouveau_display.c @@ -138,17 +143,26 @@ nouveau_user_framebuffer_create(struct drm_device *dev, { struct nouveau_framebuffer *nouveau_fb; struct drm_gem_object *gem; + struct nouveau_bo *nvbo; int ret = -ENOMEM;

gem = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]); if (!gem) return ERR_PTR(-ENOENT);

+ nvbo = nouveau_gem_object(gem); + if (!(nvbo->valid_domains & NOUVEAU_GEM_DOMAIN_VRAM)) { + nv_warn(nouveau_drm(dev), "Trying to create a fb in vram with" + " valid_domains=%08x\n", nvbo->valid_domains); + ret = -EINVAL; + goto err_unref; + } + nouveau_fb = kzalloc(sizeof(struct nouveau_framebuffer), GFP_KERNEL); if (!nouveau_fb) goto err_unref;

- ret = nouveau_framebuffer_init(dev, nouveau_fb, mode_cmd, nouveau_gem_object(gem)); + ret = nouveau_framebuffer_init(dev, nouveau_fb, mode_cmd, nvbo); if (ret) goto err;

Pasi Kärkkäinen

7:58 a.m.

On Thu, Aug 22, 2013 at 09:12:40AM +0200, Maarten Lankhorst wrote:

...

Op 22-08-13 02:10, Ilia Mirkin schreef:

...
The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

Thanks! I'll give it a try later today..

-- Pasi

...

diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c --- a/drivers/gpu/drm/nouveau/nouveau_display.c +++ b/drivers/gpu/drm/nouveau/nouveau_display.c @@ -138,17 +143,26 @@ nouveau_user_framebuffer_create(struct drm_device *dev, { struct nouveau_framebuffer *nouveau_fb; struct drm_gem_object *gem;
struct nouveau_bo *nvbo; int ret = -ENOMEM;

gem = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]); if (!gem) return ERR_PTR(-ENOENT);

nvbo = nouveau_gem_object(gem);

if (!(nvbo->valid_domains & NOUVEAU_GEM_DOMAIN_VRAM)) {
nv_warn(nouveau_drm(dev), "Trying to create a fb in vram with"
	" valid_domains=%08x\n", nvbo->valid_domains);
ret = -EINVAL;
goto err_unref;
}

nouveau_fb = kzalloc(sizeof(struct nouveau_framebuffer), GFP_KERNEL); if (!nouveau_fb) goto err_unref;
ret = nouveau_framebuffer_init(dev, nouveau_fb, mode_cmd, nouveau_gem_object(gem));

ret = nouveau_framebuffer_init(dev, nouveau_fb, mode_cmd, nvbo); if (ret) goto err;

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Pasi Kärkkäinen

23 Aug 23 Aug

8:20 p.m.

On Thu, Aug 22, 2013 at 09:12:40AM +0200, Maarten Lankhorst wrote:

...

Op 22-08-13 02:10, Ilia Mirkin schreef:

...
The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

Maarten: I've been testing this stuff with Linux 3.10.x, so I had to modify the patch a bit to make it apply there.. I've attached the modified copy that applies to 3.10.9, hopefully I did the backport correctly.

With Linux 3.10.9 and the patch applied the kernel doesn't crash anymore, and I get this error in dmesg:

[ 76.105643] nouveau W[ DRM] Trying to create a fb in vram with valid_domains=00000004

Does that help?

-- Pasi

Pasi Kärkkäinen

28 Aug 28 Aug

6:29 a.m.

On Fri, Aug 23, 2013 at 11:20:42PM +0300, Pasi Kärkkäinen wrote:

...

On Thu, Aug 22, 2013 at 09:12:40AM +0200, Maarten Lankhorst wrote:

...
Op 22-08-13 02:10, Ilia Mirkin schreef:

...
The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

Maarten: I've been testing this stuff with Linux 3.10.x, so I had to modify the patch a bit to make it apply there.. I've attached the modified copy that applies to 3.10.9, hopefully I did the backport correctly.

With Linux 3.10.9 and the patch applied the kernel doesn't crash anymore, and I get this error in dmesg:

[ 76.105643] nouveau W[ DRM] Trying to create a fb in vram with valid_domains=00000004

Does that help?

Any comments?

Maarten's patch works for me, I get that warning instead of a kernel crash, but it's also a bigger change that doesn't apply to older kernels as-is.

Ilia's original patch in this thread can be applied to older kernels as-is, and it also prevents the kernel crash from happening.

Should we get both applied, so Ilia's patch can be CC'd to stable ?

-- Pasi

...

--- linux-3.10.9-100.fc18.x86_64/drivers/gpu/drm/nouveau/nouveau_display.c.orig 2013-07-01 01:13:29.000000000 +0300 +++ linux-3.10.9-100.fc18.x86_64/drivers/gpu/drm/nouveau/nouveau_display.c 2013-08-23 19:56:52.038234281 +0300 @@ -137,18 +137,31 @@ { struct nouveau_framebuffer *nouveau_fb; struct drm_gem_object *gem;
struct nouveau_bo *nvbo; int ret;

gem = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]); if (!gem) return ERR_PTR(-ENOENT);

nvbo = nouveau_gem_object(gem);

if (!(nvbo->valid_domains & NOUVEAU_GEM_DOMAIN_VRAM)) {
nv_warn(nouveau_drm(dev), "Trying to create a fb in vram with"
	" valid_domains=%08x\n", nvbo->valid_domains);
ret = -EINVAL;
drm_gem_object_unreference(gem);
return ERR_PTR(ret);
}

nouveau_fb = kzalloc(sizeof(struct nouveau_framebuffer), GFP_KERNEL);
if (!nouveau_fb)
if (!nouveau_fb) {
drm_gem_object_unreference(gem);
return ERR_PTR(-ENOMEM);
}
ret = nouveau_framebuffer_init(dev, nouveau_fb, mode_cmd, nouveau_gem_object(gem));
ret = nouveau_framebuffer_init(dev, nouveau_fb, mode_cmd, nvbo); if (ret) {
kfree(nouveau_fb);
drm_gem_object_unreference(gem); return ERR_PTR(ret); }

...

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Maarten Lankhorst

7:44 a.m.

Op 28-08-13 08:29, Pasi Kärkkäinen schreef:

...

On Fri, Aug 23, 2013 at 11:20:42PM +0300, Pasi Kärkkäinen wrote:

...
On Thu, Aug 22, 2013 at 09:12:40AM +0200, Maarten Lankhorst wrote:

...
Op 22-08-13 02:10, Ilia Mirkin schreef:

...
The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

Maarten: I've been testing this stuff with Linux 3.10.x, so I had to modify the patch a bit to make it apply there.. I've attached the modified copy that applies to 3.10.9, hopefully I did the backport correctly.

With Linux 3.10.9 and the patch applied the kernel doesn't crash anymore, and I get this error in dmesg:

[ 76.105643] nouveau W[ DRM] Trying to create a fb in vram with valid_domains=00000004

Does that help?

Any comments?

Maarten's patch works for me, I get that warning instead of a kernel crash, but it's also a bigger change that doesn't apply to older kernels as-is.

Ilia's original patch in this thread can be applied to older kernels as-is, and it also prevents the kernel crash from happening.

Should we get both applied, so Ilia's patch can be CC'd to stable ?

You haven't understood the root cause then. You're trying to move an IMPORTED dma-buf into VRAM. Doing so would seem to work, but at that point it's no longer a dma-buf so changes done by the exporter would not show up in nouveau because they no longer refer to the same piece of memory.

Failing is the only right option here.

~Maarten

Pasi Kärkkäinen

3 Sep 3 Sep

2:20 p.m.

On Wed, Aug 28, 2013 at 09:44:15AM +0200, Maarten Lankhorst wrote:

...

Op 28-08-13 08:29, Pasi Kärkkäinen schreef:

...
On Fri, Aug 23, 2013 at 11:20:42PM +0300, Pasi Kärkkäinen wrote:

...
On Thu, Aug 22, 2013 at 09:12:40AM +0200, Maarten Lankhorst wrote:

...
Op 22-08-13 02:10, Ilia Mirkin schreef:

...
The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

Maarten: I've been testing this stuff with Linux 3.10.x, so I had to modify the patch a bit to make it apply there.. I've attached the modified copy that applies to 3.10.9, hopefully I did the backport correctly.

With Linux 3.10.9 and the patch applied the kernel doesn't crash anymore, and I get this error in dmesg:

[ 76.105643] nouveau W[ DRM] Trying to create a fb in vram with valid_domains=00000004

Does that help?

Any comments?

Maarten's patch works for me, I get that warning instead of a kernel crash, but it's also a bigger change that doesn't apply to older kernels as-is.

Ilia's original patch in this thread can be applied to older kernels as-is, and it also prevents the kernel crash from happening.

Should we get both applied, so Ilia's patch can be CC'd to stable ?

You haven't understood the root cause then. You're trying to move an IMPORTED dma-buf into VRAM. Doing so would seem to work, but at that point it's no longer a dma-buf so changes done by the exporter would not show up in nouveau because they no longer refer to the same piece of memory.

Yes, that's true, I don't understand the root cause. That's mostly because I'm not familiar with the nouveau code/internals.

...

Failing is the only right option here.

Sorry but I'm not sure I understand that correctly.. what exactly do you suggest? What should fail?

Because I'm not familiar with the code (yet) understanding and finding the root cause will take time for me.. that's why I was suggesting to meanwhile apply Ilia's very simple patch from this thread which actually prevents the hard kernel crash from happening, because it seems like an unharmful fix to have to protect against this kind of bugs (the root cause) ? Or is that unacceptable?

(To recap: The kernel crash happens very often when trying to use the nouveau adapter in Optimus mode, with all kernels at least 3.8+, and it's very annoying that your laptop crashes when trying to enable a nouveau output. So Ilia's patch doesn't make the driver working properly, but at least it prevents the hard kernel crash from happening. The crash bug is in many kernel versions, including the long-term v3.10 tree. I've had the crash happening with 3.8.x, 3.9.x and 3.10.x)

All comments welcome.

Thanks,

-- Pasi

Maarten Lankhorst

2:34 p.m.

Op 03-09-13 16:20, Pasi Kärkkäinen schreef:

...

On Wed, Aug 28, 2013 at 09:44:15AM +0200, Maarten Lankhorst wrote:

...
Op 28-08-13 08:29, Pasi Kärkkäinen schreef:

...
On Fri, Aug 23, 2013 at 11:20:42PM +0300, Pasi Kärkkäinen wrote:

...
On Thu, Aug 22, 2013 at 09:12:40AM +0200, Maarten Lankhorst wrote:

...
Op 22-08-13 02:10, Ilia Mirkin schreef:

...
The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

Maarten: I've been testing this stuff with Linux 3.10.x, so I had to modify the patch a bit to make it apply there.. I've attached the modified copy that applies to 3.10.9, hopefully I did the backport correctly.

With Linux 3.10.9 and the patch applied the kernel doesn't crash anymore, and I get this error in dmesg:

[ 76.105643] nouveau W[ DRM] Trying to create a fb in vram with valid_domains=00000004

Does that help?

Any comments?

Maarten's patch works for me, I get that warning instead of a kernel crash, but it's also a bigger change that doesn't apply to older kernels as-is.

Ilia's original patch in this thread can be applied to older kernels as-is, and it also prevents the kernel crash from happening.

Should we get both applied, so Ilia's patch can be CC'd to stable ?

You haven't understood the root cause then. You're trying to move an IMPORTED dma-buf into VRAM. Doing so would seem to work, but at that point it's no longer a dma-buf so changes done by the exporter would not show up in nouveau because they no longer refer to the same piece of memory.

Yes, that's true, I don't understand the root cause. That's mostly because I'm not familiar with the nouveau code/internals.

...
Failing is the only right option here.

Sorry but I'm not sure I understand that correctly.. what exactly do you suggest? What should fail?

Because I'm not familiar with the code (yet) understanding and finding the root cause will take time for me.. that's why I was suggesting to meanwhile apply Ilia's very simple patch from this thread which actually prevents the hard kernel crash from happening, because it seems like an unharmful fix to have to protect against this kind of bugs (the root cause) ? Or is that unacceptable?

(To recap: The kernel crash happens very often when trying to use the nouveau adapter in Optimus mode, with all kernels at least 3.8+, and it's very annoying that your laptop crashes when trying to enable a nouveau output. So Ilia's patch doesn't make the driver working properly, but at least it prevents the hard kernel crash from happening. The crash bug is in many kernel versions, including the long-term v3.10 tree. I've had the crash happening with 3.8.x, 3.9.x and 3.10.x)

The fix probably requires commit fdfb8332651db7a280851dfccfc4f0cff4bcd052 to apply cleanly "drm/nouveau: fix some error-path leaks in fbcon handling code"

...

All comments welcome.

Thanks,

-- Pasi

Pasi Kärkkäinen

2:48 p.m.

On Tue, Sep 03, 2013 at 04:34:56PM +0200, Maarten Lankhorst wrote:

...

Op 03-09-13 16:20, Pasi Kärkkäinen schreef:

...
On Wed, Aug 28, 2013 at 09:44:15AM +0200, Maarten Lankhorst wrote:

...
Op 28-08-13 08:29, Pasi Kärkkäinen schreef:

...
On Fri, Aug 23, 2013 at 11:20:42PM +0300, Pasi Kärkkäinen wrote:

...
On Thu, Aug 22, 2013 at 09:12:40AM +0200, Maarten Lankhorst wrote:

...
Op 22-08-13 02:10, Ilia Mirkin schreef: > The code expects non-VRAM mem nodes to have a pages list. If that's not > set, it will do a null deref down the line. Warn on that condition and > return an error. > > See https://bugs.freedesktop.org/show_bug.cgi?id=64774 > > Reported-by: Pasi Kärkkäinen pasik@iki.fi > Tested-by: Pasi Kärkkäinen pasik@iki.fi > Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu > Cc: stable@vger.kernel.org # 3.8+ > --- > > I don't exactly understand what's going on, but this is just a > straightforward way to avoid a null deref that you see happens in the > bug. I haven't figured out the root cause of this, but it's getting > well into the "I have no idea how TTM works" space. However this seems > like a bit of defensive programming -- nouveau_vm_map_sg will pass > node->pages as a list down, which will be dereferenced by > nvc0_vm_map_sg. Perhaps the other arguments should make that > dereferencing not happen, but it definitely was happening here, as you > can see in the bug. > > Ben/Maarten, I'll let you judge whether this check is appropriate, > since like I hope I was able to convey above, I'm just not really sure :) Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

Maarten: I've been testing this stuff with Linux 3.10.x, so I had to modify the patch a bit to make it apply there.. I've attached the modified copy that applies to 3.10.9, hopefully I did the backport correctly.

With Linux 3.10.9 and the patch applied the kernel doesn't crash anymore, and I get this error in dmesg:

[ 76.105643] nouveau W[ DRM] Trying to create a fb in vram with valid_domains=00000004

Does that help?

Any comments?

Maarten's patch works for me, I get that warning instead of a kernel crash, but it's also a bigger change that doesn't apply to older kernels as-is.

Ilia's original patch in this thread can be applied to older kernels as-is, and it also prevents the kernel crash from happening.

Should we get both applied, so Ilia's patch can be CC'd to stable ?

You haven't understood the root cause then. You're trying to move an IMPORTED dma-buf into VRAM. Doing so would seem to work, but at that point it's no longer a dma-buf so changes done by the exporter would not show up in nouveau because they no longer refer to the same piece of memory.

Yes, that's true, I don't understand the root cause. That's mostly because I'm not familiar with the nouveau code/internals.

...
Failing is the only right option here.

Sorry but I'm not sure I understand that correctly.. what exactly do you suggest? What should fail?

Because I'm not familiar with the code (yet) understanding and finding the root cause will take time for me.. that's why I was suggesting to meanwhile apply Ilia's very simple patch from this thread which actually prevents the hard kernel crash from happening, because it seems like an unharmful fix to have to protect against this kind of bugs (the root cause) ? Or is that unacceptable?

(To recap: The kernel crash happens very often when trying to use the nouveau adapter in Optimus mode, with all kernels at least 3.8+, and it's very annoying that your laptop crashes when trying to enable a nouveau output. So Ilia's patch doesn't make the driver working properly, but at least it prevents the hard kernel crash from happening. The crash bug is in many kernel versions, including the long-term v3.10 tree. I've had the crash happening with 3.8.x, 3.9.x and 3.10.x)

The fix probably requires commit fdfb8332651db7a280851dfccfc4f0cff4bcd052 to apply cleanly "drm/nouveau: fix some error-path leaks in fbcon handling code"

So what you mean is to use fdfb8332651db7a280851dfccfc4f0cff4bcd052 and your patch from this thread? and skip Ilia's patch?

-- Pasi

Pasi Kärkkäinen

5:58 p.m.

On Tue, Sep 03, 2013 at 05:48:48PM +0300, Pasi Kärkkäinen wrote:

...

...
...
...
...
...
> Not it really isn't appropriate.. > > You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly > is where it's not expected to be called. > > Here, have a completely untested patch to fix things... > Maarten: I've been testing this stuff with Linux 3.10.x, so I had to modify the patch a bit to make it apply there.. I've attached the modified copy that applies to 3.10.9, hopefully I did the backport correctly.

With Linux 3.10.9 and the patch applied the kernel doesn't crash anymore, and I get this error in dmesg:

[ 76.105643] nouveau W[ DRM] Trying to create a fb in vram with valid_domains=00000004

Does that help?

Any comments?

Maarten's patch works for me, I get that warning instead of a kernel crash, but it's also a bigger change that doesn't apply to older kernels as-is.

Ilia's original patch in this thread can be applied to older kernels as-is, and it also prevents the kernel crash from happening.

Should we get both applied, so Ilia's patch can be CC'd to stable ?

You haven't understood the root cause then. You're trying to move an IMPORTED dma-buf into VRAM. Doing so would seem to work, but at that point it's no longer a dma-buf so changes done by the exporter would not show up in nouveau because they no longer refer to the same piece of memory.

Yes, that's true, I don't understand the root cause. That's mostly because I'm not familiar with the nouveau code/internals.

...
Failing is the only right option here.

Sorry but I'm not sure I understand that correctly.. what exactly do you suggest? What should fail?

Because I'm not familiar with the code (yet) understanding and finding the root cause will take time for me.. that's why I was suggesting to meanwhile apply Ilia's very simple patch from this thread which actually prevents the hard kernel crash from happening, because it seems like an unharmful fix to have to protect against this kind of bugs (the root cause) ? Or is that unacceptable?

(To recap: The kernel crash happens very often when trying to use the nouveau adapter in Optimus mode, with all kernels at least 3.8+, and it's very annoying that your laptop crashes when trying to enable a nouveau output. So Ilia's patch doesn't make the driver working properly, but at least it prevents the hard kernel crash from happening. The crash bug is in many kernel versions, including the long-term v3.10 tree. I've had the crash happening with 3.8.x, 3.9.x and 3.10.x)

The fix probably requires commit fdfb8332651db7a280851dfccfc4f0cff4bcd052 to apply cleanly "drm/nouveau: fix some error-path leaks in fbcon handling code"

So what you mean is to use fdfb8332651db7a280851dfccfc4f0cff4bcd052 and your patch from this thread? and skip Ilia's patch?

So I tested with Linux 3.10.10. I had to apply these two patches first:

1e2bd5f53b6282e711e9f074765911868f8e7dc1 drm/nouveau: fixup fbcon failure paths http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=1e2...

fdfb8332651db7a280851dfccfc4f0cff4bcd052 drm/nouveau: fix some error-path leaks in fbcon handling code http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/?id=fdf...

And with those applied I was able to apply Maarten's patch cleanly (also attached to this email).

It's worth noting that I'm able to reproduce the kernel crash bug with Fedora 18 (which has xorg-1.13), but I'm not able to reproduce it with Fedora 19 (which has xorg-1.14). Both running the same kernel. Optimus enabled in BIOS on Lenovo T430 laptop. Nvidia GF108 Discreet GPU with Intel integrated GPU.

Maarten's patch fixes the kernel crash problem, and produces a warning message instead in dmesg. You can add my Tested-By to the patch, assuming Maarten's patch is going to be merged?

Thanks,

-- Pasi

Ben Skeggs

4 Sep 4 Sep

3:41 a.m.

On Thu, Aug 22, 2013 at 5:12 PM, Maarten Lankhorst maarten.lankhorst@canonical.com wrote:

...

Op 22-08-13 02:10, Ilia Mirkin schreef:

...
The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c --- a/drivers/gpu/drm/nouveau/nouveau_display.c +++ b/drivers/gpu/drm/nouveau/nouveau_display.c @@ -138,17 +143,26 @@ nouveau_user_framebuffer_create(struct drm_device *dev, { struct nouveau_framebuffer *nouveau_fb; struct drm_gem_object *gem;
  struct nouveau_bo *nvbo;
  int ret = -ENOMEM;

  gem = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]);
  if (!gem)
          return ERR_PTR(-ENOENT);
  nvbo = nouveau_gem_object(gem);
  if (!(nvbo->valid_domains & NOUVEAU_GEM_DOMAIN_VRAM)) {
          nv_warn(nouveau_drm(dev), "Trying to create a fb in vram with"
                  " valid_domains=%08x\n", nvbo->valid_domains);
          ret = -EINVAL;
          goto err_unref;
  }

Definitely the right idea, we can't handle this case right now. However, we may someday want/need to be able to scan out of system memory, so this is the wrong place.

I suspect the correct thing to do (which'll also handle the "defensive" part) is to bail in nouveau_bo_move() on attempts to move a DMA-BUF backed object into VRAM.

Sound OK?

Ben.

...

    nouveau_fb = kzalloc(sizeof(struct nouveau_framebuffer), GFP_KERNEL);
    if (!nouveau_fb)
            goto err_unref;

  ret = nouveau_framebuffer_init(dev, nouveau_fb, mode_cmd, nouveau_gem_object(gem));

  ret = nouveau_framebuffer_init(dev, nouveau_fb, mode_cmd, nvbo);
  if (ret)
          goto err;

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Maarten Lankhorst

6:59 a.m.

Op 04-09-13 05:41, Ben Skeggs schreef:

...

On Thu, Aug 22, 2013 at 5:12 PM, Maarten Lankhorst maarten.lankhorst@canonical.com wrote:

...
Op 22-08-13 02:10, Ilia Mirkin schreef:

...
The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c --- a/drivers/gpu/drm/nouveau/nouveau_display.c +++ b/drivers/gpu/drm/nouveau/nouveau_display.c @@ -138,17 +143,26 @@ nouveau_user_framebuffer_create(struct drm_device *dev, { struct nouveau_framebuffer *nouveau_fb; struct drm_gem_object *gem;
  struct nouveau_bo *nvbo;
  int ret = -ENOMEM;

  gem = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]);
  if (!gem)
          return ERR_PTR(-ENOENT);
  nvbo = nouveau_gem_object(gem);
  if (!(nvbo->valid_domains & NOUVEAU_GEM_DOMAIN_VRAM)) {
          nv_warn(nouveau_drm(dev), "Trying to create a fb in vram with"
                  " valid_domains=%08x\n", nvbo->valid_domains);
          ret = -EINVAL;
          goto err_unref;
  }
Definitely the right idea, we can't handle this case right now. However, we may someday want/need to be able to scan out of system memory, so this is the wrong place.

I suspect the correct thing to do (which'll also handle the "defensive" part) is to bail in nouveau_bo_move() on attempts to move a DMA-BUF backed object into VRAM.

Sound OK?

If it has a WARN_ON or something that would be ok, I didn't find any other places that attempt to move buffers to VRAM though, so it's probably harmless.

When looking into this bug I noticed that nouveau_bo_vma_add needs to have a check for nvbo->page_shift == vma->vm->vmm->spg_shift, and only if the check is true it should map the page in TTM_PL_TT. Patch below. Should probably also be cc'd to stable.

~Maarten

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c index 89b992e..355a1b7 100644 --- a/drivers/gpu/drm/nouveau/nouveau_bo.c +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c @@ -1560,7 +1560,8 @@ nouveau_bo_vma_add(struct nouveau_bo *nvbo, struct nouveau_vm *vm,

if (nvbo->bo.mem.mem_type == TTM_PL_VRAM) nouveau_vm_map(vma, nvbo->bo.mem.mm_node); - else if (nvbo->bo.mem.mem_type == TTM_PL_TT) { + else if (nvbo->bo.mem.mem_type == TTM_PL_TT && + nvbo->page_shift == vma->vm->vmm->spg_shift) { if (node->sg) nouveau_vm_map_sg_table(vma, 0, size, node); else

Pasi Kärkkäinen

10 Sep 10 Sep

10:39 a.m.

On Wed, Sep 04, 2013 at 08:59:13AM +0200, Maarten Lankhorst wrote:

...

Op 04-09-13 05:41, Ben Skeggs schreef:

...
On Thu, Aug 22, 2013 at 5:12 PM, Maarten Lankhorst maarten.lankhorst@canonical.com wrote:

...
Op 22-08-13 02:10, Ilia Mirkin schreef:

...
The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c --- a/drivers/gpu/drm/nouveau/nouveau_display.c +++ b/drivers/gpu/drm/nouveau/nouveau_display.c @@ -138,17 +143,26 @@ nouveau_user_framebuffer_create(struct drm_device *dev, { struct nouveau_framebuffer *nouveau_fb; struct drm_gem_object *gem;
  struct nouveau_bo *nvbo;
  int ret = -ENOMEM;

  gem = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]);
  if (!gem)
          return ERR_PTR(-ENOENT);
  nvbo = nouveau_gem_object(gem);
  if (!(nvbo->valid_domains & NOUVEAU_GEM_DOMAIN_VRAM)) {
          nv_warn(nouveau_drm(dev), "Trying to create a fb in vram with"
                  " valid_domains=%08x\n", nvbo->valid_domains);
          ret = -EINVAL;
          goto err_unref;
  }
Definitely the right idea, we can't handle this case right now. However, we may someday want/need to be able to scan out of system memory, so this is the wrong place.

I suspect the correct thing to do (which'll also handle the "defensive" part) is to bail in nouveau_bo_move() on attempts to move a DMA-BUF backed object into VRAM.

Sound OK?
If it has a WARN_ON or something that would be ok, I didn't find any other places that attempt to move buffers to VRAM though, so it's probably harmless.

So hmm.. I guess another patch is needed for the original issue in this thread. Is someone going to submit that?

...

When looking into this bug I noticed that nouveau_bo_vma_add needs to have a check for nvbo->page_shift == vma->vm->vmm->spg_shift, and only if the check is true it should map the page in TTM_PL_TT. Patch below. Should probably also be cc'd to stable.

Thanks! Is this patch ready to be merged?

-- Pasi

...

~Maarten

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c index 89b992e..355a1b7 100644 --- a/drivers/gpu/drm/nouveau/nouveau_bo.c +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c @@ -1560,7 +1560,8 @@ nouveau_bo_vma_add(struct nouveau_bo *nvbo, struct nouveau_vm *vm,

if (nvbo->bo.mem.mem_type == TTM_PL_VRAM) nouveau_vm_map(vma, nvbo->bo.mem.mm_node);

else if (nvbo->bo.mem.mem_type == TTM_PL_TT) {
else if (nvbo->bo.mem.mem_type == TTM_PL_TT &&
 nvbo->page_shift == vma->vm->vmm->spg_shift) {
if (node->sg) nouveau_vm_map_sg_table(vma, 0, size, node); else

Pasi Kärkkäinen

25 Sep 25 Sep

2:41 p.m.

Hello,

On Wed, Sep 04, 2013 at 08:59:13AM +0200, Maarten Lankhorst wrote:

...

Op 04-09-13 05:41, Ben Skeggs schreef:

...
On Thu, Aug 22, 2013 at 5:12 PM, Maarten Lankhorst maarten.lankhorst@canonical.com wrote:

...
Op 22-08-13 02:10, Ilia Mirkin schreef:

...
The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c --- a/drivers/gpu/drm/nouveau/nouveau_display.c +++ b/drivers/gpu/drm/nouveau/nouveau_display.c @@ -138,17 +143,26 @@ nouveau_user_framebuffer_create(struct drm_device *dev, { struct nouveau_framebuffer *nouveau_fb; struct drm_gem_object *gem;
  struct nouveau_bo *nvbo;
  int ret = -ENOMEM;

  gem = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]);
  if (!gem)
          return ERR_PTR(-ENOENT);
  nvbo = nouveau_gem_object(gem);
  if (!(nvbo->valid_domains & NOUVEAU_GEM_DOMAIN_VRAM)) {
          nv_warn(nouveau_drm(dev), "Trying to create a fb in vram with"
                  " valid_domains=%08x\n", nvbo->valid_domains);
          ret = -EINVAL;
          goto err_unref;
  }
Definitely the right idea, we can't handle this case right now. However, we may someday want/need to be able to scan out of system memory, so this is the wrong place.

I suspect the correct thing to do (which'll also handle the "defensive" part) is to bail in nouveau_bo_move() on attempts to move a DMA-BUF backed object into VRAM.

Sound OK?
If it has a WARN_ON or something that would be ok, I didn't find any other places that attempt to move buffers to VRAM though, so it's probably harmless.

Ben/Maarten: Are you guys planning to take a look at this and submit another patch, or.. ?

I tested the two earlier patches from this thread, and they both fixed the problem (hard kernel crash). I'm hoping this bug could be finally solved in the kernel..

Thanks,

-- Pasi

Ben Skeggs

4:48 p.m.

On Thu, Sep 26, 2013 at 12:41 AM, Pasi Kärkkäinen pasik@iki.fi wrote:

...

Hello,

On Wed, Sep 04, 2013 at 08:59:13AM +0200, Maarten Lankhorst wrote:

...
Op 04-09-13 05:41, Ben Skeggs schreef:

...
On Thu, Aug 22, 2013 at 5:12 PM, Maarten Lankhorst maarten.lankhorst@canonical.com wrote:

...
Op 22-08-13 02:10, Ilia Mirkin schreef:

...
The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c --- a/drivers/gpu/drm/nouveau/nouveau_display.c +++ b/drivers/gpu/drm/nouveau/nouveau_display.c @@ -138,17 +143,26 @@ nouveau_user_framebuffer_create(struct drm_device *dev, { struct nouveau_framebuffer *nouveau_fb; struct drm_gem_object *gem;
  struct nouveau_bo *nvbo;
  int ret = -ENOMEM;

  gem = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]);
  if (!gem)
          return ERR_PTR(-ENOENT);
  nvbo = nouveau_gem_object(gem);
  if (!(nvbo->valid_domains & NOUVEAU_GEM_DOMAIN_VRAM)) {
          nv_warn(nouveau_drm(dev), "Trying to create a fb in vram with"
                  " valid_domains=%08x\n", nvbo->valid_domains);
          ret = -EINVAL;
          goto err_unref;
  }
Definitely the right idea, we can't handle this case right now. However, we may someday want/need to be able to scan out of system memory, so this is the wrong place.

I suspect the correct thing to do (which'll also handle the "defensive" part) is to bail in nouveau_bo_move() on attempts to move a DMA-BUF backed object into VRAM.

Sound OK?
If it has a WARN_ON or something that would be ok, I didn't find any other places that attempt to move buffers to VRAM though, so it's probably harmless.
Ben/Maarten: Are you guys planning to take a look at this and submit another patch, or.. ?

I tested the two earlier patches from this thread, and they both fixed the problem (hard kernel crash). I'm hoping this bug could be finally solved in the kernel..

I shall be looking at it properly once I'm back from XDC next week.

Thanks, Ben.

...

Thanks,

-- Pasi

Pasi Kärkkäinen

4:53 p.m.

On Thu, Sep 26, 2013 at 02:48:49AM +1000, Ben Skeggs wrote:

...

On Thu, Sep 26, 2013 at 12:41 AM, Pasi Kärkkäinen pasik@iki.fi wrote:

...
Hello,

On Wed, Sep 04, 2013 at 08:59:13AM +0200, Maarten Lankhorst wrote:

...
Op 04-09-13 05:41, Ben Skeggs schreef:

...
On Thu, Aug 22, 2013 at 5:12 PM, Maarten Lankhorst maarten.lankhorst@canonical.com wrote:

...
Op 22-08-13 02:10, Ilia Mirkin schreef:

...
The code expects non-VRAM mem nodes to have a pages list. If that's not set, it will do a null deref down the line. Warn on that condition and return an error.

See https://bugs.freedesktop.org/show_bug.cgi?id=64774

Reported-by: Pasi Kärkkäinen pasik@iki.fi Tested-by: Pasi Kärkkäinen pasik@iki.fi Signed-off-by: Ilia Mirkin imirkin@alum.mit.edu Cc: stable@vger.kernel.org # 3.8+

I don't exactly understand what's going on, but this is just a straightforward way to avoid a null deref that you see happens in the bug. I haven't figured out the root cause of this, but it's getting well into the "I have no idea how TTM works" space. However this seems like a bit of defensive programming -- nouveau_vm_map_sg will pass node->pages as a list down, which will be dereferenced by nvc0_vm_map_sg. Perhaps the other arguments should make that dereferencing not happen, but it definitely was happening here, as you can see in the bug.

Ben/Maarten, I'll let you judge whether this check is appropriate, since like I hope I was able to convey above, I'm just not really sure :)

Not it really isn't appropriate..

You'd have to call call nouveau_vm_map_sg_table instead, the only place that doesn't handle that correctly is where it's not expected to be called.

Here, have a completely untested patch to fix things...

diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c --- a/drivers/gpu/drm/nouveau/nouveau_display.c +++ b/drivers/gpu/drm/nouveau/nouveau_display.c @@ -138,17 +143,26 @@ nouveau_user_framebuffer_create(struct drm_device *dev, { struct nouveau_framebuffer *nouveau_fb; struct drm_gem_object *gem;
  struct nouveau_bo *nvbo;
  int ret = -ENOMEM;

  gem = drm_gem_object_lookup(dev, file_priv, mode_cmd->handles[0]);
  if (!gem)
          return ERR_PTR(-ENOENT);
  nvbo = nouveau_gem_object(gem);
  if (!(nvbo->valid_domains & NOUVEAU_GEM_DOMAIN_VRAM)) {
          nv_warn(nouveau_drm(dev), "Trying to create a fb in vram with"
                  " valid_domains=%08x\n", nvbo->valid_domains);
          ret = -EINVAL;
          goto err_unref;
  }
Definitely the right idea, we can't handle this case right now. However, we may someday want/need to be able to scan out of system memory, so this is the wrong place.

I suspect the correct thing to do (which'll also handle the "defensive" part) is to bail in nouveau_bo_move() on attempts to move a DMA-BUF backed object into VRAM.

Sound OK?
If it has a WARN_ON or something that would be ok, I didn't find any other places that attempt to move buffers to VRAM though, so it's probably harmless.
Ben/Maarten: Are you guys planning to take a look at this and submit another patch, or.. ?

I tested the two earlier patches from this thread, and they both fixed the problem (hard kernel crash). I'm hoping this bug could be finally solved in the kernel..
I shall be looking at it properly once I'm back from XDC next week.

Great, thanks!

-- Pasi

...

Thanks, Ben.

Pasi Kärkkäinen

18 Oct 18 Oct

2:44 p.m.

On Thu, Sep 26, 2013 at 02:48:49AM +1000, Ben Skeggs wrote:

...

...
Ben/Maarten: Are you guys planning to take a look at this and submit another patch, or.. ?

I tested the two earlier patches from this thread, and they both fixed the problem (hard kernel crash). I'm hoping this bug could be finally solved in the kernel..

I shall be looking at it properly once I'm back from XDC next week.

Any thoughts about the patch?

-- Pasi

...

Thanks, Ben.

...
Thanks,

-- Pasi

Pasi Kärkkäinen

3 Jan 3 Jan

6:32 p.m.

On Fri, Oct 18, 2013 at 05:44:45PM +0300, Pasi Kärkkäinen wrote:

...

On Thu, Sep 26, 2013 at 02:48:49AM +1000, Ben Skeggs wrote:

...
...
Ben/Maarten: Are you guys planning to take a look at this and submit another patch, or.. ?

I tested the two earlier patches from this thread, and they both fixed the problem (hard kernel crash). I'm hoping this bug could be finally solved in the kernel..

I shall be looking at it properly once I'm back from XDC next week.

Any thoughts about the patch?

Ben: ping again?

This nouveau bug is still causing hard kernel crashes out there.. recent bug report against Fedora 20: https://bugzilla.redhat.com/show_bug.cgi?id=1047169

Thanks,

-- Pasi

...

...
Thanks, Ben.

Pasi Kärkkäinen

25 Sep 25 Sep

2:42 p.m.

On Wed, Sep 04, 2013 at 08:59:13AM +0200, Maarten Lankhorst wrote:

...

When looking into this bug I noticed that nouveau_bo_vma_add needs to have a check for nvbo->page_shift == vma->vm->vmm->spg_shift, and only if the check is true it should map the page in TTM_PL_TT. Patch below. Should probably also be cc'd to stable.

How about this patch? Is it ready to go in?

Thanks,

-- Pasi

...

~Maarten

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c index 89b992e..355a1b7 100644 --- a/drivers/gpu/drm/nouveau/nouveau_bo.c +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c @@ -1560,7 +1560,8 @@ nouveau_bo_vma_add(struct nouveau_bo *nvbo, struct nouveau_vm *vm,

if (nvbo->bo.mem.mem_type == TTM_PL_VRAM) nouveau_vm_map(vma, nvbo->bo.mem.mm_node);

else if (nvbo->bo.mem.mem_type == TTM_PL_TT) {
else if (nvbo->bo.mem.mem_type == TTM_PL_TT &&
 nvbo->page_shift == vma->vm->vmm->spg_shift) {
if (node->sg) nouveau_vm_map_sg_table(vma, 0, size, node); else

Pasi Kärkkäinen

18 Oct 18 Oct

2:45 p.m.

On Wed, Sep 25, 2013 at 05:42:46PM +0300, Pasi Kärkkäinen wrote:

...

On Wed, Sep 04, 2013 at 08:59:13AM +0200, Maarten Lankhorst wrote:

...
When looking into this bug I noticed that nouveau_bo_vma_add needs to have a check for nvbo->page_shift == vma->vm->vmm->spg_shift, and only if the check is true it should map the page in TTM_PL_TT. Patch below. Should probably also be cc'd to stable.

How about this patch? Is it ready to go in?

Ping on this patch aswell..

-- Pasi

...

Thanks,

-- Pasi

...
~Maarten

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c index 89b992e..355a1b7 100644 --- a/drivers/gpu/drm/nouveau/nouveau_bo.c +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c @@ -1560,7 +1560,8 @@ nouveau_bo_vma_add(struct nouveau_bo *nvbo, struct nouveau_vm *vm,

if (nvbo->bo.mem.mem_type == TTM_PL_VRAM) nouveau_vm_map(vma, nvbo->bo.mem.mm_node);

else if (nvbo->bo.mem.mem_type == TTM_PL_TT) {
else if (nvbo->bo.mem.mem_type == TTM_PL_TT &&
 nvbo->page_shift == vma->vm->vmm->spg_shift) {
if (node->sg) nouveau_vm_map_sg_table(vma, 0, size, node); else

4133

Age (days ago)

4267

Last active (days ago)

dri-devel@lists.freedesktop.org

21 comments

4 participants

tags (0)

participants (4)

Ben Skeggs
Ilia Mirkin
Maarten Lankhorst
Pasi Kärkkäinen