From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
When the driver gets unbound while a device is open, a kernel panic may be forced if the list of allocated context IDs is not empty.
While a device is open, the list may be non-empty because a context ID, once allocated by the context ID allocator to a context associated with that open file descriptor, is not released until device close.
On the other hand, all allocated context IDs must be released and the context ID allocator destroyed on driver unbind, even if a device is open, in order to free the memory consumed and prevent memory leaks. The purpose of the forced kernel panic was to protect the context ID allocator from being silently destroyed if not all allocated IDs had been released.
Before forcing the kernel panic on a non-empty list of allocated context IDs, first force it on a non-empty list of contexts that should have been freed by the preceding drain of the work queue (this is unlikely, and there must be another bug if that list is not empty). If that list is empty, we may assume that the remaining contexts are idle (not pinned) and that their IDs can be safely released.
Once done, release the context ID of each remaining context, unless a context unexpectedly turns out to be pinned. Force a kernel panic in that case, as there must be yet another bug in the driver code.
Now the kernel panic protecting the allocator should not trigger, as the list it checks should be empty. If that list is unexpectedly not empty, there must be yet another bug.
Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
---
 drivers/gpu/drm/i915/i915_gem_context.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 280813a4bf82..18d004d94e43 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -611,6 +611,8 @@ void i915_gem_contexts_lost(struct drm_i915_private *dev_priv)
 
 void i915_gem_contexts_fini(struct drm_i915_private *i915)
 {
+	struct i915_gem_context *ctx, *cn;
+
 	lockdep_assert_held(&i915->drm.struct_mutex);
 
 	if (i915->preempt_context)
@@ -618,6 +620,14 @@ void i915_gem_contexts_fini(struct drm_i915_private *i915)
 	destroy_kernel_context(&i915->kernel_context);
 
 	/* Must free all deferred contexts (via flush_workqueue) first */
+	GEM_BUG_ON(!llist_empty(&i915->contexts.free_list));
+
+	/* Release all remaining HW IDs before ID allocator is destroyed */
+	list_for_each_entry_safe(ctx, cn, &i915->contexts.hw_id_list,
+				 hw_id_link) {
+		GEM_BUG_ON(atomic_read(&ctx->hw_id_pin_count));
+		release_hw_id(ctx);
+	}
 	GEM_BUG_ON(!list_empty(&i915->contexts.hw_id_list));
 	ida_destroy(&i915->contexts.hw_ida);
 }
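For readers who want to follow the ordering of the checks outside the kernel tree, here is a minimal userspace sketch of the same sequence. It is only a model under stated assumptions: every name in it (model_ctx, model_ids, model_fini and so on) is hypothetical and not part of i915, a fixed-size array stands in for the IDA and the hw_id_list, and negative return values stand in for the places where the patch would hit GEM_BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_IDS 8

/* Hypothetical stand-in for a context that may hold a hardware ID. */
struct model_ctx {
	int hw_id;     /* -1 when no ID is assigned */
	int pin_count; /* non-zero while the ID is in active use */
};

/* Hypothetical stand-in for the ID allocator and the list of ID holders. */
struct model_ids {
	bool in_use[MODEL_MAX_IDS];
	struct model_ctx *holders[MODEL_MAX_IDS];
};

/* Assign the lowest free ID to ctx; returns the ID, or -1 if none is free. */
static int model_assign_id(struct model_ids *ids, struct model_ctx *ctx)
{
	for (int id = 0; id < MODEL_MAX_IDS; id++) {
		if (!ids->in_use[id]) {
			ids->in_use[id] = true;
			ids->holders[id] = ctx;
			ctx->hw_id = id;
			return id;
		}
	}
	return -1;
}

/* Hand the ID held by ctx, if any, back to the allocator. */
static void model_release_id(struct model_ids *ids, struct model_ctx *ctx)
{
	if (ctx->hw_id < 0)
		return;
	assert(ctx->hw_id < MODEL_MAX_IDS);
	ids->in_use[ctx->hw_id] = false;
	ids->holders[ctx->hw_id] = NULL;
	ctx->hw_id = -1;
}

/* Number of IDs that are still allocated. */
static int model_ids_allocated(const struct model_ids *ids)
{
	int count = 0;

	for (int id = 0; id < MODEL_MAX_IDS; id++)
		count += ids->in_use[id] ? 1 : 0;
	return count;
}

/*
 * Teardown in the order the commit message describes.
 * Returns 0 on success, or a negative step number at the point
 * where the patch would trigger GEM_BUG_ON().
 */
static int model_fini(struct model_ids *ids, bool deferred_free_pending)
{
	/* Step 1: the deferred-free list must already have been drained. */
	if (deferred_free_pending)
		return -1;

	/* Step 2: release every remaining ID; a pinned holder is a bug. */
	for (int id = 0; id < MODEL_MAX_IDS; id++) {
		struct model_ctx *ctx = ids->holders[id];

		if (!ctx)
			continue;
		if (ctx->pin_count)
			return -2;
		model_release_id(ids, ctx);
	}

	/* Step 3: nothing may be left before the allocator goes away. */
	if (model_ids_allocated(ids))
		return -3;
	return 0;
}
```

In this model, calling model_fini() while one holder still has a non-zero pin_count stops at step 2, which corresponds to the second GEM_BUG_ON() the patch adds; once the pin is dropped, the same call releases the remaining IDs and succeeds.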
Quoting Janusz Krzysztofik (2019-04-04 11:24:45)
From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
When the driver gets unbound while a device is open, a kernel panic may be forced if the list of allocated context IDs is not empty.
While a device is open, the list may be non-empty because a context ID, once allocated by the context ID allocator to a context associated with that open file descriptor, is not released until device close.
On the other hand, all allocated context IDs must be released and the context ID allocator destroyed on driver unbind, even if a device is open, in order to free the memory consumed and prevent memory leaks. The purpose of the forced kernel panic was to protect the context ID allocator from being silently destroyed if not all allocated IDs had been released.
Those open fd are still pointing into kernel memory where the driver used to be. The panic is entirely correct, we should not be unloading the module before those dangling pointers have been made safe.
This is papering over the symptom. How is the module being unloaded with open fd? If all the fd have been closed, how have we failed to flush and retire all requests (thereby unpinning the contexts and all other pointers)?
-Chris
On Thu, 2019-04-04 at 11:28 +0100, Chris Wilson wrote:
Quoting Janusz Krzysztofik (2019-04-04 11:24:45)
From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
When the driver gets unbound while a device is open, a kernel panic may be forced if the list of allocated context IDs is not empty.
While a device is open, the list may be non-empty because a context ID, once allocated by the context ID allocator to a context associated with that open file descriptor, is not released until device close.
On the other hand, all allocated context IDs must be released and the context ID allocator destroyed on driver unbind, even if a device is open, in order to free the memory consumed and prevent memory leaks. The purpose of the forced kernel panic was to protect the context ID allocator from being silently destroyed if not all allocated IDs had been released.
Those open fd are still pointing into kernel memory where the driver used to be. The panic is entirely correct, we should not be unloading the module before those dangling pointers have been made safe.
This is papering over the symptom. How is the module being unloaded with open fd?
A user can play with the driver unbind or device remove sysfs interface.
Thanks, Janusz
If all the fd have been closed, how have we failed to flush and retire all requests (thereby unpinning the contexts and all other pointers)?
-Chris
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel
Quoting Janusz Krzysztofik (2019-04-04 11:40:24)
On Thu, 2019-04-04 at 11:28 +0100, Chris Wilson wrote:
Quoting Janusz Krzysztofik (2019-04-04 11:24:45)
From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
When the driver gets unbound while a device is open, a kernel panic may be forced if the list of allocated context IDs is not empty.
While a device is open, the list may be non-empty because a context ID, once allocated by the context ID allocator to a context associated with that open file descriptor, is not released until device close.
On the other hand, all allocated context IDs must be released and the context ID allocator destroyed on driver unbind, even if a device is open, in order to free the memory consumed and prevent memory leaks. The purpose of the forced kernel panic was to protect the context ID allocator from being silently destroyed if not all allocated IDs had been released.
Those open fd are still pointing into kernel memory where the driver used to be. The panic is entirely correct, we should not be unloading the module before those dangling pointers have been made safe.
This is papering over the symptom. How is the module being unloaded with open fd?
A user can play with the driver unbind or device remove sysfs interface.
Sure, but we must still follow all the steps before _unloading_ the module or else the user is left pointing into reused kernel memory.
-Chris
On Thu, 2019-04-04 at 11:43 +0100, Chris Wilson wrote:
Quoting Janusz Krzysztofik (2019-04-04 11:40:24)
On Thu, 2019-04-04 at 11:28 +0100, Chris Wilson wrote:
Quoting Janusz Krzysztofik (2019-04-04 11:24:45)
From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
When the driver gets unbound while a device is open, a kernel panic may be forced if the list of allocated context IDs is not empty.
While a device is open, the list may be non-empty because a context ID, once allocated by the context ID allocator to a context associated with that open file descriptor, is not released until device close.
On the other hand, all allocated context IDs must be released and the context ID allocator destroyed on driver unbind, even if a device is open, in order to free the memory consumed and prevent memory leaks. The purpose of the forced kernel panic was to protect the context ID allocator from being silently destroyed if not all allocated IDs had been released.
Those open fd are still pointing into kernel memory where the driver used to be. The panic is entirely correct, we should not be unloading the module before those dangling pointers have been made safe.
This is papering over the symptom. How is the module being unloaded with open fd?
A user can play with the driver unbind or device remove sysfs interface.
Sure, but we must still follow all the steps before _unloading_ the module or else the user is left pointing into reused kernel memory.
I'm not talking about unloading the module, that is prevented by open fds. The driver still exists after being unbound from a device and may just respond with -ENODEV.
Janusz
-Chris
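The distinction drawn in the reply above, between unbinding a driver from a device and unloading the module, can be illustrated with a small userspace model. This is a sketch under stated assumptions, not i915 code: all names (model_dev, model_unbind and so on) are hypothetical, a plain counter stands in for the device reference count, and a flag stands in for the "unplugged" state that makes later calls fail with -ENODEV. It is loosely similar in spirit to what the DRM core's drm_dev_unplug() and drm_dev_enter() helpers provide, but it does not reproduce them.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device instance shared by driver and open fds. */
struct model_dev {
	int refcount;   /* one reference for the binding, one per open fd */
	bool unplugged; /* set once the driver has been unbound */
	bool released;  /* set when the last reference has been dropped */
};

/* Drop one reference; the final teardown would run when it reaches zero. */
static void model_put(struct model_dev *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->released = true;
}

/* Bind: the driver takes the initial reference. */
static void model_bind(struct model_dev *dev)
{
	dev->refcount = 1;
	dev->unplugged = false;
	dev->released = false;
}

/* Open: refuse new users after unbind, otherwise take a reference. */
static int model_open(struct model_dev *dev)
{
	if (dev->unplugged)
		return -ENODEV;
	dev->refcount++;
	return 0;
}

/* Any later call on an already open fd fails once the device is gone. */
static int model_ioctl(const struct model_dev *dev)
{
	return dev->unplugged ? -ENODEV : 0;
}

/* Unbind: mark the device gone and drop only the driver's own reference. */
static void model_unbind(struct model_dev *dev)
{
	dev->unplugged = true;
	model_put(dev);
}

/* Close: drop the reference held by the open fd. */
static void model_close(struct model_dev *dev)
{
	model_put(dev);
}
```

In this model the memory behind an open fd stays valid after model_unbind(), and the final release happens only on the last model_close(). Whether resources such as context IDs may be torn down earlier than that is the point under discussion in this thread.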
Quoting Janusz Krzysztofik (2019-04-04 11:50:14)
On Thu, 2019-04-04 at 11:43 +0100, Chris Wilson wrote:
Quoting Janusz Krzysztofik (2019-04-04 11:40:24)
On Thu, 2019-04-04 at 11:28 +0100, Chris Wilson wrote:
Quoting Janusz Krzysztofik (2019-04-04 11:24:45)
From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
When the driver gets unbound while a device is open, a kernel panic may be forced if the list of allocated context IDs is not empty.
While a device is open, the list may be non-empty because a context ID, once allocated by the context ID allocator to a context associated with that open file descriptor, is not released until device close.
On the other hand, all allocated context IDs must be released and the context ID allocator destroyed on driver unbind, even if a device is open, in order to free the memory consumed and prevent memory leaks. The purpose of the forced kernel panic was to protect the context ID allocator from being silently destroyed if not all allocated IDs had been released.
Those open fd are still pointing into kernel memory where the driver used to be. The panic is entirely correct, we should not be unloading the module before those dangling pointers have been made safe.
This is papering over the symptom. How is the module being unloaded with open fd?
A user can play with the driver unbind or device remove sysfs interface.
Sure, but we must still follow all the steps before _unloading_ the module or else the user is left pointing into reused kernel memory.
I'm not talking about unloading the module, that is prevented by open fds. The driver still exists after being unbound from a device and may just respond with -ENODEV.
i915_gem_contexts_fini() *is* module unload.
-Chris
On Thu, 04 Apr 2019, Chris Wilson <chris@chris-wilson.co.uk> wrote:
Quoting Janusz Krzysztofik (2019-04-04 11:50:14)
On Thu, 2019-04-04 at 11:43 +0100, Chris Wilson wrote:
Quoting Janusz Krzysztofik (2019-04-04 11:40:24)
On Thu, 2019-04-04 at 11:28 +0100, Chris Wilson wrote:
Quoting Janusz Krzysztofik (2019-04-04 11:24:45)
From: Janusz Krzysztofik <janusz.krzysztofik@intel.com>
When the driver gets unbound while a device is open, a kernel panic may be forced if the list of allocated context IDs is not empty.
While a device is open, the list may be non-empty because a context ID, once allocated by the context ID allocator to a context associated with that open file descriptor, is not released until device close.
On the other hand, all allocated context IDs must be released and the context ID allocator destroyed on driver unbind, even if a device is open, in order to free the memory consumed and prevent memory leaks. The purpose of the forced kernel panic was to protect the context ID allocator from being silently destroyed if not all allocated IDs had been released.
Those open fd are still pointing into kernel memory where the driver used to be. The panic is entirely correct, we should not be unloading the module before those dangling pointers have been made safe.
This is papering over the symptom. How is the module being unloaded with open fd?
A user can play with the driver unbind or device remove sysfs interface.
Sure, but we must still follow all the steps before _unloading_ the module or else the user is left pointing into reused kernel memory.
I'm not talking about unloading the module, that is prevented by open fds. The driver still exists after being unbound from a device and may just respond with -ENODEV.
i915_gem_contexts_fini() *is* module unload.
Janusz, please describe what you're doing exactly.
BR, Jani.