From: Ville Syrjälä ville.syrjala@linux.intel.com
I set out to fix the pre-g4x GPU reset by protecting display commits with an rw_semaphore. I originally went all out and added infrastructure to track the committed state (as opposed to the latest swapped state), but Daniel suggested that we want to backport the thing, so I simplified it to just use obj->state instead. I will be posting the committed state handling as a followup as it will also DTRT if/when we start allowing multiple queued commits per-crtc.
Not sure if we want to put the "committed" state stuff into the atomic helper or whether I should just pull it all into i915. Suggestions welcome.
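To make the locking scheme concrete, the idea is roughly the following (just an illustration, not the actual patch; the real code is in the last patch): every commit tail takes the semaphore for reading, so normal commits still run concurrently with each other, while the reset path takes it for writing and thus waits for all in-flight commits and holds off new ones for the duration of the reset.

static DECLARE_RWSEM(commit_sem);

static void commit_tail(struct drm_atomic_state *state)
{
	down_read(&commit_sem);		/* commits may run in parallel */
	/* ... program the display hardware ... */
	up_read(&commit_sem);
}

static void gpu_reset(struct drm_i915_private *dev_priv)
{
	down_write(&commit_sem);	/* excludes all commit tails */
	/* ... disable the display, reset the GPU, re-commit the old state ... */
	up_write(&commit_sem);
}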
Series available here: git://github.com/vsyrjala/linux.git reset_commit_rwsem_norefactor_2
Ville Syrjälä (5):
  drm/atomic: Refactor drm_atomic_state_realloc_connectors()
  drm/atomic: Introduce drm_atomic_helper_duplicate_committed_state()
  drm/i915: Store vma gtt offset in plane state
  drm/i915: Refactor __intel_atomic_commit_tail()
  drm/i915: Solve the GPU reset vs. modeset deadlocks with an rw_semaphore
 drivers/gpu/drm/drm_atomic.c         |  43 ++++--
 drivers/gpu/drm/drm_atomic_helper.c  | 123 +++++++++++++++++
 drivers/gpu/drm/i915/i915_drv.h      |   2 +
 drivers/gpu/drm/i915/i915_irq.c      |  44 +----
 drivers/gpu/drm/i915/intel_display.c | 251 +++++++++++++++++++++++------------
 drivers/gpu/drm/i915/intel_drv.h     |   6 +-
 drivers/gpu/drm/i915/intel_sprite.c  |   8 +-
 include/drm/drm_atomic.h             |   5 +
 include/drm/drm_atomic_helper.h      |   4 +
 9 files changed, 338 insertions(+), 148 deletions(-)
From: Ville Syrjälä ville.syrjala@linux.intel.com
Pull the code to reallocate the state->connectors[] array into a helper function. We'll have another use for this later.
Signed-off-by: Ville Syrjälä ville.syrjala@linux.intel.com --- drivers/gpu/drm/drm_atomic.c | 43 +++++++++++++++++++++++++++++-------------- include/drm/drm_atomic.h | 5 +++++ 2 files changed, 34 insertions(+), 14 deletions(-)
diff --git a/drivers/gpu/drm/drm_atomic.c b/drivers/gpu/drm/drm_atomic.c index 095e87278a88..a9f02b214fc6 100644 --- a/drivers/gpu/drm/drm_atomic.c +++ b/drivers/gpu/drm/drm_atomic.c @@ -1043,6 +1043,32 @@ drm_atomic_get_private_obj_state(struct drm_atomic_state *state, void *obj, } EXPORT_SYMBOL(drm_atomic_get_private_obj_state);
+int drm_atomic_state_realloc_connectors(struct drm_device *dev, + struct drm_atomic_state *state, + int index) +{ + struct drm_mode_config *config = &dev->mode_config; + struct __drm_connnectors_state *c; + int alloc = max(index + 1, config->num_connector); + + if (index < state->num_connector) + return 0; + + c = krealloc(state->connectors, + alloc * sizeof(*state->connectors), GFP_KERNEL); + if (!c) + return -ENOMEM; + + state->connectors = c; + memset(&state->connectors[state->num_connector], 0, + sizeof(*state->connectors) * (alloc - state->num_connector)); + + state->num_connector = alloc; + + return 0; +} +EXPORT_SYMBOL(drm_atomic_state_realloc_connectors); + /** * drm_atomic_get_connector_state - get connector state * @state: global atomic state object @@ -1074,20 +1100,9 @@ drm_atomic_get_connector_state(struct drm_atomic_state *state,
index = drm_connector_index(connector);
- if (index >= state->num_connector) { - struct __drm_connnectors_state *c; - int alloc = max(index + 1, config->num_connector); - - c = krealloc(state->connectors, alloc * sizeof(*state->connectors), GFP_KERNEL); - if (!c) - return ERR_PTR(-ENOMEM); - - state->connectors = c; - memset(&state->connectors[state->num_connector], 0, - sizeof(*state->connectors) * (alloc - state->num_connector)); - - state->num_connector = alloc; - } + ret = drm_atomic_state_realloc_connectors(connector->dev, state, index); + if (ret) + return ERR_PTR(ret);
if (state->connectors[index].state) return state->connectors[index].state; diff --git a/include/drm/drm_atomic.h b/include/drm/drm_atomic.h index 0196f264a418..5596ad58bcdc 100644 --- a/include/drm/drm_atomic.h +++ b/include/drm/drm_atomic.h @@ -324,6 +324,11 @@ drm_atomic_get_private_obj_state(struct drm_atomic_state *state, void *obj, const struct drm_private_state_funcs *funcs);
+int __must_check +drm_atomic_state_realloc_connectors(struct drm_device *dev, + struct drm_atomic_state *state, + int index); + /** * drm_atomic_get_existing_crtc_state - get crtc state, if it exists * @state: global atomic state object
On Thu, Jun 29, 2017 at 04:49:44PM +0300, ville.syrjala@linux.intel.com wrote:
From: Ville Syrjälä ville.syrjala@linux.intel.com
Pull the code to reallocate the state->connectors[] array into a helper function. We'll have another use for this later.
Signed-off-by: Ville Syrjälä ville.syrjala@linux.intel.com
drivers/gpu/drm/drm_atomic.c | 43 +++++++++++++++++++++++++++++-------------- include/drm/drm_atomic.h | 5 +++++ 2 files changed, 34 insertions(+), 14 deletions(-)
diff --git a/drivers/gpu/drm/drm_atomic.c b/drivers/gpu/drm/drm_atomic.c index 095e87278a88..a9f02b214fc6 100644 --- a/drivers/gpu/drm/drm_atomic.c +++ b/drivers/gpu/drm/drm_atomic.c @@ -1043,6 +1043,32 @@ drm_atomic_get_private_obj_state(struct drm_atomic_state *state, void *obj, } EXPORT_SYMBOL(drm_atomic_get_private_obj_state);
+int drm_atomic_state_realloc_connectors(struct drm_device *dev,
+					 struct drm_atomic_state *state,
+					 int index)
Needs some pretty kerneldoc, best with some explanation why/when drivers might want to use this (i.e. when they're constructing their own state for special reasons like hw state read-out or recovery after a hw reset or whatever). Commit message should explain that too, but better to stuff it into the kerneldoc. With that addressed:
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
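For illustration, something along these lines perhaps (just a rough sketch of the wording, not part of the patch):

/**
 * drm_atomic_state_realloc_connectors - grow the per-connector state array
 * @dev: DRM device
 * @state: atomic state
 * @index: connector index that must fit into the array
 *
 * Reallocates @state->connectors[] so that @index becomes a valid slot.
 * Useful for drivers that construct a free-standing atomic state themselves,
 * e.g. for hardware state read-out or for recovery after a GPU reset, and
 * hence can't rely on drm_atomic_get_connector_state() doing the
 * reallocation for them.
 *
 * Returns 0 on success, or -ENOMEM on allocation failure.
 */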
+{
+	struct drm_mode_config *config = &dev->mode_config;
+	struct __drm_connnectors_state *c;
+	int alloc = max(index + 1, config->num_connector);
+
+	if (index < state->num_connector)
+		return 0;
+
+	c = krealloc(state->connectors,
+		     alloc * sizeof(*state->connectors), GFP_KERNEL);
+	if (!c)
+		return -ENOMEM;
+
+	state->connectors = c;
+	memset(&state->connectors[state->num_connector], 0,
+	       sizeof(*state->connectors) * (alloc - state->num_connector));
+
+	state->num_connector = alloc;
+
+	return 0;
+}
+EXPORT_SYMBOL(drm_atomic_state_realloc_connectors);
/**
 * drm_atomic_get_connector_state - get connector state
 * @state: global atomic state object
@@ -1074,20 +1100,9 @@ drm_atomic_get_connector_state(struct drm_atomic_state *state,
index = drm_connector_index(connector);
-	if (index >= state->num_connector) {
-		struct __drm_connnectors_state *c;
-		int alloc = max(index + 1, config->num_connector);
-
-		c = krealloc(state->connectors, alloc * sizeof(*state->connectors), GFP_KERNEL);
-		if (!c)
-			return ERR_PTR(-ENOMEM);
-
-		state->connectors = c;
-		memset(&state->connectors[state->num_connector], 0,
-		       sizeof(*state->connectors) * (alloc - state->num_connector));
-
-		state->num_connector = alloc;
-	}
+	ret = drm_atomic_state_realloc_connectors(connector->dev, state, index);
+	if (ret)
+		return ERR_PTR(ret);
if (state->connectors[index].state) return state->connectors[index].state;
diff --git a/include/drm/drm_atomic.h b/include/drm/drm_atomic.h index 0196f264a418..5596ad58bcdc 100644 --- a/include/drm/drm_atomic.h +++ b/include/drm/drm_atomic.h @@ -324,6 +324,11 @@ drm_atomic_get_private_obj_state(struct drm_atomic_state *state, void *obj, const struct drm_private_state_funcs *funcs);
+int __must_check
+drm_atomic_state_realloc_connectors(struct drm_device *dev,
+				     struct drm_atomic_state *state,
+				     int index);
/**
 * drm_atomic_get_existing_crtc_state - get crtc state, if it exists
 * @state: global atomic state object
-- 2.13.0
From: Ville Syrjälä ville.syrjala@linux.intel.com
For i915 GPU reset handling we'll want to be able to duplicate the state that was last committed to the hardware, so let's provide a helper function for that purpose. For now we'll actually just duplicate the last swapped state for each object. That isn't quite correct, but being able to duplicate the actually committed state will require larger refactoring.
Since we will access obj->state without the protection of the appropriate locks there is a small chance that this might blow up. That problem too will get solved once we start dealing with the committed state correctly.
Note that we duplicate the current state to both old_state and new_state. For the purposes of i915 GPU reset we would only need one of them, but we actually need two top level states; one for disabling everything (which would need the duplicated state to be old_state), and another to reenable everything (which would need the duplicated state to be new_state). So to make it less complex I figured I'd just always duplicate both. We might want to rethink this, if for no other reason than reducing the chances of memory allocation failure. Due to the double state duplication we need drm_atomic_helper_clean_committed_state() to clean up the duplicated old_state since that's not handled by the normal drm_atomic_state cleanup code.
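The expected usage is roughly the following (sketch only; the actual i915 usage is in the last patch of the series):

	struct drm_atomic_state *state;

	state = drm_atomic_helper_duplicate_committed_state(dev);
	if (IS_ERR(state))
		return PTR_ERR(state);

	/*
	 * ... adjust the duplicated new_state (e.g. mark everything inactive)
	 * and commit it directly, bypassing the normal check/swap path ...
	 */

	/*
	 * The duplicated old_state isn't freed by the normal drm_atomic_state
	 * cleanup, so release it explicitly before dropping the last reference.
	 */
	drm_atomic_helper_clean_committed_state(state);
	drm_atomic_state_put(state);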
TODO: do we want this in the helper, or maybe it should be just in i915?
v2: s/commited/committed/ everywhere (checkpatch) Handle state duplication errors better
Signed-off-by: Ville Syrjälä ville.syrjala@linux.intel.com --- drivers/gpu/drm/drm_atomic_helper.c | 123 ++++++++++++++++++++++++++++++++++++ include/drm/drm_atomic_helper.h | 4 ++ 2 files changed, 127 insertions(+)
diff --git a/drivers/gpu/drm/drm_atomic_helper.c b/drivers/gpu/drm/drm_atomic_helper.c index 2f269e4267da..a6ee0d16f723 100644 --- a/drivers/gpu/drm/drm_atomic_helper.c +++ b/drivers/gpu/drm/drm_atomic_helper.c @@ -3596,6 +3596,129 @@ drm_atomic_helper_duplicate_state(struct drm_device *dev, } EXPORT_SYMBOL(drm_atomic_helper_duplicate_state);
+struct drm_atomic_state * +drm_atomic_helper_duplicate_committed_state(struct drm_device *dev) +{ + struct drm_atomic_state *state; + struct drm_connector_list_iter conn_iter; + struct drm_connector *conn; + struct drm_plane *plane; + struct drm_crtc *crtc; + int err = 0; + + state = drm_atomic_state_alloc(dev); + if (!state) + return ERR_PTR(-ENOMEM); + + drm_for_each_plane(plane, dev) { + int i = drm_plane_index(plane); + + state->planes[i].ptr = plane; + state->planes[i].state = + plane->funcs->atomic_duplicate_state(plane); + state->planes[i].new_state = state->planes[i].state; + state->planes[i].old_state = + plane->funcs->atomic_duplicate_state(plane); + + if (!state->planes[i].state || + !state->planes[i].old_state) { + err = -ENOMEM; + goto free; + } + + state->planes[i].old_state->state = state; + } + + drm_for_each_crtc(crtc, dev) { + int i = drm_crtc_index(crtc); + + state->crtcs[i].ptr = crtc; + state->crtcs[i].state = + crtc->funcs->atomic_duplicate_state(crtc); + state->crtcs[i].new_state = state->crtcs[i].state; + state->crtcs[i].old_state = + crtc->funcs->atomic_duplicate_state(crtc); + + if (!state->crtcs[i].state || + !state->crtcs[i].old_state) { + err = -ENOMEM; + goto free; + } + + state->crtcs[i].old_state->state = state; + } + + drm_connector_list_iter_begin(dev, &conn_iter); + drm_for_each_connector_iter(conn, &conn_iter) { + int i = drm_connector_index(conn); + + err = drm_atomic_state_realloc_connectors(conn->dev, state, i); + if (err) + break; + + drm_connector_get(conn); + state->connectors[i].ptr = conn; + state->connectors[i].state = conn->funcs->atomic_duplicate_state(conn); + state->connectors[i].new_state = state->connectors[i].state; + state->connectors[i].old_state = conn->funcs->atomic_duplicate_state(conn); + + if (!state->connectors[i].state || + !state->connectors[i].old_state) { + err = -ENOMEM; + break; + } + + state->connectors[i].old_state->state = state; + } + drm_connector_list_iter_end(&conn_iter); + +free: + if (err < 0) { + drm_atomic_helper_clean_committed_state(state); + drm_atomic_state_put(state); + state = ERR_PTR(err); + } + + return state; +} +EXPORT_SYMBOL(drm_atomic_helper_duplicate_committed_state); + +void +drm_atomic_helper_clean_committed_state(struct drm_atomic_state *state) +{ + struct drm_plane *plane; + struct drm_crtc *crtc; + struct drm_connector *conn; + struct drm_plane_state *plane_state; + struct drm_crtc_state *crtc_state; + struct drm_connector_state *conn_state; + int i; + + /* restore the correct state->state */ + for_each_new_plane_in_state(state, plane, plane_state, i) { + if (!state->planes[i].old_state) + continue; + plane->funcs->atomic_destroy_state(plane, + state->planes[i].old_state); + state->planes[i].old_state = NULL; + } + for_each_new_crtc_in_state(state, crtc, crtc_state, i) { + if (!state->crtcs[i].old_state) + continue; + crtc->funcs->atomic_destroy_state(crtc, + state->crtcs[i].old_state); + state->crtcs[i].old_state = NULL; + } + for_each_new_connector_in_state(state, conn, conn_state, i) { + if (!state->connectors[i].old_state) + continue; + conn->funcs->atomic_destroy_state(conn, + state->connectors[i].old_state); + state->connectors[i].old_state = NULL; + } +} +EXPORT_SYMBOL(drm_atomic_helper_clean_committed_state); + /** * __drm_atomic_helper_connector_destroy_state - release connector state * @state: connector state object to release diff --git a/include/drm/drm_atomic_helper.h b/include/drm/drm_atomic_helper.h index 3bfeb2b2f746..70167c387501 100644 --- a/include/drm/drm_atomic_helper.h +++ 
b/include/drm/drm_atomic_helper.h @@ -173,6 +173,10 @@ drm_atomic_helper_connector_duplicate_state(struct drm_connector *connector); struct drm_atomic_state * drm_atomic_helper_duplicate_state(struct drm_device *dev, struct drm_modeset_acquire_ctx *ctx); +struct drm_atomic_state * +drm_atomic_helper_duplicate_committed_state(struct drm_device *dev); +void +drm_atomic_helper_clean_committed_state(struct drm_atomic_state *state); void __drm_atomic_helper_connector_destroy_state(struct drm_connector_state *state); void drm_atomic_helper_connector_destroy_state(struct drm_connector *connector,
From: Ville Syrjälä ville.syrjala@linux.intel.com
To avoid having to dereference plane_state->vma during the commit phase of plane updates, let's store the vma gtt offset (or the bus address when we need it) in the plane state. This is crucial for doing the modeset operations during GPU reset, as plane_state->vma gets cleared when we duplicate the state and we won't be calling .prepare_fb() during GPU reset plane commits.
Signed-off-by: Ville Syrjälä ville.syrjala@linux.intel.com --- drivers/gpu/drm/i915/intel_display.c | 38 +++++++++++++++++++++--------------- drivers/gpu/drm/i915/intel_drv.h | 6 +----- drivers/gpu/drm/i915/intel_sprite.c | 8 ++++---- 3 files changed, 27 insertions(+), 25 deletions(-)
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index 4e03ca6c946f..4ce81a694bd2 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -2744,7 +2744,7 @@ intel_find_initial_plane_obj(struct intel_crtc *intel_crtc, if (!state->vma) continue;
- if (intel_plane_ggtt_offset(state) == plane_config->base) { + if (state->gtt_offset == plane_config->base) { fb = c->primary->fb; drm_framebuffer_reference(fb); goto valid_fb; @@ -2771,6 +2771,8 @@ intel_find_initial_plane_obj(struct intel_crtc *intel_crtc, mutex_lock(&dev->struct_mutex); intel_state->vma = intel_pin_and_fence_fb_obj(fb, primary->state->rotation); + intel_state->gtt_offset = i915_ggtt_offset(intel_state->vma); + mutex_unlock(&dev->struct_mutex); if (IS_ERR(intel_state->vma)) { DRM_ERROR("failed to pin boot fb on pipe %d: %li\n", @@ -3122,19 +3124,16 @@ static void i9xx_update_primary_plane(struct intel_plane *primary, I915_WRITE_FW(DSPSTRIDE(plane), fb->pitches[0]); if (IS_HASWELL(dev_priv) || IS_BROADWELL(dev_priv)) { I915_WRITE_FW(DSPSURF(plane), - intel_plane_ggtt_offset(plane_state) + - crtc->dspaddr_offset); + plane_state->gtt_offset + crtc->dspaddr_offset); I915_WRITE_FW(DSPOFFSET(plane), (y << 16) | x); } else if (INTEL_GEN(dev_priv) >= 4) { I915_WRITE_FW(DSPSURF(plane), - intel_plane_ggtt_offset(plane_state) + - crtc->dspaddr_offset); + plane_state->gtt_offset + crtc->dspaddr_offset); I915_WRITE_FW(DSPTILEOFF(plane), (y << 16) | x); I915_WRITE_FW(DSPLINOFF(plane), linear_offset); } else { I915_WRITE_FW(DSPADDR(plane), - intel_plane_ggtt_offset(plane_state) + - crtc->dspaddr_offset); + plane_state->gtt_offset + crtc->dspaddr_offset); } POSTING_READ_FW(reg);
@@ -3395,7 +3394,7 @@ static void skylake_update_primary_plane(struct intel_plane *plane, }
I915_WRITE_FW(PLANE_SURF(pipe, plane_id), - intel_plane_ggtt_offset(plane_state) + surf_addr); + plane_state->gtt_offset + surf_addr);
POSTING_READ_FW(PLANE_SURF(pipe, plane_id));
@@ -9185,15 +9184,9 @@ static u32 intel_cursor_base(const struct intel_plane_state *plane_state) struct drm_i915_private *dev_priv = to_i915(plane_state->base.plane->dev); const struct drm_framebuffer *fb = plane_state->base.fb; - const struct drm_i915_gem_object *obj = intel_fb_obj(fb); u32 base;
- if (INTEL_INFO(dev_priv)->cursor_needs_physical) - base = obj->phys_handle->busaddr; - else - base = intel_plane_ggtt_offset(plane_state); - - base += plane_state->main.offset; + base = plane_state->gtt_offset + plane_state->main.offset;
/* ILK+ do this automagically */ if (HAS_GMCH_DISPLAY(dev_priv) && @@ -10846,8 +10839,11 @@ static int intel_crtc_page_flip(struct drm_crtc *crtc,
work->old_vma = to_intel_plane_state(primary->state)->vma; to_intel_plane_state(primary->state)->vma = vma; + to_intel_plane_state(primary->state)->gtt_offset = + i915_ggtt_offset(vma);
- work->gtt_offset = i915_ggtt_offset(vma) + intel_crtc->dspaddr_offset; + work->gtt_offset = to_intel_plane_state(primary->state)->gtt_offset + + intel_crtc->dspaddr_offset; work->rotation = crtc->primary->state->rotation;
/* @@ -10903,6 +10899,8 @@ static int intel_crtc_page_flip(struct drm_crtc *crtc, i915_add_request(request); cleanup_unpin: to_intel_plane_state(primary->state)->vma = work->old_vma; + to_intel_plane_state(primary->state)->gtt_offset = + i915_ggtt_offset(work->old_vma); intel_unpin_fb_vma(vma); cleanup_pending: atomic_dec(&intel_crtc->unpin_work_count); @@ -13344,6 +13342,8 @@ intel_prepare_plane_fb(struct drm_plane *plane, DRM_DEBUG_KMS("failed to attach phys object\n"); return ret; } + to_intel_plane_state(new_state)->gtt_offset = + obj->phys_handle->busaddr; } else { struct i915_vma *vma;
@@ -13354,6 +13354,8 @@ intel_prepare_plane_fb(struct drm_plane *plane, }
to_intel_plane_state(new_state)->vma = vma; + to_intel_plane_state(new_state)->gtt_offset = + i915_ggtt_offset(vma); } }
@@ -13657,6 +13659,8 @@ intel_legacy_cursor_update(struct drm_plane *plane, DRM_DEBUG_KMS("failed to attach phys object\n"); goto out_unlock; } + to_intel_plane_state(new_plane_state)->gtt_offset = + intel_fb_obj(fb)->phys_handle->busaddr; } else { struct i915_vma *vma;
@@ -13669,6 +13673,8 @@ intel_legacy_cursor_update(struct drm_plane *plane, }
to_intel_plane_state(new_plane_state)->vma = vma; + to_intel_plane_state(new_plane_state)->gtt_offset = + i915_ggtt_offset(vma); }
old_fb = old_plane_state->fb; diff --git a/drivers/gpu/drm/i915/intel_drv.h b/drivers/gpu/drm/i915/intel_drv.h index d17a32437f07..67efbc7de559 100644 --- a/drivers/gpu/drm/i915/intel_drv.h +++ b/drivers/gpu/drm/i915/intel_drv.h @@ -405,6 +405,7 @@ struct intel_plane_state { struct drm_plane_state base; struct drm_rect clip; struct i915_vma *vma; + u32 gtt_offset; /* GGTT offset, or bus address */
struct { u32 offset; @@ -1482,11 +1483,6 @@ void intel_mode_from_pipe_config(struct drm_display_mode *mode, int skl_update_scaler_crtc(struct intel_crtc_state *crtc_state); int skl_max_scale(struct intel_crtc *crtc, struct intel_crtc_state *crtc_state);
-static inline u32 intel_plane_ggtt_offset(const struct intel_plane_state *state) -{ - return i915_ggtt_offset(state->vma); -} - u32 skl_plane_ctl(const struct intel_crtc_state *crtc_state, const struct intel_plane_state *plane_state); u32 skl_plane_stride(const struct drm_framebuffer *fb, int plane, diff --git a/drivers/gpu/drm/i915/intel_sprite.c b/drivers/gpu/drm/i915/intel_sprite.c index 0c650c2cbca8..eced0772ed03 100644 --- a/drivers/gpu/drm/i915/intel_sprite.c +++ b/drivers/gpu/drm/i915/intel_sprite.c @@ -300,7 +300,7 @@ skl_update_plane(struct intel_plane *plane,
I915_WRITE_FW(PLANE_CTL(pipe, plane_id), plane_ctl); I915_WRITE_FW(PLANE_SURF(pipe, plane_id), - intel_plane_ggtt_offset(plane_state) + surf_addr); + plane_state->gtt_offset + surf_addr); POSTING_READ_FW(PLANE_SURF(pipe, plane_id));
spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags); @@ -477,7 +477,7 @@ vlv_update_plane(struct intel_plane *plane, I915_WRITE_FW(SPSIZE(pipe, plane_id), (crtc_h << 16) | crtc_w); I915_WRITE_FW(SPCNTR(pipe, plane_id), sprctl); I915_WRITE_FW(SPSURF(pipe, plane_id), - intel_plane_ggtt_offset(plane_state) + sprsurf_offset); + plane_state->gtt_offset + sprsurf_offset); POSTING_READ_FW(SPSURF(pipe, plane_id));
spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags); @@ -615,7 +615,7 @@ ivb_update_plane(struct intel_plane *plane, I915_WRITE_FW(SPRSCALE(pipe), sprscale); I915_WRITE_FW(SPRCTL(pipe), sprctl); I915_WRITE_FW(SPRSURF(pipe), - intel_plane_ggtt_offset(plane_state) + sprsurf_offset); + plane_state->gtt_offset + sprsurf_offset); POSTING_READ_FW(SPRSURF(pipe));
spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags); @@ -747,7 +747,7 @@ g4x_update_plane(struct intel_plane *plane, I915_WRITE_FW(DVSSCALE(pipe), dvsscale); I915_WRITE_FW(DVSCNTR(pipe), dvscntr); I915_WRITE_FW(DVSSURF(pipe), - intel_plane_ggtt_offset(plane_state) + dvssurf_offset); + plane_state->gtt_offset + dvssurf_offset); POSTING_READ_FW(DVSSURF(pipe));
spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags);
From: Ville Syrjälä ville.syrjala@linux.intel.com
Split intel_atomic_commit_tail() into a lower level function that does the actual commit, and a higher level one that waits for the dependencies and signals the commit as done. We'll reuse the lower level function to perform commits during GPU resets.
Signed-off-by: Ville Syrjälä ville.syrjala@linux.intel.com --- drivers/gpu/drm/i915/intel_display.c | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index 4ce81a694bd2..7cdd6ec97f80 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -13004,7 +13004,7 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work) intel_atomic_helper_free_state(dev_priv); }
-static void intel_atomic_commit_tail(struct drm_atomic_state *state) +static void __intel_atomic_commit_tail(struct drm_atomic_state *state) { struct drm_device *dev = state->dev; struct intel_atomic_state *intel_state = to_intel_atomic_state(state); @@ -13017,8 +13017,6 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state) unsigned crtc_vblank_mask = 0; int i;
- drm_atomic_helper_wait_for_dependencies(state); - if (intel_state->modeset) intel_display_power_get(dev_priv, POWER_DOMAIN_MODESET);
@@ -13143,8 +13141,6 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state) if (intel_state->modeset && intel_can_enable_sagv(state)) intel_enable_sagv(dev_priv);
- drm_atomic_helper_commit_hw_done(state); - if (intel_state->modeset) { /* As one of the primary mmio accessors, KMS has a high * likelihood of triggering bugs in unclaimed access. After we @@ -13155,6 +13151,18 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state) intel_uncore_arm_unclaimed_mmio_detection(dev_priv); intel_display_power_put(dev_priv, POWER_DOMAIN_MODESET); } +} + +static void intel_atomic_commit_tail(struct drm_atomic_state *state) +{ + struct drm_device *dev = state->dev; + struct drm_i915_private *dev_priv = to_i915(dev); + + drm_atomic_helper_wait_for_dependencies(state); + + __intel_atomic_commit_tail(state); + + drm_atomic_helper_commit_hw_done(state);
mutex_lock(&dev->struct_mutex); drm_atomic_helper_cleanup_planes(dev, state);
From: Ville Syrjälä ville.syrjala@linux.intel.com
Introduce an rw_semaphore to protect the display commits. All normal commits use down_read() and hence can proceed in parallel, but GPU reset will use down_write() making sure no other commits are in progress when we have to pull the plug on the display engine on pre-g4x platforms. There are no modeset/gem locks taken inside __intel_atomic_commit_tail() itself, and we wait for all dependencies before the down_read(), and thus there is no chance of deadlocks with this scheme.
During reset we should be recommitting the state that was committed last. But for now we'll settle for recommitting the last state for each object. Hence we may commit things a bit too soon when a GPU reset occurs. The rw_semaphore should guarantee that whatever state we observe in obj->state during reset sticks around while we do the commit. The obj->state pointer might change for some objects if another swap_state() occurs while we do our thing, so there might be some theoretical mismatch, which I think we could avoid by grabbing the rw_semaphore also around the swap_state(), but for now I didn't do that.
Another source of mismatch can happen because we sometimes use the intel_crtc->config during the actual commit, and that only gets updated when we do the commit. Hence we may get some state via ->state, some via the duplicated ->state, and some via ->config. We'll want to fix this all by tracking the committed state properly, but that will need some bigger refactoring, so for now we'll just choose to accept these potential mismatches.
I left out the state readout from the post-reset display reinitialization as that still likes to clobber crtc->state etc. If we make it use a free-standing atomic state and make sure it doesn't need any locks we could reintroduce it. For now I just mark the post-reset display state as dirty as possible to make sure we reinitialize everything.
One other potential issue remains in the form of display detection. Either we need to protect that with the same rw_semaphore as well, or perhaps it would be enough to force gmbus into bitbanging mode while the reset is happening and we don't have interrupts, and just across the actual hardware GPU reset we could hold the gmbus mutex.
v2: Keep intel_prepare/finish_reset() outside struct_mutex (Chris) v3: Drop all the committed_state refactoring to make this less obnoxious to backport (Daniel)
Cc: stable@vger.kernel.org # for the brave Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Chris Wilson chris@chris-wilson.co.uk Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101597 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99093 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101600 Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion") Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter") Signed-off-by: Ville Syrjälä ville.syrjala@linux.intel.com --- drivers/gpu/drm/i915/i915_drv.h | 2 + drivers/gpu/drm/i915/i915_irq.c | 44 +------- drivers/gpu/drm/i915/intel_display.c | 199 ++++++++++++++++++++++++----------- 3 files changed, 139 insertions(+), 106 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index effbe4f72a64..88ddd27f61b0 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -2237,6 +2237,8 @@ struct drm_i915_private { struct drm_atomic_state *modeset_restore_state; struct drm_modeset_acquire_ctx reset_ctx;
+ struct rw_semaphore commit_sem; + struct list_head vm_list; /* Global list of all address spaces */ struct i915_ggtt ggtt; /* VM representing the global address space */
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index eb4f1dca2077..9d591f17fda3 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -2587,46 +2587,6 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg) return ret; }
-struct wedge_me { - struct delayed_work work; - struct drm_i915_private *i915; - const char *name; -}; - -static void wedge_me(struct work_struct *work) -{ - struct wedge_me *w = container_of(work, typeof(*w), work.work); - - dev_err(w->i915->drm.dev, - "%s timed out, cancelling all in-flight rendering.\n", - w->name); - i915_gem_set_wedged(w->i915); -} - -static void __init_wedge(struct wedge_me *w, - struct drm_i915_private *i915, - long timeout, - const char *name) -{ - w->i915 = i915; - w->name = name; - - INIT_DELAYED_WORK_ONSTACK(&w->work, wedge_me); - schedule_delayed_work(&w->work, timeout); -} - -static void __fini_wedge(struct wedge_me *w) -{ - cancel_delayed_work_sync(&w->work); - destroy_delayed_work_on_stack(&w->work); - w->i915 = NULL; -} - -#define i915_wedge_on_timeout(W, DEV, TIMEOUT) \ - for (__init_wedge((W), (DEV), (TIMEOUT), __func__); \ - (W)->i915; \ - __fini_wedge((W))) - /** * i915_reset_device - do process context error handling work * @dev_priv: i915 device private @@ -2640,15 +2600,13 @@ static void i915_reset_device(struct drm_i915_private *dev_priv) char *error_event[] = { I915_ERROR_UEVENT "=1", NULL }; char *reset_event[] = { I915_RESET_UEVENT "=1", NULL }; char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL }; - struct wedge_me w;
kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
DRM_DEBUG_DRIVER("resetting chip\n"); kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
- /* Use a watchdog to ensure that our reset completes */ - i915_wedge_on_timeout(&w, dev_priv, 5*HZ) { + { intel_prepare_reset(dev_priv);
/* Signal that locked waiters should reset the GPU */ diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index 7cdd6ec97f80..f13c7d81d4a9 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -123,6 +123,7 @@ static void ironlake_pfit_enable(struct intel_crtc *crtc); static void intel_modeset_setup_hw_state(struct drm_device *dev, struct drm_modeset_acquire_ctx *ctx); static void intel_pre_disable_primary_noatomic(struct drm_crtc *crtc); +static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset);
struct intel_limit { struct { @@ -3491,27 +3492,85 @@ static bool gpu_reset_clobbers_display(struct drm_i915_private *dev_priv) INTEL_GEN(dev_priv) < 5 && !IS_G4X(dev_priv); }
-void intel_prepare_reset(struct drm_i915_private *dev_priv) +static void init_intel_state(struct intel_atomic_state *state) +{ + struct drm_crtc *crtc; + struct drm_crtc_state *old_crtc_state, *new_crtc_state; + int i; + + state->modeset = true; + + for_each_oldnew_crtc_in_state(&state->base, crtc, + old_crtc_state, new_crtc_state, i) { + if (new_crtc_state->active) + state->active_crtcs |= 1 << i; + else + state->active_crtcs &= ~(1 << i); + + if (old_crtc_state->active != new_crtc_state->active) + state->active_pipe_changes |= drm_crtc_mask(crtc); + } +} + +static struct drm_atomic_state * +intel_duplicate_committed_state(struct drm_i915_private *dev_priv) { - struct drm_device *dev = &dev_priv->drm; - struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx; struct drm_atomic_state *state; - int ret; + struct drm_crtc *crtc; + struct drm_crtc_state *crtc_state; + int i;
- /* - * Need mode_config.mutex so that we don't - * trample ongoing ->detect() and whatnot. - */ - mutex_lock(&dev->mode_config.mutex); - drm_modeset_acquire_init(ctx, 0); - while (1) { - ret = drm_modeset_lock_all_ctx(dev, ctx); - if (ret != -EDEADLK) - break; + state = drm_atomic_helper_duplicate_committed_state(&dev_priv->drm); + if (IS_ERR(state)) { + DRM_ERROR("Duplicating state failed with %ld\n", + PTR_ERR(state)); + return NULL; + } + + to_intel_atomic_state(state)->active_crtcs = 0; + to_intel_atomic_state(state)->cdclk.logical = dev_priv->cdclk.hw; + to_intel_atomic_state(state)->cdclk.actual = dev_priv->cdclk.hw; + + init_intel_state(to_intel_atomic_state(state)); + + /* force a full update */ + for_each_new_crtc_in_state(state, crtc, crtc_state, i) { + struct intel_crtc_state *intel_crtc_state = + to_intel_crtc_state(crtc_state); + + if (!crtc_state->active) + continue;
- drm_modeset_backoff(ctx); + crtc_state->mode_changed = true; + crtc_state->active_changed = true; + crtc_state->planes_changed = true; + crtc_state->connectors_changed = true; + crtc_state->color_mgmt_changed = true; + crtc_state->zpos_changed = true; + + intel_crtc_state->update_pipe = true; + intel_crtc_state->disable_lp_wm = true; + intel_crtc_state->disable_cxsr = true; + intel_crtc_state->update_wm_post = true; + intel_crtc_state->fb_changed = true; + intel_crtc_state->fifo_changed = true; + intel_crtc_state->wm.need_postvbl_update = true; }
+ return state; +} + +void intel_prepare_reset(struct drm_i915_private *dev_priv) +{ + struct drm_atomic_state *disable_state, *restore_state = NULL; + struct drm_crtc *crtc; + struct drm_crtc_state *crtc_state; + struct drm_plane *plane; + struct drm_plane_state *plane_state; + int i; + + down_write(&dev_priv->commit_sem); + /* reset doesn't touch the display, but flips might get nuked anyway, */ if (!i915.force_reset_modeset_test && !gpu_reset_clobbers_display(dev_priv)) @@ -3521,30 +3580,40 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv) * Disabling the crtcs gracefully seems nicer. Also the * g33 docs say we should at least disable all the planes. */ - state = drm_atomic_helper_duplicate_state(dev, ctx); - if (IS_ERR(state)) { - ret = PTR_ERR(state); - DRM_ERROR("Duplicating state failed with %i\n", ret); - return; - } + disable_state = intel_duplicate_committed_state(dev_priv); + if (IS_ERR(disable_state)) + goto out;
- ret = drm_atomic_helper_disable_all(dev, ctx); - if (ret) { - DRM_ERROR("Suspending crtc's failed with %i\n", ret); - drm_atomic_state_put(state); - return; - } + to_intel_atomic_state(disable_state)->active_crtcs = 0;
- dev_priv->modeset_restore_state = state; - state->acquire_ctx = ctx; + for_each_new_crtc_in_state(disable_state, crtc, crtc_state, i) + crtc_state->active = false; + for_each_new_plane_in_state(disable_state, plane, plane_state, i) + plane_state->visible = false; + + __intel_atomic_commit_tail(disable_state, true); + + drm_atomic_helper_clean_committed_state(disable_state); + drm_atomic_state_put(disable_state); + + restore_state = intel_duplicate_committed_state(dev_priv); + if (IS_ERR(restore_state)) + restore_state = NULL; + + for_each_old_crtc_in_state(restore_state, crtc, crtc_state, i) + crtc_state->active = false; + for_each_old_plane_in_state(restore_state, plane, plane_state, i) + plane_state->visible = false; + +out: + dev_priv->modeset_restore_state = restore_state; }
void intel_finish_reset(struct drm_i915_private *dev_priv) { struct drm_device *dev = &dev_priv->drm; - struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx; - struct drm_atomic_state *state = dev_priv->modeset_restore_state; - int ret; + struct drm_atomic_state *restore_state = + dev_priv->modeset_restore_state;
/* * Flips in the rings will be nuked by the reset, @@ -3557,7 +3626,7 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
/* reset doesn't touch the display */ if (!gpu_reset_clobbers_display(dev_priv)) { - if (!state) { + if (!restore_state) { /* * Flips in the rings have been nuked by the reset, * so update the base address of all primary @@ -3569,11 +3638,11 @@ void intel_finish_reset(struct drm_i915_private *dev_priv) */ intel_update_primary_planes(dev); } else { - ret = __intel_display_resume(dev, state, ctx); - if (ret) - DRM_ERROR("Restoring old state failed with %i\n", ret); + __intel_atomic_commit_tail(restore_state, true); } } else { + i915_redisable_vga(dev_priv); + /* * The display has been reset as well, * so need a full re-initialization. @@ -3589,18 +3658,17 @@ void intel_finish_reset(struct drm_i915_private *dev_priv) dev_priv->display.hpd_irq_setup(dev_priv); spin_unlock_irq(&dev_priv->irq_lock);
- ret = __intel_display_resume(dev, state, ctx); - if (ret) - DRM_ERROR("Restoring old state failed with %i\n", ret); + __intel_atomic_commit_tail(restore_state, true);
intel_hpd_init(dev_priv); }
- if (state) - drm_atomic_state_put(state); - drm_modeset_drop_locks(ctx); - drm_modeset_acquire_fini(ctx); - mutex_unlock(&dev->mode_config.mutex); + if (restore_state) { + drm_atomic_helper_clean_committed_state(restore_state); + drm_atomic_state_put(restore_state); + } + + up_write(&dev_priv->commit_sem); }
static bool abort_flip_on_reset(struct intel_crtc *crtc) @@ -12592,29 +12660,18 @@ static int intel_modeset_checks(struct drm_atomic_state *state) { struct intel_atomic_state *intel_state = to_intel_atomic_state(state); struct drm_i915_private *dev_priv = to_i915(state->dev); - struct drm_crtc *crtc; - struct drm_crtc_state *old_crtc_state, *new_crtc_state; - int ret = 0, i; + int ret = 0;
if (!check_digital_port_conflicts(state)) { DRM_DEBUG_KMS("rejecting conflicting digital port configuration\n"); return -EINVAL; }
- intel_state->modeset = true; intel_state->active_crtcs = dev_priv->active_crtcs; intel_state->cdclk.logical = dev_priv->cdclk.logical; intel_state->cdclk.actual = dev_priv->cdclk.actual;
- for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_crtc_state, i) { - if (new_crtc_state->active) - intel_state->active_crtcs |= 1 << i; - else - intel_state->active_crtcs &= ~(1 << i); - - if (old_crtc_state->active != new_crtc_state->active) - intel_state->active_pipe_changes |= drm_crtc_mask(crtc); - } + init_intel_state(intel_state);
/* * See if the config requires any additional preparation, e.g. @@ -13004,7 +13061,7 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work) intel_atomic_helper_free_state(dev_priv); }
-static void __intel_atomic_commit_tail(struct drm_atomic_state *state) +static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset) { struct drm_device *dev = state->dev; struct intel_atomic_state *intel_state = to_intel_atomic_state(state); @@ -13068,10 +13125,18 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
/* Only after disabling all output pipelines that will be changed can we * update the the output configuration. */ - intel_modeset_update_crtc_state(state); + if (!is_reset) + intel_modeset_update_crtc_state(state);
if (intel_state->modeset) { - drm_atomic_helper_update_legacy_modeset_state(state->dev, state); + if (!is_reset) { + drm_atomic_helper_update_legacy_modeset_state(state->dev, state); + } else { + for_each_new_crtc_in_state(state, crtc, new_crtc_state, i) { + if (new_crtc_state->enable) + drm_calc_timestamping_constants(crtc, &new_crtc_state->adjusted_mode); + } + }
intel_set_cdclk(dev_priv, &dev_priv->cdclk.actual);
@@ -13082,7 +13147,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state) if (!intel_can_enable_sagv(state)) intel_disable_sagv(dev_priv);
- intel_modeset_verify_disabled(dev, state); + if (!is_reset) + intel_modeset_verify_disabled(dev, state); }
/* Complete the events for pipes that have now been disabled */ @@ -13135,7 +13201,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state) if (put_domains[i]) modeset_put_power_domains(dev_priv, put_domains[i]);
- intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state); + if (!is_reset) + intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state); }
if (intel_state->modeset && intel_can_enable_sagv(state)) @@ -13160,10 +13227,14 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
drm_atomic_helper_wait_for_dependencies(state);
- __intel_atomic_commit_tail(state); + down_read(&dev_priv->commit_sem); + + __intel_atomic_commit_tail(state, false);
drm_atomic_helper_commit_hw_done(state);
+ up_read(&dev_priv->commit_sem); + mutex_lock(&dev->struct_mutex); drm_atomic_helper_cleanup_planes(dev, state); mutex_unlock(&dev->struct_mutex); @@ -15022,6 +15093,8 @@ int intel_modeset_init(struct drm_device *dev) INIT_WORK(&dev_priv->atomic_helper.free_work, intel_atomic_helper_free_state_worker);
+ init_rwsem(&dev_priv->commit_sem); + intel_init_quirks(dev);
intel_init_pm(dev_priv);
Quoting ville.syrjala@linux.intel.com (2017-06-29 14:49:48)
@@ -2640,15 +2600,13 @@ static void i915_reset_device(struct drm_i915_private *dev_priv) char *error_event[] = { I915_ERROR_UEVENT "=1", NULL }; char *reset_event[] = { I915_RESET_UEVENT "=1", NULL }; char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
-	struct wedge_me w;

 	kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);

 	DRM_DEBUG_DRIVER("resetting chip\n");
 	kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
/* Use a watchdog to ensure that our reset completes */
i915_wedge_on_timeout(&w, dev_priv, 5*HZ) {
Keep the wedge-if-reset-times-out mechanism. It is a nice safeguard against driver misbehaviour and is not limited to breaking the previous deadlock. If it fires, we get an error in dmesg to go along with the lost data. I quickly grew fond of having this safeguard! -Chris
From: Ville Syrjälä ville.syrjala@linux.intel.com
Introduce an rw_semaphore to protect the display commits. All normal commits use down_read() and hence can proceed in parallel, but GPU reset will use down_write() making sure no other commits are in progress when we have to pull the plug on the display engine on pre-g4x platforms. There are no modeset/gem locks taken inside __intel_atomic_commit_tail() itself, and we wait for all dependencies before the down_read(), and thus there is no chance of deadlocks with this scheme.
During reset we should be recommitting the state that was committed last. But for now we'll settle for recommitting the last state for each object. Hence we may commit things a bit too soon when a GPU reset occurs. The rw_semaphore should guarantee that whatever state we observe in obj->state during reset sticks around while we do the commit. The obj->state pointer might change for some objects if another swap_state() occurs while we do our thing, so there might be some theoretical mismatch, which I think we could avoid by grabbing the rw_semaphore also around the swap_state(), but for now I didn't do that.
Another source of mismatch can happen because we sometimes use the intel_crtc->config during the actual commit, and that only gets updated when we do the commit. Hence we may get some state via ->state, some via the duplicated ->state, and some via ->config. We'll want to fix this all by tracking the committed state properly, but that will need some bigger refactoring, so for now we'll just choose to accept these potential mismatches.
I left out the state readout from the post-reset display reinitialization as that still likes to clobber crtc->state etc. If we make it use a free-standing atomic state and make sure it doesn't need any locks we could reintroduce it. For now I just mark the post-reset display state as dirty as possible to make sure we reinitialize everything.
One other potential issue remains in the form of display detection. Either we need to protect that with the same rw_semaphore as well, or perhaps it would be enough to force gmbus into bitbanging mode while the reset is happening and we don't have interrupts, and just across the actual hardware GPU reset we could hold the gmbus mutex.
v2: Keep intel_prepare/finish_reset() outside struct_mutex (Chris) v3: Drop all the committed_state refactoring to make this less obnoxious to backport (Daniel) v4: Preserve the wedge timeout mechanism (Chris)
Cc: stable@vger.kernel.org # for the brave Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Chris Wilson chris@chris-wilson.co.uk Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101597 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99093 Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101600 Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion") Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter") Signed-off-by: Ville Syrjälä ville.syrjala@linux.intel.com --- drivers/gpu/drm/i915/i915_drv.h | 2 + drivers/gpu/drm/i915/intel_display.c | 199 ++++++++++++++++++++++++----------- 2 files changed, 138 insertions(+), 63 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index effbe4f72a64..88ddd27f61b0 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -2237,6 +2237,8 @@ struct drm_i915_private { struct drm_atomic_state *modeset_restore_state; struct drm_modeset_acquire_ctx reset_ctx;
+ struct rw_semaphore commit_sem; + struct list_head vm_list; /* Global list of all address spaces */ struct i915_ggtt ggtt; /* VM representing the global address space */
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index 7cdd6ec97f80..f13c7d81d4a9 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -123,6 +123,7 @@ static void ironlake_pfit_enable(struct intel_crtc *crtc); static void intel_modeset_setup_hw_state(struct drm_device *dev, struct drm_modeset_acquire_ctx *ctx); static void intel_pre_disable_primary_noatomic(struct drm_crtc *crtc); +static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset);
struct intel_limit { struct { @@ -3491,27 +3492,85 @@ static bool gpu_reset_clobbers_display(struct drm_i915_private *dev_priv) INTEL_GEN(dev_priv) < 5 && !IS_G4X(dev_priv); }
-void intel_prepare_reset(struct drm_i915_private *dev_priv) +static void init_intel_state(struct intel_atomic_state *state) +{ + struct drm_crtc *crtc; + struct drm_crtc_state *old_crtc_state, *new_crtc_state; + int i; + + state->modeset = true; + + for_each_oldnew_crtc_in_state(&state->base, crtc, + old_crtc_state, new_crtc_state, i) { + if (new_crtc_state->active) + state->active_crtcs |= 1 << i; + else + state->active_crtcs &= ~(1 << i); + + if (old_crtc_state->active != new_crtc_state->active) + state->active_pipe_changes |= drm_crtc_mask(crtc); + } +} + +static struct drm_atomic_state * +intel_duplicate_committed_state(struct drm_i915_private *dev_priv) { - struct drm_device *dev = &dev_priv->drm; - struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx; struct drm_atomic_state *state; - int ret; + struct drm_crtc *crtc; + struct drm_crtc_state *crtc_state; + int i;
- /* - * Need mode_config.mutex so that we don't - * trample ongoing ->detect() and whatnot. - */ - mutex_lock(&dev->mode_config.mutex); - drm_modeset_acquire_init(ctx, 0); - while (1) { - ret = drm_modeset_lock_all_ctx(dev, ctx); - if (ret != -EDEADLK) - break; + state = drm_atomic_helper_duplicate_committed_state(&dev_priv->drm); + if (IS_ERR(state)) { + DRM_ERROR("Duplicating state failed with %ld\n", + PTR_ERR(state)); + return NULL; + } + + to_intel_atomic_state(state)->active_crtcs = 0; + to_intel_atomic_state(state)->cdclk.logical = dev_priv->cdclk.hw; + to_intel_atomic_state(state)->cdclk.actual = dev_priv->cdclk.hw; + + init_intel_state(to_intel_atomic_state(state)); + + /* force a full update */ + for_each_new_crtc_in_state(state, crtc, crtc_state, i) { + struct intel_crtc_state *intel_crtc_state = + to_intel_crtc_state(crtc_state); + + if (!crtc_state->active) + continue;
- drm_modeset_backoff(ctx); + crtc_state->mode_changed = true; + crtc_state->active_changed = true; + crtc_state->planes_changed = true; + crtc_state->connectors_changed = true; + crtc_state->color_mgmt_changed = true; + crtc_state->zpos_changed = true; + + intel_crtc_state->update_pipe = true; + intel_crtc_state->disable_lp_wm = true; + intel_crtc_state->disable_cxsr = true; + intel_crtc_state->update_wm_post = true; + intel_crtc_state->fb_changed = true; + intel_crtc_state->fifo_changed = true; + intel_crtc_state->wm.need_postvbl_update = true; }
+ return state; +} + +void intel_prepare_reset(struct drm_i915_private *dev_priv) +{ + struct drm_atomic_state *disable_state, *restore_state = NULL; + struct drm_crtc *crtc; + struct drm_crtc_state *crtc_state; + struct drm_plane *plane; + struct drm_plane_state *plane_state; + int i; + + down_write(&dev_priv->commit_sem); + /* reset doesn't touch the display, but flips might get nuked anyway, */ if (!i915.force_reset_modeset_test && !gpu_reset_clobbers_display(dev_priv)) @@ -3521,30 +3580,40 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv) * Disabling the crtcs gracefully seems nicer. Also the * g33 docs say we should at least disable all the planes. */ - state = drm_atomic_helper_duplicate_state(dev, ctx); - if (IS_ERR(state)) { - ret = PTR_ERR(state); - DRM_ERROR("Duplicating state failed with %i\n", ret); - return; - } + disable_state = intel_duplicate_committed_state(dev_priv); + if (IS_ERR(disable_state)) + goto out;
- ret = drm_atomic_helper_disable_all(dev, ctx); - if (ret) { - DRM_ERROR("Suspending crtc's failed with %i\n", ret); - drm_atomic_state_put(state); - return; - } + to_intel_atomic_state(disable_state)->active_crtcs = 0;
- dev_priv->modeset_restore_state = state; - state->acquire_ctx = ctx; + for_each_new_crtc_in_state(disable_state, crtc, crtc_state, i) + crtc_state->active = false; + for_each_new_plane_in_state(disable_state, plane, plane_state, i) + plane_state->visible = false; + + __intel_atomic_commit_tail(disable_state, true); + + drm_atomic_helper_clean_committed_state(disable_state); + drm_atomic_state_put(disable_state); + + restore_state = intel_duplicate_committed_state(dev_priv); + if (IS_ERR(restore_state)) + restore_state = NULL; + + for_each_old_crtc_in_state(restore_state, crtc, crtc_state, i) + crtc_state->active = false; + for_each_old_plane_in_state(restore_state, plane, plane_state, i) + plane_state->visible = false; + +out: + dev_priv->modeset_restore_state = restore_state; }
void intel_finish_reset(struct drm_i915_private *dev_priv) { struct drm_device *dev = &dev_priv->drm; - struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx; - struct drm_atomic_state *state = dev_priv->modeset_restore_state; - int ret; + struct drm_atomic_state *restore_state = + dev_priv->modeset_restore_state;
/* * Flips in the rings will be nuked by the reset, @@ -3557,7 +3626,7 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
/* reset doesn't touch the display */ if (!gpu_reset_clobbers_display(dev_priv)) { - if (!state) { + if (!restore_state) { /* * Flips in the rings have been nuked by the reset, * so update the base address of all primary @@ -3569,11 +3638,11 @@ void intel_finish_reset(struct drm_i915_private *dev_priv) */ intel_update_primary_planes(dev); } else { - ret = __intel_display_resume(dev, state, ctx); - if (ret) - DRM_ERROR("Restoring old state failed with %i\n", ret); + __intel_atomic_commit_tail(restore_state, true); } } else { + i915_redisable_vga(dev_priv); + /* * The display has been reset as well, * so need a full re-initialization. @@ -3589,18 +3658,17 @@ void intel_finish_reset(struct drm_i915_private *dev_priv) dev_priv->display.hpd_irq_setup(dev_priv); spin_unlock_irq(&dev_priv->irq_lock);
- ret = __intel_display_resume(dev, state, ctx); - if (ret) - DRM_ERROR("Restoring old state failed with %i\n", ret); + __intel_atomic_commit_tail(restore_state, true);
intel_hpd_init(dev_priv); }
- if (state) - drm_atomic_state_put(state); - drm_modeset_drop_locks(ctx); - drm_modeset_acquire_fini(ctx); - mutex_unlock(&dev->mode_config.mutex); + if (restore_state) { + drm_atomic_helper_clean_committed_state(restore_state); + drm_atomic_state_put(restore_state); + } + + up_write(&dev_priv->commit_sem); }
static bool abort_flip_on_reset(struct intel_crtc *crtc) @@ -12592,29 +12660,18 @@ static int intel_modeset_checks(struct drm_atomic_state *state) { struct intel_atomic_state *intel_state = to_intel_atomic_state(state); struct drm_i915_private *dev_priv = to_i915(state->dev); - struct drm_crtc *crtc; - struct drm_crtc_state *old_crtc_state, *new_crtc_state; - int ret = 0, i; + int ret = 0;
if (!check_digital_port_conflicts(state)) { DRM_DEBUG_KMS("rejecting conflicting digital port configuration\n"); return -EINVAL; }
- intel_state->modeset = true; intel_state->active_crtcs = dev_priv->active_crtcs; intel_state->cdclk.logical = dev_priv->cdclk.logical; intel_state->cdclk.actual = dev_priv->cdclk.actual;
- for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_crtc_state, i) { - if (new_crtc_state->active) - intel_state->active_crtcs |= 1 << i; - else - intel_state->active_crtcs &= ~(1 << i); - - if (old_crtc_state->active != new_crtc_state->active) - intel_state->active_pipe_changes |= drm_crtc_mask(crtc); - } + init_intel_state(intel_state);
/* * See if the config requires any additional preparation, e.g. @@ -13004,7 +13061,7 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work) intel_atomic_helper_free_state(dev_priv); }
-static void __intel_atomic_commit_tail(struct drm_atomic_state *state) +static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset) { struct drm_device *dev = state->dev; struct intel_atomic_state *intel_state = to_intel_atomic_state(state); @@ -13068,10 +13125,18 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
/* Only after disabling all output pipelines that will be changed can we * update the the output configuration. */ - intel_modeset_update_crtc_state(state); + if (!is_reset) + intel_modeset_update_crtc_state(state);
if (intel_state->modeset) { - drm_atomic_helper_update_legacy_modeset_state(state->dev, state); + if (!is_reset) { + drm_atomic_helper_update_legacy_modeset_state(state->dev, state); + } else { + for_each_new_crtc_in_state(state, crtc, new_crtc_state, i) { + if (new_crtc_state->enable) + drm_calc_timestamping_constants(crtc, &new_crtc_state->adjusted_mode); + } + }
intel_set_cdclk(dev_priv, &dev_priv->cdclk.actual);
@@ -13082,7 +13147,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state) if (!intel_can_enable_sagv(state)) intel_disable_sagv(dev_priv);
- intel_modeset_verify_disabled(dev, state); + if (!is_reset) + intel_modeset_verify_disabled(dev, state); }
/* Complete the events for pipes that have now been disabled */ @@ -13135,7 +13201,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state) if (put_domains[i]) modeset_put_power_domains(dev_priv, put_domains[i]);
- intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state); + if (!is_reset) + intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state); }
if (intel_state->modeset && intel_can_enable_sagv(state)) @@ -13160,10 +13227,14 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
drm_atomic_helper_wait_for_dependencies(state);
- __intel_atomic_commit_tail(state); + down_read(&dev_priv->commit_sem); + + __intel_atomic_commit_tail(state, false);
drm_atomic_helper_commit_hw_done(state);
+ up_read(&dev_priv->commit_sem); + mutex_lock(&dev->struct_mutex); drm_atomic_helper_cleanup_planes(dev, state); mutex_unlock(&dev->struct_mutex); @@ -15022,6 +15093,8 @@ int intel_modeset_init(struct drm_device *dev) INIT_WORK(&dev_priv->atomic_helper.free_work, intel_atomic_helper_free_state_worker);
+ init_rwsem(&dev_priv->commit_sem); + intel_init_quirks(dev);
intel_init_pm(dev_priv);
Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
From: Ville Syrjälä ville.syrjala@linux.intel.com
Introduce an rw_semaphore to protect the display commits. All normal commits use down_read() and hence can proceed in parallel, but GPU reset will use down_write() making sure no other commits are in progress when we have to pull the plug on the display engine on pre-g4x platforms. There are no modeset/gem locks taken inside __intel_atomic_commit_tail() itself, and we wait for all dependencies before the down_read(), and thus there is no chance of deadlocks with this scheme.
This matches what I thought should be done (I didn't think of using rwsem just a mutex, nice touch). The point I got stuck on was what should be done after the reset? I expected another modeset to return the state back or otherwise the inflight would get confused?
During reset we should be recommiting the state that was committed last. But for now we'll settle for recommiting the last state for each object.
Ah, I guess that explains the above. What is the complication with restoring the current state as opposed to the next state?
But I have to leave debating the merits of atomic internals to others. :| -Chris
On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
From: Ville Syrjälä ville.syrjala@linux.intel.com
Introduce an rw_semaphore to protect the display commits. All normal commits use down_read() and hence can proceed in parallel, but GPU reset will use down_write() making sure no other commits are in progress when we have to pull the plug on the display engine on pre-g4x platforms. There are no modeset/gem locks taken inside __intel_atomic_commit_tail() itself, and we wait for all dependencies before the down_read(), and thus there is no chance of deadlocks with this scheme.
This matches what I thought should be done (I didn't think of using rwsem just a mutex, nice touch). The point I got stuck on was what should be done after the reset? I expected another modeset to return the state back or otherwise the inflight would get confused?
I guess that can happen. For instance, if we have a crtc_enable() in flight, and then we do a reset before it gets committed we would end up doing crtc_enable() twice in a row without a crtc_disable in between. For page flips and such this shouldn't be a big deal in general.
During reset we should be recommitting the state that was committed last. But for now we'll settle for recommitting the last state for each object.
Ah, I guess that explains the above. What is the complication with restoring the current state as opposed to the next state?
Well the main thing is just tracking which is the current state. That just needs refactoring the .atomic_duplicate_state() calling convention across the whole tree so that we can then duplicate the committed state rather than the latest state.
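Purely as an illustration of that calling convention change (the two-argument form and the committed_state pointer below are hypothetical, they do not exist in the tree today):

/* Current hook: the driver duplicates whatever crtc->state points at. */
struct drm_crtc_state *(*atomic_duplicate_state)(struct drm_crtc *crtc);

/* A refactored hook would take the state to copy explicitly, so the
 * caller could hand in either the latest state or a separately tracked
 * committed state (both the signature and committed_state are made up): */
struct drm_crtc_state *(*atomic_duplicate_state)(struct drm_crtc *crtc,
						 const struct drm_crtc_state *state);

/* ...which would allow something like: */
duplicated = crtc->funcs->atomic_duplicate_state(crtc, crtc->committed_state);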
Also due to the commit_hw_done() being potentially done after the modeset locks have been dropped, I don't think we can be certain of it getting called in the same order as swap_state(), hence when we track the committed state in commit_hw_done() we'll have to have some way to figure out if our new state is in fact the latest committed state for each object or if the calls got reordered. We don't insert any dependencies between two commits unless they touch the same active crtc, thus this reordering seems very much possible. Dunno if we should add some way to add such dependencies whenever the same object is part of two otherwise independent commits, or if we just want to try and work with the reordered calls. My idea for the latter was some kind of seqno/age counter on the object states that allows me to recognize which state is more recent. The object states aren't refcounted so hanging on to the wrong pointer could cause an oops the next time we have to perform a GPU reset.
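For illustration only, the seqno/age idea could look something like this (the tracking struct, the seqno and the helper are all made up for this sketch, none of it exists in the code):

/* Hypothetical per-object bookkeeping, updated from commit_hw_done().
 * swap_state() would stamp each swapped-in state with a monotonically
 * increasing seqno; hw_done() then only adopts a state as "committed"
 * if it is newer than what is already recorded, so reordered
 * commit_hw_done() calls cannot leave us pointing at a stale state. */
struct committed_tracking {
	struct drm_crtc_state *committed_state;
	u64 committed_seqno;
};

static void note_committed_state(struct committed_tracking *t,
				 struct drm_crtc_state *state,
				 u64 seqno)
{
	if (seqno <= t->committed_seqno)
		return; /* a newer state has already been committed */

	t->committed_state = state;
	t->committed_seqno = seqno;
}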
On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrjälä wrote:
On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
From: Ville Syrjälä ville.syrjala@linux.intel.com
Introduce an rw_semaphore to protect the display commits. All normal commits use down_read() and hence can proceed in parallel, but GPU reset will use down_write() making sure no other commits are in progress when we have to pull the plug on the display engine on pre-g4x platforms. There are no modeset/gem locks taken inside __intel_atomic_commit_tail() itself, and we wait for all dependencies before the down_read(), and thus there is no chance of deadlocks with this scheme.
This matches what I thought should be done (I didn't think of using rwsem just a mutex, nice touch). The point I got stuck on was what should be done after the reset? I expected another modeset to return the state back or otherwise the inflight would get confused?
I guess that can happen. For instance, if we have a crtc_enable() in flight, and then we do a reset before it gets committed we would end up doing crtc_enable() twice in a row without a crtc_disable in between. For page flips and such this shouldn't be a big deal in general.
atomic commits are ordered. You have to wait for the previous ones to complete before you do a new one. If you don't do that, then all hell breaks loose.
What you really can't do with atomic (without rewriting everything once more) is cancel a commit. Pre-atomic we could do that on gen4 since the mmio flips died with the gpu, but that's the one design change we need to cope with (plus TDR insisting it can't force-complete requests anymore).
During reset we should be recommitting the state that was committed last. But for now we'll settle for recommitting the last state for each object.
Ah, I guess that explains the above. What is the complication with restoring the current state as opposed to the next state?
Well the main thing is just tracking which is the current state. That just needs refactoring the .atomic_duplicate_state() calling convention across the whole tree so that we can then duplicate the committed state rather than the latest state.
Also due to the commit_hw_done() being potentially done after the modeset locks have been dropped, I don't think we can be certain of it getting called in the same order as swap_state(), hence when we track the committed state in commit_hw_done() we'll have to have some way to figure out if our new state is in fact the latest committed state for each object or if the calls got reordered. We don't insert any dependencies between two commits unless they touch the same active crtc, thus this reordering seems very much possible. Dunno if we should add some way to add such dependencies whenever the same object is part of two otherwise independent commits, or if we just want to try and work with the reordered calls. My idea for the latter was some kind of seqno/age counter on the object states that allows me to recognize which state is more recent. The object states aren't refcounted so hanging on to the wrong pointer could cause an oops the next time we have to perform a GPU reset.
Atomic commits are strongly ordered on a given CRTC, so stuff can't be out-of-order within one. Across them the idea was to just add more CRTC states into the atomic commit to make sure stuff is ordered correctly.
Laurent just pointed out that for switching planes that doesn't happen atm, but for i915 we should be safe (I guess I thought too much about i915 when typing the commit tracking code). -Daniel
On Fri, Jun 30, 2017 at 03:35:03PM +0200, Daniel Vetter wrote:
On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrjälä wrote:
On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
From: Ville Syrjälä ville.syrjala@linux.intel.com
Introduce an rw_semaphore to protect the display commits. All normal commits use down_read() and hence can proceed in parallel, but GPU reset will use down_write() making sure no other commits are in progress when we have to pull the plug on the display engine on pre-g4x platforms. There are no modeset/gem locks taken inside __intel_atomic_commit_tail() itself, and we wait for all dependencies before the down_read(), and thus there is no chance of deadlocks with this scheme.
This matches what I thought should be done (I didn't think of using rwsem just a mutex, nice touch). The point I got stuck on was what should be done after the reset? I expected another modeset to return the state back or otherwise the inflight would get confused?
I guess that can happen. For instance, if we have a crtc_enable() in flight, and then we do a reset before it gets committed we would end up doing crtc_enable() twice in a row without a crtc_disable in between. For page flips and such this shouldn't be a big deal in general.
atomic commits are ordered. You have to wait for the previous ones to complete before you do a new one. If you don't do that, then all hell breaks loose.
What we're effectively doing here is inserting two new commits in the middle of the timeline, one to disable everything, and another one to re-enable everything. At the end of the re-enable we would want the hardware state to match exactly what was there before the disable, hence any commits still in the timeline should work correctly. That is, their old_state will match the hardware state when it comes time to commit them.
But we can only do that properly after we start to track the committed state. Without that tracking we can get into trouble wrt. the hardware state not matching the old state when it comes time to commit the new state.
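Put as code, those two extra commits amount to roughly the following (a compressed view of what intel_prepare_reset()/intel_finish_reset() do later in this series; error handling, cleanup and the force-full-update flags are left out):

static void gpu_reset_display_dance(struct drm_i915_private *dev_priv)
{
	struct drm_atomic_state *disable_state, *restore_state;

	down_write(&dev_priv->commit_sem);

	/* Extra commit #1: turn everything off, starting from the last
	 * committed state so software and hardware agree. */
	disable_state = drm_atomic_helper_duplicate_committed_state(&dev_priv->drm);
	/* ...mark all crtcs inactive and all planes invisible... */
	__intel_atomic_commit_tail(disable_state, true);

	/* ...the actual GPU reset happens in between... */

	/* Extra commit #2: bring back exactly what was there before. */
	restore_state = drm_atomic_helper_duplicate_committed_state(&dev_priv->drm);
	__intel_atomic_commit_tail(restore_state, true);

	up_write(&dev_priv->commit_sem);
}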
What you really can't do with atomic (without rewriting everything once more) is cancel a commit. Pre-atomic we could do that on gen4 since the mmio flips died with the gpu, but that's the one design change we need to cope with (plus TDR insisting it can't force-complete requests anymore).
During reset we should be recommitting the state that was committed last. But for now we'll settle for recommitting the last state for each object.
Ah, I guess that explains the above. What is the complication with restoring the current state as opposed to the next state?
Well the main thing is just tracking which is the current state. That just needs refactoring the .atomic_duplicate_state() calling convention across the whole tree so that we can then duplicate the committed state rather than the latest state.
Also due to the commit_hw_done() being potentially done after the modeset locks have been dropped, I don't think we can be certain of it getting called in the same order as swap_state(), hence when we track the committed state in commit_hw_done() we'll have to have some way to figure out if our new state is in fact the latest committed state for each object or if the calls got reordered. We don't insert any dependencies between two commits unless they touch the same active crtc, thus this reordering seems very much possible. Dunno if we should add some way to add such dependencies whenever the same object is part of two otherwise independent commits, or if we just want to try and work with the reordered calls. My idea for the latter was some kind of seqno/age counter on the object states that allows me to recognize which state is more recent. The object states aren't refcounted so hanging on to the wrong pointer could cause an oops the next time we have to perform a GPU reset.
Atomic commits are strongly ordered on a given CRTC, so stuff can't be out-of-order within one. Across them the idea was to just add more CRTC states into the atomic commit to make sure stuff is ordered correctly.
And atomic commits not touching the same crtc will not be ordered in any way. Thus if they touch the same object (eg. disabled plane or connector) we can't currently tell if the commit_hw_done() calls happened in the same order as the swap_state() calls for that particular object.
On Fri, Jun 30, 2017 at 04:53:19PM +0300, Ville Syrjälä wrote:
On Fri, Jun 30, 2017 at 03:35:03PM +0200, Daniel Vetter wrote:
On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrjälä wrote:
On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
From: Ville Syrjälä ville.syrjala@linux.intel.com
Introduce an rw_semaphore to protect the display commits. All normal commits use down_read() and hence can proceed in parallel, but GPU reset will use down_write() making sure no other commits are in progress when we have to pull the plug on the display engine on pre-g4x platforms. There are no modeset/gem locks taken inside __intel_atomic_commit_tail() itself, and we wait for all dependencies before the down_read(), and thus there is no chance of deadlocks with this scheme.
This matches what I thought should be done (I didn't think of using rwsem just a mutex, nice touch). The point I got stuck on was what should be done after the reset? I expected another modeset to return the state back or otherwise the inflight would get confused?
I guess that can happen. For instance, if we have a crtc_enable() in flight, and then we do a reset before it gets committed we would end up doing crtc_enable() twice in a row without a crtc_disable in between. For page flips and such this shouldn't be a big deal in general.
atomic commits are ordered. You have to wait for the previous ones to complete before you do a new one. If you don't do that, then all hell breaks loose.
What we're effectively doing here is inserting two new commits in the middle of the timeline, one to disable everything, and another one to re-enable everything. At the end of the re-enable we would want the hardware state to match exactly what was there before the disable, hence any commits still in the timeline should work correctly. That is, their old_state will match the hardware state when it comes time to commit them.
But we can only do that properly after we start to track the committed state. Without that tracking we can get into trouble wrt. the hardware state not matching the old state when it comes time to commit the new state.
Yeah, but my point is that this here is an extremely fancy and fragile (and afaics in this form, incomplete) fix for something that in the past was fixed much, much simpler. Why do we need this massive amount of complexity now? Who's asking for all this (we don't even have anyone yet asking for fully queued atomic commits, much less on gen4)?
I mean rewriting the entire modeset code is fun and all, but for gen3-4?
What you really can't do with atomic (without rewriting everything once more) is cancel a commit. Pre-atomic we could do that on gen4 since the mmio flips died with the gpu, but that's the one design change we need to cope with (plus TDR insisting it can't force-complete requests anymore).
During reset we should be recommitting the state that was committed last. But for now we'll settle for recommitting the last state for each object.
Ah, I guess that explains the above. What is the complication with restoring the current state as opposed to the next state?
Well the main thing is just tracking which is the current state. That just needs refactoring the .atomic_duplicate_state() calling convention across the whole tree so that we can then duplicate the committed state rather than the latest state.
Also due to the commit_hw_done() being potentially done after the modeset locks have been dropped, I don't think we can be certain of it getting called in the same order as swap_state(), hence when we track the committed state in commit_hw_done() we'll have to have some way to figure out if our new state is in fact the latest committed state for each object or if the calls got reordered. We don't insert any dependencies between two commits unless they touch the same active crtc, thus this reordering seems very much possible. Dunno if we should add some way to add such dependencies whenever the same object is part of two otherwise independent commits, or if we just want to try and work with the reordered calls. My idea for the latter was some kind of seqno/age counter on the object states that allows me to recognize which state is more recent. The object states aren't refcounted so hanging on to the wrong pointer could cause an oops the next time we have to perform a GPU reset.
Atomic commits are strongly ordered on a given CRTC, so stuff can't be out-of-order within one. Across them the idea was to just add more CRTC states into the atomic commit to make sure stuff is ordered correctly.
And atomic commits not touching the same crtc will not be ordered in any way. Thus if they touch the same object (eg. disabled plane or connector) we can't currently tell if the commit_hw_done() calls happened in the same order as the swap_state() calls for that particular object.
Yes, but I think for i915 it's ok, because we don't have planes that move around (not supported at least), and for other shared stuff we just grab them all. It is a bug in general though, and for i915 it would probably be best if we'd add the various drm_crtc_commit waits to the i915_sw_fence. -Daniel
Quoting Daniel Vetter (2017-06-30 16:30:59)
On Fri, Jun 30, 2017 at 04:53:19PM +0300, Ville Syrjälä wrote:
On Fri, Jun 30, 2017 at 03:35:03PM +0200, Daniel Vetter wrote:
On Thu, Jun 29, 2017 at 10:26:08PM +0300, Ville Syrjälä wrote:
On Thu, Jun 29, 2017 at 06:57:30PM +0100, Chris Wilson wrote:
Quoting ville.syrjala@linux.intel.com (2017-06-29 15:36:42)
From: Ville Syrjälä ville.syrjala@linux.intel.com
Introduce an rw_semaphore to protect the display commits. All normal commits use down_read() and hence can proceed in parallel, but GPU reset will use down_write() making sure no other commits are in progress when we have to pull the plug on the display engine on pre-g4x platforms. There are no modeset/gem locks taken inside __intel_atomic_commit_tail() itself, and we wait for all dependencies before the down_read(), and thus there is no chance of deadlocks with this scheme.
This matches what I thought should be done (I didn't think of using rwsem just a mutex, nice touch). The point I got stuck on was what should be done after the reset? I expected another modeset to return the state back or otherwise the inflight would get confused?
I guess that can happen. For instance, if we have a crtc_enable() in flight, and then we do a reset before it gets committed we would end up doing crtc_enable() twice in a row without a crtc_disable in between. For page flips and such this shouldn't be a big deal in general.
atomic commits are ordered. You have to wait for the previous ones to complete before you do a new one. If you don't do that, then all hell breaks loose.
What we're effectively doing here is inserting two new commits in the middle of the timeline, one to disable everything, and another one to re-enable everything. At the end of the re-enable we would want the hardware state to match exactly what was there before the disable, hence any commits still in the timeline should work correctly. That is, their old_state will match the hardware state when it comes time to commit them.
But we can only do that properly after we start to track the committed state. Without that tracking we can get into trouble wrt. the hardware state not matching the old state when it comes time to commit the new state.
Yeah, but my point is that this here is an extremely fancy and fragile (and afaics in this form, incomplete) fix for something that in the past was fixed much, much simpler. Why do we need this massive amount of complexity now? Who's asking for all this (we don't even have anyone yet asking for fully queued atomic commits, much less on gen4)?
It was never "fixed", it was always borked; broken by design. -Chris
On Fri, Jun 30, 2017 at 5:39 PM, Chris Wilson chris@chris-wilson.co.uk wrote:
Yeah, but my point is that this here is an extremely fancy and fragile (and afaics in this form, incomplete) fix for something that in the past was fixed much, much simpler. Why do we need this massive amount of complexity now? Who's asking for all this (we don't even have anyone yet asking for fully queued atomic commits, much less on gen4)?
It was never "fixed", it was always borked; broken by design.
Hm, what was broken by design in gen3/4 reset? We never bothered to resubmit rendering when the gpu died, but besides that I'm not aware of a design issue in that logic. We nuked in-flight pageflips (and restored those), and we stalled for any pending modesets (grabbing locks did that since all modesets were blocking), and that made sure the hw was in a consistent state.
We always leaked the vblank state to userspace, but this approach here also doesn't fix this. Plus broken rendering, but for these old platforms I'm not too worried about displaying a few wrong frames (with the new reset we will resubmit, so proper rendering should show up soonish) - after all gpu reset nukes the entire display, there's no way for the user to not notice that.
It would be neat to not have to do that, and Ville has a plan, but meanwhile we still have this regression at hand that seems to be the blocker for adding more machines to CI. I'd like to have the least complex path to get that addressed (but maybe not long-term fixed, I'm clear on that). If feasible.
If that's a unicorn, then let's go with Ville's approach, but then I think we need the full thing with the races properly closed. -Daniel
Quoting Daniel Vetter (2017-07-03 09:03:36)
On Fri, Jun 30, 2017 at 5:39 PM, Chris Wilson chris@chris-wilson.co.uk wrote:
Yeah, but my point is that this here is an extremely fancy and fragile (and afaics in this form, incomplete) fix for something that in the past was fixed much, much simpler. Why do we need this massive amount of complexity now? Who's asking for all this (we don't even have anyone yet asking for fully queued atomic commits, much less on gen4)?
It was never "fixed", it was always borked; broken by design.
Hm, what was broken by design in gen3/4 reset? We never bothered to resubmit rendering when the gpu died, but besides that I'm not aware of a design issue in that logic. We nuked in-flight pageflips (and restored those), and we stalled for any pending modesets (grabbing locks did that since all modesets were blocking), and that made sure the hw was in a consistent state.
KMS reset was taking mutexes within reset without any means of breaking the inherent deadlock, instead relying on something else to magically fix it. We only ever engineered around struct_mutex for reset; the remaining deadlock on struct_mutex within modeset is an issue, but not the one causing trouble here.
Please do note the bugs that indicate that even without modeset reset, hw is not in a consistent state. -Chris
On Mon, Jul 03, 2017 at 09:16:54AM +0100, Chris Wilson wrote:
Quoting Daniel Vetter (2017-07-03 09:03:36)
On Fri, Jun 30, 2017 at 5:39 PM, Chris Wilson chris@chris-wilson.co.uk wrote:
Yeah, but my point is that this here is an extremely fancy and fragile (and afaics in this form, incomplete) fix for something that in the past was fixed much, much simpler. Why do we need this massive amount of complexity now? Who's asking for all this (we don't even have anyone yet asking for fully queued atomic commits, much less on gen4)?
It was never "fixed", it was always borked; broken by design.
Hm, what was broken by design in gen3/4 reset? We never bothered to resubmit rendering when the gpu died, but besides that I'm not aware of a design issue in that logic. We nuked in-flight pageflips (and restored those), and we stalled for any pending modesets (grabbing locks did that since all modesets were blocking), and that made sure the hw was in a consistent state.
KMS reset was taking mutexes within reset without any means of breaking the inherent deadlock, instead relying on something else to magically fix it. We only ever engineered around struct_mutex for reset; the remaining deadlock on struct_mutex within modeset is an issue, but not the one causing trouble here.
So on the kms side we've had:
- grab kms locks -> grab struct_mutex -> wait for rendering
- on the reset side we've had the same order afair, with the caveat that we broke the "wait for rendering" (forced it to complete) before trying to grab any of the locks in the reset path.
I thought that the problem is that gpu reset stopped force-completing everything asap, which then leads the kms side to time out at a point where it shouldn't and start to fall over.
So before
commit 221fe7994554cc3985fc5d761ed7e44dcae0fa52 Author: Chris Wilson chris@chris-wilson.co.uk Date: Fri Sep 9 14:11:51 2016 +0100
drm/i915: Perform a direct reset of the GPU from the waiter
what was the deadlock we've had? Besides breaking the "wait for rendering" I don't remember any inversions we've had. And for the breaking we've had a fairly complex dance of barriers and reset_in_progress and waking up waiters to make sure it would catch them all ...
Please do note the bugs that indicate that even without modeset reset, hw is not in a consistent state.
So there's more bugs, not sure how that is relevant for gpu reset? -Daniel
On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
From: Ville Syrjälä ville.syrjala@linux.intel.com
Introduce an rw_semaphore to protect the display commits. All normal commits use down_read() and hence can proceed in parallel, but GPU reset will use down_write() making sure no other commits are in progress when we have to pull the plug on the display engine on pre-g4x platforms. There are no modeset/gem locks taken inside __intel_atomic_commit_tail() itself, and we wait for all dependencies before the down_read(), and thus there is no chance of deadlocks with this scheme.
How does this solve the deadlock? Afaiui the deadlock is that the gpu reset stopped unconditionally completing all requests before it did anything else, including anything with the hw or taking modeset locks.
This ensured that any outstanding flips (we only had pageflips, no atomic ones back then) could complete (although maybe displaying garbage). The only thing we had to do was grab the locks (to avoid concurrent modesets) and then we could happily nuke the hw (since the flips were lost in the CS anyway), and restore it afterwards.
Then the TDR rewrite came around and broke that, which now makes atomic time out waiting for the gpu to complete (because the gpu reset waits for the modeset to complete first). Which means if you want to fix this without breaking TDR, then you need to cancel the pending atomic commits. That seems somewhat what you're attempting here with trying to figure out what the latest hw-committed step is (but that function gets it wrong), and lots more locking and stuff on top.
Why exactly can't we just go back to the old world of force-completing dead requests on gen4 and earlier? That would be tons simpler imo instead of throwing a pile of hacks (which really needs a complete rewrite of the atomic code in i915) in as a work-around. We didn't have TDR on gen4 and earlier for years, going back to that isn't going to hurt anyone.
Making working gen4 gpu reset contingent on cancellable atomic commits and the full commit queue is imo nuts. -Daniel
During reset we should be recommitting the state that was committed last. But for now we'll settle for recommitting the last state for each object. Hence we may commit things a bit too soon when a GPU reset occurs. The rw_semaphore should guarantee that whatever state we observe in obj->state during reset sticks around while we do the commit. The obj->state pointer might change for some objects if another swap_state() occurs while we do our thing, so there might be some theoretical mismatch which I think we could avoid by grabbing the rw_semaphore also around the swap_state(), but for now I didn't do that.
Another source of mismatch can happen because we sometimes use the intel_crtc->config during the actual commit, and that only gets updated when we do the commit. Hence we may get some state via ->state, some via the duplicated ->state, and some via ->config. We'll want to fix this all by tracking the committed state properly, but that will need some bigger refactoring, so for now we'll just choose to accept these potential mismatches.
I left out the state readout from the post-reset display reinitialization as that still likes to clobber crtc->state etc. If we make it use a free-standing atomic state and make sure it doesn't need any locks we could reintroduce it. For now I just mark the post-reset display state as dirty as possible to make sure we reinitialize everything.
One other potential issue remains in the form of display detection. Either we need to protect that with the same rw_semaphore as well, or perhaps it would be enough to force gmbus into bitbanging mode while the reset is happening and we don't have interrupts, and just across the actual hardware GPU reset we could hold the gmbus mutex.
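A sketch of the first option (the wrapper name is made up; the only part taken from the paragraph above is the idea of holding commit_sem for reading across ->detect()):

static enum drm_connector_status
intel_detect_under_commit_sem(struct drm_connector *connector, bool force)
{
	struct drm_i915_private *dev_priv = to_i915(connector->dev);
	enum drm_connector_status status;

	/* Keep GPU reset (which takes commit_sem for writing) out while
	 * detection pokes at gmbus/aux. */
	down_read(&dev_priv->commit_sem);
	status = connector->funcs->detect(connector, force);
	up_read(&dev_priv->commit_sem);

	return status;
}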
v2: Keep intel_prepare/finish_reset() outside struct_mutex (Chris)
v3: Drop all the committed_state refactoring to make this less obnoxious to backport (Daniel)
# for the brave
Cc: stable@vger.kernel.org # for the brave
Cc: Daniel Vetter daniel.vetter@ffwll.ch
Cc: Chris Wilson chris@chris-wilson.co.uk
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101597
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99093
Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101600
Fixes: 4680816be336 ("drm/i915: Wait first for submission, before waiting for request completion")
Fixes: 221fe7994554 ("drm/i915: Perform a direct reset of the GPU from the waiter")
Signed-off-by: Ville Syrjälä ville.syrjala@linux.intel.com
 drivers/gpu/drm/i915/i915_drv.h      |   2 +
 drivers/gpu/drm/i915/i915_irq.c      |  44 +-------
 drivers/gpu/drm/i915/intel_display.c | 199 ++++++++++++++++++++++++-----------
 3 files changed, 139 insertions(+), 106 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h index effbe4f72a64..88ddd27f61b0 100644 --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -2237,6 +2237,8 @@ struct drm_i915_private { struct drm_atomic_state *modeset_restore_state; struct drm_modeset_acquire_ctx reset_ctx;
- struct rw_semaphore commit_sem;
- struct list_head vm_list; /* Global list of all address spaces */ struct i915_ggtt ggtt; /* VM representing the global address space */
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c index eb4f1dca2077..9d591f17fda3 100644 --- a/drivers/gpu/drm/i915/i915_irq.c +++ b/drivers/gpu/drm/i915/i915_irq.c @@ -2587,46 +2587,6 @@ static irqreturn_t gen8_irq_handler(int irq, void *arg) return ret; }
-struct wedge_me {
- struct delayed_work work;
- struct drm_i915_private *i915;
- const char *name;
-};
-static void wedge_me(struct work_struct *work) -{
- struct wedge_me *w = container_of(work, typeof(*w), work.work);
- dev_err(w->i915->drm.dev,
"%s timed out, cancelling all in-flight rendering.\n",
w->name);
- i915_gem_set_wedged(w->i915);
-}
-static void __init_wedge(struct wedge_me *w,
struct drm_i915_private *i915,
long timeout,
const char *name)
-{
- w->i915 = i915;
- w->name = name;
- INIT_DELAYED_WORK_ONSTACK(&w->work, wedge_me);
- schedule_delayed_work(&w->work, timeout);
-}
-static void __fini_wedge(struct wedge_me *w) -{
- cancel_delayed_work_sync(&w->work);
- destroy_delayed_work_on_stack(&w->work);
- w->i915 = NULL;
-}
-#define i915_wedge_on_timeout(W, DEV, TIMEOUT) \
- for (__init_wedge((W), (DEV), (TIMEOUT), __func__); \
(W)->i915; \
__fini_wedge((W)))
/**
- i915_reset_device - do process context error handling work
- @dev_priv: i915 device private
@@ -2640,15 +2600,13 @@ static void i915_reset_device(struct drm_i915_private *dev_priv) char *error_event[] = { I915_ERROR_UEVENT "=1", NULL }; char *reset_event[] = { I915_RESET_UEVENT "=1", NULL }; char *reset_done_event[] = { I915_ERROR_UEVENT "=0", NULL };
struct wedge_me w;
kobject_uevent_env(kobj, KOBJ_CHANGE, error_event);
DRM_DEBUG_DRIVER("resetting chip\n"); kobject_uevent_env(kobj, KOBJ_CHANGE, reset_event);
/* Use a watchdog to ensure that our reset completes */
i915_wedge_on_timeout(&w, dev_priv, 5*HZ) {
{ intel_prepare_reset(dev_priv);
/* Signal that locked waiters should reset the GPU */
diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index 7cdd6ec97f80..f13c7d81d4a9 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -123,6 +123,7 @@ static void ironlake_pfit_enable(struct intel_crtc *crtc); static void intel_modeset_setup_hw_state(struct drm_device *dev, struct drm_modeset_acquire_ctx *ctx); static void intel_pre_disable_primary_noatomic(struct drm_crtc *crtc); +static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset);
struct intel_limit { struct { @@ -3491,27 +3492,85 @@ static bool gpu_reset_clobbers_display(struct drm_i915_private *dev_priv) INTEL_GEN(dev_priv) < 5 && !IS_G4X(dev_priv); }
-void intel_prepare_reset(struct drm_i915_private *dev_priv) +static void init_intel_state(struct intel_atomic_state *state) +{
- struct drm_crtc *crtc;
- struct drm_crtc_state *old_crtc_state, *new_crtc_state;
- int i;
- state->modeset = true;
- for_each_oldnew_crtc_in_state(&state->base, crtc,
old_crtc_state, new_crtc_state, i) {
if (new_crtc_state->active)
state->active_crtcs |= 1 << i;
else
state->active_crtcs &= ~(1 << i);
if (old_crtc_state->active != new_crtc_state->active)
state->active_pipe_changes |= drm_crtc_mask(crtc);
- }
+}
+static struct drm_atomic_state * +intel_duplicate_committed_state(struct drm_i915_private *dev_priv) {
- struct drm_device *dev = &dev_priv->drm;
- struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx; struct drm_atomic_state *state;
- int ret;
- struct drm_crtc *crtc;
- struct drm_crtc_state *crtc_state;
- int i;
- /*
* Need mode_config.mutex so that we don't
* trample ongoing ->detect() and whatnot.
*/
- mutex_lock(&dev->mode_config.mutex);
- drm_modeset_acquire_init(ctx, 0);
- while (1) {
ret = drm_modeset_lock_all_ctx(dev, ctx);
if (ret != -EDEADLK)
break;
- state = drm_atomic_helper_duplicate_committed_state(&dev_priv->drm);
- if (IS_ERR(state)) {
DRM_ERROR("Duplicating state failed with %ld\n",
PTR_ERR(state));
return NULL;
- }
- to_intel_atomic_state(state)->active_crtcs = 0;
- to_intel_atomic_state(state)->cdclk.logical = dev_priv->cdclk.hw;
- to_intel_atomic_state(state)->cdclk.actual = dev_priv->cdclk.hw;
- init_intel_state(to_intel_atomic_state(state));
- /* force a full update */
- for_each_new_crtc_in_state(state, crtc, crtc_state, i) {
struct intel_crtc_state *intel_crtc_state =
to_intel_crtc_state(crtc_state);
if (!crtc_state->active)
continue;
drm_modeset_backoff(ctx);
crtc_state->mode_changed = true;
crtc_state->active_changed = true;
crtc_state->planes_changed = true;
crtc_state->connectors_changed = true;
crtc_state->color_mgmt_changed = true;
crtc_state->zpos_changed = true;
intel_crtc_state->update_pipe = true;
intel_crtc_state->disable_lp_wm = true;
intel_crtc_state->disable_cxsr = true;
intel_crtc_state->update_wm_post = true;
intel_crtc_state->fb_changed = true;
intel_crtc_state->fifo_changed = true;
intel_crtc_state->wm.need_postvbl_update = true;
}
return state;
+}
+void intel_prepare_reset(struct drm_i915_private *dev_priv) +{
- struct drm_atomic_state *disable_state, *restore_state = NULL;
- struct drm_crtc *crtc;
- struct drm_crtc_state *crtc_state;
- struct drm_plane *plane;
- struct drm_plane_state *plane_state;
- int i;
- down_write(&dev_priv->commit_sem);
- /* reset doesn't touch the display, but flips might get nuked anyway, */ if (!i915.force_reset_modeset_test && !gpu_reset_clobbers_display(dev_priv))
@@ -3521,30 +3580,40 @@ void intel_prepare_reset(struct drm_i915_private *dev_priv) * Disabling the crtcs gracefully seems nicer. Also the * g33 docs say we should at least disable all the planes. */
- state = drm_atomic_helper_duplicate_state(dev, ctx);
- if (IS_ERR(state)) {
ret = PTR_ERR(state);
DRM_ERROR("Duplicating state failed with %i\n", ret);
return;
- }
- disable_state = intel_duplicate_committed_state(dev_priv);
- if (IS_ERR(disable_state))
goto out;
- ret = drm_atomic_helper_disable_all(dev, ctx);
- if (ret) {
DRM_ERROR("Suspending crtc's failed with %i\n", ret);
drm_atomic_state_put(state);
return;
- }
- to_intel_atomic_state(disable_state)->active_crtcs = 0;
- dev_priv->modeset_restore_state = state;
- state->acquire_ctx = ctx;
- for_each_new_crtc_in_state(disable_state, crtc, crtc_state, i)
crtc_state->active = false;
- for_each_new_plane_in_state(disable_state, plane, plane_state, i)
plane_state->visible = false;
- __intel_atomic_commit_tail(disable_state, true);
- drm_atomic_helper_clean_committed_state(disable_state);
- drm_atomic_state_put(disable_state);
- restore_state = intel_duplicate_committed_state(dev_priv);
- if (IS_ERR(restore_state))
restore_state = NULL;
- for_each_old_crtc_in_state(restore_state, crtc, crtc_state, i)
crtc_state->active = false;
- for_each_old_plane_in_state(restore_state, plane, plane_state, i)
plane_state->visible = false;
+out:
- dev_priv->modeset_restore_state = restore_state;
}
void intel_finish_reset(struct drm_i915_private *dev_priv) { struct drm_device *dev = &dev_priv->drm;
- struct drm_modeset_acquire_ctx *ctx = &dev_priv->reset_ctx;
- struct drm_atomic_state *state = dev_priv->modeset_restore_state;
- int ret;
struct drm_atomic_state *restore_state =
dev_priv->modeset_restore_state;
/*
- Flips in the rings will be nuked by the reset,
@@ -3557,7 +3626,7 @@ void intel_finish_reset(struct drm_i915_private *dev_priv)
/* reset doesn't touch the display */ if (!gpu_reset_clobbers_display(dev_priv)) {
if (!state) {
if (!restore_state) { /* * Flips in the rings have been nuked by the reset, * so update the base address of all primary
@@ -3569,11 +3638,11 @@ void intel_finish_reset(struct drm_i915_private *dev_priv) */ intel_update_primary_planes(dev); } else {
ret = __intel_display_resume(dev, state, ctx);
if (ret)
DRM_ERROR("Restoring old state failed with %i\n", ret);
} } else {__intel_atomic_commit_tail(restore_state, true);
i915_redisable_vga(dev_priv);
- /*
- The display has been reset as well,
- so need a full re-initialization.
@@ -3589,18 +3658,17 @@ void intel_finish_reset(struct drm_i915_private *dev_priv) dev_priv->display.hpd_irq_setup(dev_priv); spin_unlock_irq(&dev_priv->irq_lock);
ret = __intel_display_resume(dev, state, ctx);
if (ret)
DRM_ERROR("Restoring old state failed with %i\n", ret);
__intel_atomic_commit_tail(restore_state, true);
intel_hpd_init(dev_priv); }
- if (state)
drm_atomic_state_put(state);
- drm_modeset_drop_locks(ctx);
- drm_modeset_acquire_fini(ctx);
- mutex_unlock(&dev->mode_config.mutex);
- if (restore_state) {
drm_atomic_helper_clean_committed_state(restore_state);
drm_atomic_state_put(restore_state);
- }
- up_write(&dev_priv->commit_sem);
}
static bool abort_flip_on_reset(struct intel_crtc *crtc) @@ -12592,29 +12660,18 @@ static int intel_modeset_checks(struct drm_atomic_state *state) { struct intel_atomic_state *intel_state = to_intel_atomic_state(state); struct drm_i915_private *dev_priv = to_i915(state->dev);
- struct drm_crtc *crtc;
- struct drm_crtc_state *old_crtc_state, *new_crtc_state;
- int ret = 0, i;
int ret = 0;
if (!check_digital_port_conflicts(state)) { DRM_DEBUG_KMS("rejecting conflicting digital port configuration\n"); return -EINVAL; }
intel_state->modeset = true; intel_state->active_crtcs = dev_priv->active_crtcs; intel_state->cdclk.logical = dev_priv->cdclk.logical; intel_state->cdclk.actual = dev_priv->cdclk.actual;
for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, new_crtc_state, i) {
if (new_crtc_state->active)
intel_state->active_crtcs |= 1 << i;
else
intel_state->active_crtcs &= ~(1 << i);
if (old_crtc_state->active != new_crtc_state->active)
intel_state->active_pipe_changes |= drm_crtc_mask(crtc);
}
init_intel_state(intel_state);
/*
- See if the config requires any additional preparation, e.g.
@@ -13004,7 +13061,7 @@ static void intel_atomic_helper_free_state_worker(struct work_struct *work) intel_atomic_helper_free_state(dev_priv); }
-static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
+static void __intel_atomic_commit_tail(struct drm_atomic_state *state, bool is_reset)
 {
 	struct drm_device *dev = state->dev;
 	struct intel_atomic_state *intel_state = to_intel_atomic_state(state);
@@ -13068,10 +13125,18 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)

 	/* Only after disabling all output pipelines that will be changed can we
 	 * update the the output configuration. */
-	intel_modeset_update_crtc_state(state);
+	if (!is_reset)
+		intel_modeset_update_crtc_state(state);

 	if (intel_state->modeset) {
-		drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
+		if (!is_reset) {
+			drm_atomic_helper_update_legacy_modeset_state(state->dev, state);
+		} else {
+			for_each_new_crtc_in_state(state, crtc, new_crtc_state, i) {
+				if (new_crtc_state->enable)
+					drm_calc_timestamping_constants(crtc, &new_crtc_state->adjusted_mode);
+			}
+		}

 		intel_set_cdclk(dev_priv, &dev_priv->cdclk.actual);

@@ -13082,7 +13147,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 		if (!intel_can_enable_sagv(state))
 			intel_disable_sagv(dev_priv);

-		intel_modeset_verify_disabled(dev, state);
+		if (!is_reset)
+			intel_modeset_verify_disabled(dev, state);
 	}

 	/* Complete the events for pipes that have now been disabled */
@@ -13135,7 +13201,8 @@ static void __intel_atomic_commit_tail(struct drm_atomic_state *state)
 		if (put_domains[i])
 			modeset_put_power_domains(dev_priv, put_domains[i]);

-		intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
+		if (!is_reset)
+			intel_modeset_verify_crtc(crtc, state, old_crtc_state, new_crtc_state);
 	}

 	if (intel_state->modeset && intel_can_enable_sagv(state))
@@ -13160,10 +13227,14 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)

 	drm_atomic_helper_wait_for_dependencies(state);

-	__intel_atomic_commit_tail(state);
+	down_read(&dev_priv->commit_sem);
+
+	__intel_atomic_commit_tail(state, false);

 	drm_atomic_helper_commit_hw_done(state);

+	up_read(&dev_priv->commit_sem);
+
 	mutex_lock(&dev->struct_mutex);
 	drm_atomic_helper_cleanup_planes(dev, state);
 	mutex_unlock(&dev->struct_mutex);

@@ -15022,6 +15093,8 @@ int intel_modeset_init(struct drm_device *dev)
 	INIT_WORK(&dev_priv->atomic_helper.free_work,
 		  intel_atomic_helper_free_state_worker);

+	init_rwsem(&dev_priv->commit_sem);
+
 	intel_init_quirks(dev);

 	intel_init_pm(dev_priv);
-- 2.13.0
On Fri, Jun 30, 2017 at 03:25:58PM +0200, Daniel Vetter wrote:
On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
From: Ville Syrjälä ville.syrjala@linux.intel.com
Introduce an rw_semaphore to protect the display commits. All normal commits use down_read() and hence can proceed in parallel, but GPU reset will use down_write() making sure no other commits are in progress when we have to pull the plug on the display engine on pre-g4x platforms. There are no modeset/gem locks taken inside __intel_atomic_commit_tail() itself, and we wait for all dependencies before the down_read(), and thus there is no chance of deadlocks with this scheme.
How does this solve the deadlock? Afaiui the deadlock is that the gpu reset stopped unconditionally completing all requests before it did anything else, including anything with the hw or taking modeset locks.
This ensured that any outstanding flips (we only had pageflips, no atomic ones back then) could complete (although maybe displaying garbage). The only thing we had to do was grab the locks (to avoid concurrent modesets) and then we could happily nuke the hw (since the flips were lost in the CS anyway), and restore it afterwards.
Then the TDR rewrite came around and broke that, which now makes atomic time out waiting for the gpu to complete (because the gpu reset waits for the modeset to complete first). Which means if you want to fix this without breaking TDR, then you need to cancel the pending atomic commits. That seems somewhat what you're attempting here with trying to figure out what the latest hw-committed step is (but that function gets it wrong), and lots more locking and stuff on top.
Why exactly can't we just go back to the old world of force-completing dead requests on gen4 and earlier? That would be tons simpler imo instead of throwing a pile of hacks (which really needs a complete rewrite of the atomic code in i915) in as a work-around. We didn't have TDR on gen4 and earlier for years, going back to that isn't going to hurt anyone.
Making working gen4 gpu reset contingent on cancellable atomic commits and the full commit queue is imo nuts.
And if the GEM folks insist the old behavior can't be restored, then we just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere in i915_sw_fence. Force-completing all render requests atomic updates depend upon is imo the simplest solution to this, and we've had a driver that worked like that for years.
And as long as TDR resubmits the batches the render-glitch will at most be temporary, but that's totally fine: We're killing the entire display block in the reset anyway, there's no way the user won't notice _that_ kind of glitch.
Either way, let's please not over-engineer something for a dead-old platform when something much, much, much simpler worked for years, and should keep on working for another few years. -Daniel
PS: One issue with atomic is that there's the small matter of having to wait for pending atomic commits to complete before we nuke the display, to avoid upsetting the display code. We could do that with a dummy commit, or just by having a special wait_for_dependencies that waits for all CRTCs to complete their pending atomic updates.
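The dummy commit variant could be sketched with the existing atomic API along these lines (whether an otherwise empty state is enough to serialize against all pending nonblocking commits is exactly the kind of detail that would need checking; lock acquisition and EDEADLK backoff are left to the caller):

static int flush_pending_commits(struct drm_device *dev,
				 struct drm_modeset_acquire_ctx *ctx)
{
	struct drm_atomic_state *state;
	struct drm_crtc *crtc;
	int ret = 0;

	state = drm_atomic_state_alloc(dev);
	if (!state)
		return -ENOMEM;

	state->acquire_ctx = ctx;

	/* Pull every crtc into the commit so that committing it has to
	 * wait for all earlier commits on all crtcs. */
	drm_for_each_crtc(crtc, dev) {
		struct drm_crtc_state *crtc_state;

		crtc_state = drm_atomic_get_crtc_state(state, crtc);
		if (IS_ERR(crtc_state)) {
			ret = PTR_ERR(crtc_state);
			goto out;
		}
	}

	/* Synchronous commit: returns only once the hw has caught up. */
	ret = drm_atomic_commit(state);
out:
	drm_atomic_state_put(state);
	return ret;
}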
On Fri, Jun 30, 2017 at 03:30:33PM +0200, Daniel Vetter wrote:
On Fri, Jun 30, 2017 at 03:25:58PM +0200, Daniel Vetter wrote:
On Thu, Jun 29, 2017 at 04:49:48PM +0300, ville.syrjala@linux.intel.com wrote:
From: Ville Syrjälä ville.syrjala@linux.intel.com
Introduce an rw_semaphore to protect the display commits. All normal commits use down_read() and hence can proceed in parallel, but GPU reset will use down_write() making sure no other commits are in progress when we have to pull the plug on the display engine on pre-g4x platforms. There are no modeset/gem locks taken inside __intel_atomic_commit_tail() itself, and we wait for all dependencies before the down_read(), and thus there is no chance of deadlocks with this scheme.
How does this solve the deadlock? Afaiui the deadlock is that the gpu reset stopped unconditionally completing all requests before it did anything else, including anything with the hw or taking modeset locks.
This ensured that any outstanding flips (we only had pageflips, no atomic ones back then) could complete (although maybe displaying garbage). The only thing we had to do was grab the locks (to avoid concurrent modesets) and then we could happily nuke the hw (since the flips were lost in the CS anyway), and restore it afterwards.
Then the TDR rewrite came around and broke that, which now makes atomic time out waiting for the gpu to complete (because the gpu reset waits for the modeset to complete first). Which means if you want to fix this without breaking TDR, then you need to cancel the pending atomic commits. That seems somewhat what you're attempting here with trying to figure out what the latest hw-committed step is (but that function gets it wrong), and lots more locking and stuff on top.
Why exactly can't we just go back to the old world of force-completing dead requests on gen4 and earlier? That would be tons simpler imo instead of throwing a pile of hacks (which really needs a complete rewrite of the atomic code in i915) in as a work-around. We didn't have TDR on gen4 and earlier for years, going back to that isn't going to hurt anyone.
Making working gen4 gpu reset contingent on cancellable atomic commits and the full commit queue is imo nuts.
And if the GEM folks insist the old behavior can't be restored, then we just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere in i915_sw_fence. Force-completing all render requests atomic updates depend upon is imo the simplest solution to this, and we've had a driver that worked like that for years.
And it used to break all the time. I think we've had to fix it at least three times by now. So I tend to think it's better to fix it in a way that won't break so easily.
On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjälä ville.syrjala@linux.intel.com wrote:
And if the GEM folks insist the old behavior can't be restored, then we just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere in i915_sw_fence. Force-completing all render requests atomic updates depend upon is imo the simplest solution to this, and we've had a driver that worked like that for years.
And it used to break all the time. I think we've had to fix it at least three times by now. So I tend to think it's better to fix it in a way that won't break so easily.
Why exactly is making the atomic code massively more tricky the easy fix? That's the part I don't get. Yes it got broken a bunch because no one runs CI and everyone forgets that gen3/4 reset the display in gpu reset, but in the end we do have a dependency loop, and either the modeset side or the render side needs to bail out and cancel its async stuff (whether that's a request or a nonblocking flip/atomic commit doesn't matter). In my opinion, cancelling the request (even if we're clever and only cancel the requests for the modeset waiters, which isn't too hard to pull off) seems about the simplest option. Especially since we need that code anyway, even TDR can't save everything and resubmit under all circumstances (at least the buggy batch can't be resubmitted).
Cancelling any kind of atomic commit otoh looks like a lot more complexity. Why do you think this is the easier, or at least less fragile option? This patch series is full of FIXMEs, and even the more complete set seems to have a pile of holes. Plus we need to stop using obj->state, and I don't see any easy way to test for that (since the gen3/4 gpu reset case is the only corner case that currently needs that).
So not seeing how this is easier or more robust at all. What do I miss?
Thanks, Daniel
On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjälä ville.syrjala@linux.intel.com wrote:
And if the GEM folks insist the old behavior can't be restored, then we just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere in i915_sw_fence. Force-completing all render requests atomic updates depend upon is imo the simplest solution to this, and we've had a driver that worked like that for years.
And it used to break all the time. I think we've had to fix it at least three times by now. So I tend to think it's better to fix it in a way that won't break so easily.
Why exactly is making the atomic code massively more tricky the easy fix?
I don't see what this massive trickiness is. Compared to the rest of atomic what I have is absolutely trivial. Just the duplicate_committed_state() and the '.committed_state = foo' assignments in hw_done(). That's it really.
That's the part I don't get. Yes it got broken a bunch because no one runs CI and everyone forgets that gen3/4 reset the display in gpu reset, but in the end we do have a dependency loop, and either the modeset side or the render side needs to bail out and cancel its async stuff (whether that's a request or a nonblocking flip/atomic commit doesn't matter). In my opinion, cancelling the request (even if we're clever and only cancel the requests for the modeset waiters, which isn't too hard to pull off) seems about the simplest option. Especially since we need that code anyway, even TDR can't save everything and resubmit under all circumstances (at least the buggy batch can't be resubmitted).
Cancelling any kind of atomic commit otoh looks like a lot more complexity.
I'm not cancelling anything.
Why do you think this is the easier, or at least less fragile option? This patch series is full of FIXMEs, and even the more complete set seems to have a pile of holes. Plus we need to stop using obj->state, and I don't see any easy way to test for that (since the gen3/4 gpu reset case is the only corner case that currently needs that).
We need to fix that stuff anyway if we ever want to queue up multiple commits for the same crtc. The stuff I have that is specific to this reset stuff is actually very simple. The rest is just fixing up the huge mess we've already made.
On Fri, Jun 30, 2017 at 09:46:36PM +0300, Ville Syrjälä wrote:
On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjälä ville.syrjala@linux.intel.com wrote:
And if the GEM folks insist the old behavior can't be restored, then we just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere in i915_sw_fence. Force-completing all render requests atomic updates depend upon is imo the simplest solution to this, and we've had a driver that worked like that for years.
And it used to break all the time. I think we've had to fix it at least three times by now. So I tend to think it's better to fix it in a way that won't break so easily.
Why exactly is making the atomic code massively more tricky the easy fix?
I don't see what this massive trickiness is. Compared to the rest of atomic what I have is absolutely trivial. Just the duplicate_committed_state() and the '.committed_state = foo' assignments in hw_done(). That's it really.
From a quick look and your description it seems full of races. I'm not sure it'll still be simple once those are fixed.
That's the part I don't get. Yes it got broken a bunch because no one runs CI and everyone forgets that gen3/4 reset the display in gpu reset, but in the end we do have a dependency loop, and either the modeset side or the render side needs to bail out and cancel its async stuff (whether that's a request or a nonblocking flip/atomic commit doesn't matter). In my opinion, cancelling the request (even if we're clever and only cancel the requests for the modeset waiters, which isn't too hard to pull off) seems about the simplest option. Especially since we need that code anyway, even TDR can't save everything and resubmit under all circumstances (at least the buggy batch can't be resubmitted).
Cancelling any kind of atomic commit otoh looks like a lot more complexity.
I'm not cancelling anything.
Well by overtaking the in-flight commit you are at least fighting with that. Either you need to cancel that one, or insert the gpu reset commit at the right point (and with the right state). Current code drops that and instead seems to just hope it doesn't lead to tears too much.
Why do you think this is the easier, or at least less fragile option? This patch series is full of FIXMEs, and even the more complete set seems to have a pile of holes. Plus we need to stop using obj->state, and I don't see any easy way to test for that (since the gen3/4 gpu reset case is the only corner case that currently needs that).
We need to fix that stuff anyway if we ever want to queue up multiple commits for the same crtc. The stuff I have that is specific to this reset stuff is actually very simple. The rest is just fixing up the huge mess we've already made.
Rewriting the world for a regression fix seems a bit much is all I'm saying. And I'm not sure your approach works without that "rewrite the world" step. De facto what your current patches seem to result in is
- we commit the final sw state in gpu reset
- before we resubmit the rendering
That's much easier to pull off by simply force-completing all i915_sw_fences before we take any modeset locks in the gpu reset path. Note that we don't need to force-complete any i915_gem_request, we can just force-complete the i915_sw_fences the work item is blocked on. Needs some care to avoid races with a new atomic commit (since we need to force-complete before we grab locks, one might sneak in), but that's a standard pattern.
Plus we then need to wait for all outstanding nonblocking commits once we do have all modeset locks, since with atomic holding the locks only syncs against synchronous hw commits (i.e. do a fake synchronous commit before we nuke the display and we're good). A variant of wait_for_dependencies that waits for all crtcs instead of just the crtcs in an atomic commit would do I think.
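A sketch of such a wait_for_dependencies variant, assuming the drm_crtc_commit tracking (crtc->commit_list and the hw_done completion) that the nonblocking helpers already maintain; timeouts and error reporting are omitted:

static void wait_for_all_pending_commits(struct drm_device *dev)
{
	struct drm_crtc *crtc;

	drm_for_each_crtc(crtc, dev) {
		struct drm_crtc_commit *commit;

		spin_lock(&crtc->commit_lock);
		/* The newest commit sits at the head; commits on one crtc
		 * complete in order, so waiting for it covers the rest. */
		commit = list_first_entry_or_null(&crtc->commit_list,
						  struct drm_crtc_commit,
						  commit_entry);
		if (commit)
			drm_crtc_commit_get(commit);
		spin_unlock(&crtc->commit_lock);

		if (!commit)
			continue;

		wait_for_completion(&commit->hw_done);
		drm_crtc_commit_put(commit);
	}
}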
None of this requires any of the prep work we need for the fancy additional atomic stuff you plan to do. Which I think is really good for a regression fix.
So again, why do we need to rewrite the world (since these patches here seem to just be the racy poc) to fix reset on gen3/4?
I know you want to do all this, but tangling up a regression fix in a rewrite isn't a good idea in my opinion. I'm not against your long-term plans, I just think it'd be good if we can have them as orthogonal pieces if feasible. -Daniel
On Mon, Jul 03, 2017 at 09:55:48AM +0200, Daniel Vetter wrote:
On Fri, Jun 30, 2017 at 09:46:36PM +0300, Ville Syrjälä wrote:
On Fri, Jun 30, 2017 at 08:23:58PM +0200, Daniel Vetter wrote:
On Fri, Jun 30, 2017 at 5:44 PM, Ville Syrjälä ville.syrjala@linux.intel.com wrote:
And if the GEM folks insist the old behavior can't be restored, then we just need a tailor-made get-out-of-jail card for gen4 gpu reset somewhere in i915_sw_fence. Force-completing all render requests atomic updates depend upon is imo the simplest solution to this, and we've had a driver that worked like that for years.
And it used to break all the time. I think we've had to fix it at least three times by now. So I tend to think it's better to fix it in a way that won't break so easily.
Why exactly is making the atomic code massively more tricky the easy fix?
I don't see what this massive trickiness is. Compared to the rest of atomic what I have is absolutely trivial. Just the duplicate_committed_state() and the '.committed_state = foo' assignments in hw_done(). That's it really.
From a quick look and your description it seems full of races.
This "simple" version has problems, yes. The full versions has just the one potential race between swap_state() and hw_done(). That seems like pretty easy to handle.
I'm not sure it'll still be simple once those are fixed.
That's the part I don't get. Yes it got broken a bunch because no one runs CI and everyone forgets that gen3/4 reset the display in gpu reset, but in the end we do have a dependency loop, and either the modeset side or the render side needs to bail out and cancel its async stuff (whether that's a request or a nonblocking flip/atomic commit doesn't matter). In my opinion, cancelling the request (even if we're clever and only cancel the requests for the modeset waiters, which isn't too hard to pull off) seems about the simplest option. Especially since we need that code anyway, even TDR can't save everything and resubmit under all circumstances (at least the buggy batch can't be resubmitted).
Cancelling any kind of atomic commit otoh looks like a lot more complexity.
I'm not cancelling anything.
Well, by overtaking the in-flight commit you are at least fighting with it. Either you need to cancel that one, or insert the gpu reset commit at the right point (and with the right state). The current code drops that and instead seems to just hope it doesn't end in too many tears.
The simple version is pretty much a "cross our fingers and hope for the best" type of thing, yes. The full version I cooked up earlier doesn't rely on hope.
In fact, I was able to come up with a testcase that makes the simple version explode, whereas the full version works just fine. So we should actually abandon the idea of using the simple version.
Why do you think this is the easier, or at least less fragile, option? This patch series is full of FIXMEs, and even the more complete set seems to have a pile of holes. Plus we need to stop using obj->state, and I don't see any easy way to test for that (since the gen3/4 gpu reset case is the only corner case that currently needs it).
We need to fix that stuff anyway if we ever want to queue up multiple commits for the same crtc. The stuff I have that is specific to this reset stuff is actually very simple. The rest is just fixing up the huge mess we've already made.
Rewriting the world for a regression fix seems a bit much, is all I'm saying. And I'm not sure your approach works without that "rewrite the world" step. De facto, what your current patches seem to result in is:
- we commit the final sw state in gpu reset
- before we resubmit the rendering
That's true. And we do appear to need the refactoring to make it actually work. Most of the refactoring amounts to a couple of small cocci scripts though, so not a big deal really.
That's much easier to pull off by simply force-completing all i915_sw_fences before we take any modeset locks in the gpu reset path. Note that we don't need to force-complete any i915_gem_request, we can just force-complete the i915_sw_fences the work item is blocked on. Needs some care to avoid races with a new atomic commit (since we have to force-complete before we grab the locks, one might sneak in), but that's a standard pattern.
Plus we then need to wait for all outstanding nonblocking commits once we do have all the modeset locks, since with atomic, holding the locks only syncs against synchronous hw commits (i.e. do a fake synchronous commit before we nuke the display and we're good). A variant of wait_for_dependencies that waits for all crtcs instead of just the crtcs in a given atomic commit would do, I think.
None of this requires any of the prep work we need for the fancy additional atomic stuff you plan to do. Which I think is really good for a regression fix.
So again, why do we need to rewrite the world (since these patches here seem to be just the racy PoC) to fix reset on gen3/4?
I know you want to do all this, but tangling up a regression fix in a rewrite isn't a good idea in my opinion. I'm not against your long-term plans, I just think it'd be good if we can have them as orthogonal pieces if feasible.
I've pretty much decided that I'll be happy if I can solve this for future kernels. If someone else wants to try and cook up some kind of simple regression fix I don't have any real objections.
On Mon, Jul 03, 2017 at 12:30:48PM +0300, Ville Syrjälä wrote:
I've pretty much decided that I'll be happy if I can solve this for future kernels. If someone else wants to try and cook up some kind of simple regression fix I don't have any real objections.
Imo fixing the regression isn't the job of the display side, since display isn't the one that broke the reset contract. Well, apart from not syncing with pending non-blocking atomic modeset commits, which we could fix easily by dropping DRIVER_ATOMIC from gen2-4. But I'm still trying to figure out why Chris thinks this isn't a gem regression ... -Daniel
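For reference, the DRIVER_ATOMIC part would be roughly a one-liner at probe time. The fragment below only shows the shape of it; where exactly the check would live, and the gen cutoff spelling, are assumptions for illustration:

	/*
	 * Sketch: clear the feature bit before the device is registered,
	 * so gen2-4 expose no atomic ioctl and userspace can't queue
	 * nonblocking atomic commits there. 'driver' is i915's static
	 * struct drm_driver, 'match_info' the matched intel_device_info.
	 */
	if (match_info->gen < 5)
		driver.driver_features &= ~DRIVER_ATOMIC;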