Re: [Intel-gfx] [RFC PATCH 00/97] Basic GuC submission support in the i915

3 Jun 2021

      On 03/06/2021 04:41, Matthew Brost wrote:
...
On Wed, Jun 02, 2021 at 08:57:02PM +0200, Daniel Vetter wrote:
...
On Wed, Jun 2, 2021 at 5:27 PM Tvrtko Ursulin
tvrtko.ursulin@linux.intel.com wrote:
...
On 25/05/2021 17:45, Matthew Brost wrote:
...
On Tue, May 25, 2021 at 11:32:26AM +0100, Tvrtko Ursulin wrote:
...

Context pinning code with it's magical two adds, subtract and cmpxchg is

dodgy as well.
Daniele tried to remove this and it proved quite difficult + created
even more races in the backend code. This was prior to the pre-pin and
post-unpin code which makes this even more difficult to fix as I believe
these functions would need to be removed first. Not saying we can't
revisit this someday but I personally really like it - it is a clever
way to avoid reentering the pin / unpin code while asynchronous things
are happening rather than some complex locking scheme. Lastly, this code
has proved incredibly stable as I don't think we've had to fix a single
thing in this area since we've been using this code internally.
Pretty much same as above. The code like:
static inline void __intel_context_unpin(struct intel_context *ce)
{
         if (!ce->ops->sched_disable) {
                 __intel_context_do_unpin(ce, 1);
         } else {
                 while (!atomic_add_unless(&ce->pin_count, -1, 1)) {
                         if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1) {
                                 ce->ops->sched_disable(ce);
                                 break;
                         }
                 }
         }
}
That's pretty much impenetrable for me and the only thing I can think of
here is **ALARM** must be broken! See what others think..
Yea, probably should add a comment:
/*

If the context has the sched_disable function, it isn't safe to unpin
until this function completes. This function is allowed to complete
asynchronously too. To avoid this function from being entered twice
and move ownership of the unpin to this function's completion, adjust
the pin count to 2 before it is entered. When this function completes
the context can call intel_context_sched_unpin which decrements the
pin count by 2 potentially resulting in an unpin.

A while loop is needed to ensure the atomicity of the pin count. e.g.
The below if / else statement has a race:

if (atomic_cmpxchg(&ce->pin_count, 1, 2) == 1)
ce->ops->sched_disable(ce);
else
atomic_dec(ce, 1);

Two threads could simultaneously fail the if clause resulting in the
pin_count going to 0 with scheduling enabled + the context pinned.

*/
I have many questions here..
How time bound is the busy loop?
In guc_context_sched_disable the case where someone pins after the magic 
2 has been set is handled.
But what is pin_count got to 2 legitimately, via the unpin and pin 
between the atomic_cmpxchg in __intel_context_unpin and relevant lines 
in guc_context_sched_disable get to execute?
Why is the pin_count dec in guc_context_sched_disable under the 
ce->guc_state.lock?
What is the point of:
enabled = context_enabled(ce);
    if (unlikely(!enabled || submission_disabled(guc))) {
    	if (!enabled)
    		clr_context_enabled(ce);
Reads like clearing the enabled bit if it is not set?!
Why is:
static inline void clr_context_enabled(struct intel_context *ce)
{
    atomic_and((u32)~SCHED_STATE_NO_LOCK_ENABLED,
    	   &ce->guc_sched_state_no_lock);
}
Operating on a field called "guc_sched_state_no_lock" (no lock!) while 
the caller is holding guc_state.lock while manipulating that lock.
Regards,
Tvrtko

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

Re: [Intel-gfx] [RFC PATCH 00/97] Basic GuC submission support in the i915