Ever since 0eafec6d3244 ("drm/i915: Enable lockless lookup of request tracking via RCU"), the i915 driver has used SLAB_TYPESAFE_BY_RCU (it was called SLAB_DESTROY_BY_RCU at the time) in order to allow RCU on i915_request. As nifty as SLAB_TYPESAFE_BY_RCU may be, it comes with some serious disclaimers. In particular, objects can get recycled while RCU readers are still in-flight. This can be ok if everyone who touches these objects knows about the disclaimers and is careful. However, because we've chosen to use SLAB_TYPESAFE_BY_RCU for i915_request and because i915_request contains a dma_fence, we've leaked SLAB_TYPESAFE_BY_RCU and its whole pile of disclaimers to every driver in the kernel which may consume a dma_fence.
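(For reference, the kind of care SLAB_TYPESAFE_BY_RCU demands of every lockless reader looks roughly like the following, paraphrased from the SLAB_TYPESAFE_BY_RCU notes in include/linux/slab.h; lockless_lookup(), try_get_ref() and the key field are illustrative stand-ins, not real API:

	rcu_read_lock();
again:
	obj = lockless_lookup(key);
	if (obj) {
		if (!try_get_ref(obj))	/* may fail for a free object */
			goto again;

		if (obj->key != key) {	/* the object was recycled under us */
			put_ref(obj);
			goto again;
		}
	}
	rcu_read_unlock();

dma_fence_get_rcu_safe() is essentially this pattern specialized for fence pointers.)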
We've tried to keep it somewhat contained by doing most of the hard work to prevent access of recycled objects via dma_fence_get_rcu_safe(). However, a quick grep of kernel sources says that, of the 30 instances of dma_fence_get_rcu*, only 11 of them use dma_fence_get_rcu_safe(). It's likely there are bear traps in DRM and related subsystems just waiting for someone to accidentally step in them.
This patch series stops us using SLAB_TYPESAFE_BY_RCU for i915_request and, instead, does an RCU-safe slab free via call_rcu(). This should let us keep most of the perf benefits of slab allocation while avoiding the bear traps inherent in SLAB_TYPESAFE_BY_RCU. It then removes support for SLAB_TYPESAFE_BY_RCU from dma_fence entirely.
Note: The last patch is labeled DONOTMERGE. This was at Daniel Vetter's request as we may want to let this bake for a couple of releases before we rip out dma_fence_get_rcu_safe entirely.
Signed-off-by: Jason Ekstrand jason@jlekstrand.net
Cc: Jon Bloomfield jon.bloomfield@intel.com
Cc: Daniel Vetter daniel.vetter@ffwll.ch
Cc: Christian König christian.koenig@amd.com
Cc: Dave Airlie airlied@redhat.com
Cc: Matthew Auld matthew.auld@intel.com
Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com
Jason Ekstrand (5):
  drm/i915: Move intel_engine_free_request_pool to i915_request.c
  drm/i915: Use a simpler scheme for caching i915_request
  drm/i915: Stop using SLAB_TYPESAFE_BY_RCU for i915_request
  dma-buf: Stop using SLAB_TYPESAFE_BY_RCU in selftests
  DONOTMERGE: dma-buf: Get rid of dma_fence_get_rcu_safe
 drivers/dma-buf/dma-fence-chain.c         |   8 +-
 drivers/dma-buf/dma-resv.c                |   4 +-
 drivers/dma-buf/st-dma-fence-chain.c      |  24 +---
 drivers/dma-buf/st-dma-fence.c            |  27 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |   4 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c |   8 --
 drivers/gpu/drm/i915/i915_active.h        |   4 +-
 drivers/gpu/drm/i915/i915_request.c       | 147 ++++++++++++----------
 drivers/gpu/drm/i915/i915_request.h       |   2 -
 drivers/gpu/drm/i915/i915_vma.c           |   4 +-
 include/drm/drm_syncobj.h                 |   4 +-
 include/linux/dma-fence.h                 |  50 --------
 include/linux/dma-resv.h                  |   4 +-
 13 files changed, 110 insertions(+), 180 deletions(-)
This appears to break encapsulation by moving an intel_engine_cs function to an i915_request file. However, this function is intrinsically tied to the lifetime rules and allocation scheme of i915_request, and having it in intel_engine_cs.c leaks details of i915_request. We have an abstraction leak either way. Since i915_request's allocation scheme is far more subtle than the simple pointer that is intel_engine_cs.request_pool, it's probably better to keep i915_request's details to itself.
Signed-off-by: Jason Ekstrand jason@jlekstrand.net
Cc: Jon Bloomfield jon.bloomfield@intel.com
Cc: Daniel Vetter daniel.vetter@intel.com
Cc: Matthew Auld matthew.auld@intel.com
Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com
---
 drivers/gpu/drm/i915/gt/intel_engine_cs.c | 8 --------
 drivers/gpu/drm/i915/i915_request.c       | 7 +++++--
 drivers/gpu/drm/i915/i915_request.h       | 2 --
 3 files changed, 5 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c index 9ceddfbb1687d..df6b80ec84199 100644 --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c @@ -422,14 +422,6 @@ void intel_engines_release(struct intel_gt *gt) } }
-void intel_engine_free_request_pool(struct intel_engine_cs *engine) -{ - if (!engine->request_pool) - return; - - kmem_cache_free(i915_request_slab_cache(), engine->request_pool); -} - void intel_engines_free(struct intel_gt *gt) { struct intel_engine_cs *engine; diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c index 1014c71cf7f52..48c5f8527854b 100644 --- a/drivers/gpu/drm/i915/i915_request.c +++ b/drivers/gpu/drm/i915/i915_request.c @@ -106,9 +106,12 @@ static signed long i915_fence_wait(struct dma_fence *fence, timeout); }
-struct kmem_cache *i915_request_slab_cache(void) +void intel_engine_free_request_pool(struct intel_engine_cs *engine) { - return global.slab_requests; + if (!engine->request_pool) + return; + + kmem_cache_free(global.slab_requests, engine->request_pool); }
static void i915_fence_release(struct dma_fence *fence) diff --git a/drivers/gpu/drm/i915/i915_request.h b/drivers/gpu/drm/i915/i915_request.h index 270f6cd37650c..f84c38d29f988 100644 --- a/drivers/gpu/drm/i915/i915_request.h +++ b/drivers/gpu/drm/i915/i915_request.h @@ -300,8 +300,6 @@ static inline bool dma_fence_is_i915(const struct dma_fence *fence) return fence->ops == &i915_fence_ops; }
-struct kmem_cache *i915_request_slab_cache(void); - struct i915_request * __must_check __i915_request_create(struct intel_context *ce, gfp_t gfp); struct i915_request * __must_check
On 09/06/2021 22:29, Jason Ekstrand wrote:
The argument that the slab cache shouldn't be exported from i915_request.c sounds good to me.
But I think a better step than simply reversing the break of encapsulation (and it's even worse now because it leaks a much higher-level object!) could be to export a freeing helper from i915_request.c, which the engine pool would then use:
void __i915_request_free(...) { kmem_cache_free(...); }
?
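Spelled out, presumably something like this (just a sketch; __i915_request_free() doesn't exist in the tree, and global.slab_requests is the slab already private to i915_request.c):

	/* i915_request.c */
	void __i915_request_free(struct i915_request *rq)
	{
		kmem_cache_free(global.slab_requests, rq);
	}

	/* intel_engine_cs.c */
	void intel_engine_free_request_pool(struct intel_engine_cs *engine)
	{
		if (!engine->request_pool)
			return;

		__i915_request_free(engine->request_pool);
	}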
Regards,
Tvrtko
On Thu, Jun 10, 2021 at 5:04 AM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
That was what I did at first. However, the semantics of how the pointer is touched/modified are really also part of i915_request. In particular, the use of xchg and cmpxchg. So I pulled the one other access (besides NULL initializing) into i915_request.c which meant pulling in intel_engine_free_request_pool.
Really, if we wanted proper encapsulation here, we'd have
struct i915_request_cache { struct i915_request *rq; };
void i915_request_cache_init(struct i915_request_cache *cache); void i915_request_cache_finish(struct i915_request_cache *cache);
all in i915_request.h and have all the gory details inside i915_request.c. Then all intel_engine_cs knows is that it has a request cache.
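As a rough illustration of that idea (nothing like this exists today; the xchg() mirrors how the existing request_pool pointer is already manipulated):

	/* i915_request.h */
	struct i915_request_cache {
		struct i915_request *rq;
	};

	void i915_request_cache_init(struct i915_request_cache *cache);
	void i915_request_cache_finish(struct i915_request_cache *cache);

	/* i915_request.c */
	void i915_request_cache_init(struct i915_request_cache *cache)
	{
		cache->rq = NULL;
	}

	void i915_request_cache_finish(struct i915_request_cache *cache)
	{
		struct i915_request *rq = xchg(&cache->rq, NULL);

		if (rq)
			kmem_cache_free(global.slab_requests, rq);
	}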
If we really want to go that far, we can, I suppose.
--Jason
On 10/06/2021 14:57, Jason Ekstrand wrote:
Hmmm in my view the only break of encapsulation at the moment is that intel_engine_cs.c knows requests have been allocated from a dedicated slab.
The semantics of how the request pool pointer is managed (the xchg and cmpxchg) are already in i915_request.c, so I don't exactly follow what the problem is with wrapping the knowledge of how requests should be freed inside i915_request.c as well?
Unless you view the fact that intel_engine_cs contains a pointer to i915_request as a break as well? But even then... <continued below>
... with this scheme you'd have intel_engine_cs contain a pointer to i915_request_cache, which does not seem like a particularly exciting improvement to me since the wrapping would be extremely thin with no fundamental changes.
So for me exporting new __i915_request_free() from i915_request.c makes things a bit better and I don't think we need to go further than that.
I mean, there is the issue of i915_request.c knowing about engines having request pools, but I am not sure whether, with the i915_request_cache you proposed, that knowledge would be removed, and how?
From the design point of view, given that the request pool is used only for engine PM, a cleaner design could be to manage this from engine PM. For example, if parking cannot use GFP_KERNEL, then check whether unparking can, and explicitly allocate a request there to be consumed at parking time. It may require some splitting of the request creation path though, to allocate the request but not put it on the kernel timeline until park time.
Regards,
Tvrtko
On Thu, Jun 10, 2021 at 10:07 AM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
No, it would contain an i915_request_cache, not a pointer to one. It wouldn't fundamentally change any data structures; just add wrapping.
Yeah, it's not particularly exciting.
So for me exporting new __i915_request_free() from i915_request.c makes things a bit better and I don't think we need to go further than that.
I'm not sure it's necessary either. The thing that bothers me is that we have this pointer that's clearly managed by i915_request.c but is initialized and finished by intel_engine_cs.c. Certainly adding an i915_request_free() is better than what we have today. I'm not sure it's enough better to really make me happy but, TBH, the whole request cache thing is a bit of a mess....
It doesn't, really. As long as we're stashing a request in the engine, there's still an encapsulation problem no matter what we do.
And now we're getting to the heart of things. :-) Daniel mentioned this too. Maybe if the real problem here is that engine parking can't allocate memory, we need to just fix engine parking to either not require an i915_request somehow or to do its own caching somehow. I'm going to look into this.
--Jason
Instead of attempting to recycle a request into the cache when it retires, stuff a new one in the cache every time we allocate a request for some other reason.
Signed-off-by: Jason Ekstrand jason@jlekstrand.net
Cc: Jon Bloomfield jon.bloomfield@intel.com
Cc: Daniel Vetter daniel.vetter@intel.com
Cc: Matthew Auld matthew.auld@intel.com
Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com
---
 drivers/gpu/drm/i915/i915_request.c | 66 ++++++++++++++---------------
 1 file changed, 31 insertions(+), 35 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c index 48c5f8527854b..e531c74f0b0e2 100644 --- a/drivers/gpu/drm/i915/i915_request.c +++ b/drivers/gpu/drm/i915/i915_request.c @@ -128,41 +128,6 @@ static void i915_fence_release(struct dma_fence *fence) i915_sw_fence_fini(&rq->submit); i915_sw_fence_fini(&rq->semaphore);
- /* - * Keep one request on each engine for reserved use under mempressure - * - * We do not hold a reference to the engine here and so have to be - * very careful in what rq->engine we poke. The virtual engine is - * referenced via the rq->context and we released that ref during - * i915_request_retire(), ergo we must not dereference a virtual - * engine here. Not that we would want to, as the only consumer of - * the reserved engine->request_pool is the power management parking, - * which must-not-fail, and that is only run on the physical engines. - * - * Since the request must have been executed to be have completed, - * we know that it will have been processed by the HW and will - * not be unsubmitted again, so rq->engine and rq->execution_mask - * at this point is stable. rq->execution_mask will be a single - * bit if the last and _only_ engine it could execution on was a - * physical engine, if it's multiple bits then it started on and - * could still be on a virtual engine. Thus if the mask is not a - * power-of-two we assume that rq->engine may still be a virtual - * engine and so a dangling invalid pointer that we cannot dereference - * - * For example, consider the flow of a bonded request through a virtual - * engine. The request is created with a wide engine mask (all engines - * that we might execute on). On processing the bond, the request mask - * is reduced to one or more engines. If the request is subsequently - * bound to a single engine, it will then be constrained to only - * execute on that engine and never returned to the virtual engine - * after timeslicing away, see __unwind_incomplete_requests(). Thus we - * know that if the rq->execution_mask is a single bit, rq->engine - * can be a physical engine with the exact corresponding mask. - */ - if (is_power_of_2(rq->execution_mask) && - !cmpxchg(&rq->engine->request_pool, NULL, rq)) - return; - kmem_cache_free(global.slab_requests, rq); }
@@ -869,6 +834,29 @@ static void retire_requests(struct intel_timeline *tl)
 		break;
 }

+static void
+ensure_cached_request(struct i915_request **rsvd, gfp_t gfp)
+{
+	struct i915_request *rq;
+
+	/* Don't try to add to the cache if we don't allow blocking.  That
+	 * just increases the chance that the actual allocation will fail.
+	 */
+	if (!gfpflags_allow_blocking(gfp))
+		return;
+
+	if (READ_ONCE(*rsvd))
+		return;
+
+	rq = kmem_cache_alloc(global.slab_requests,
+			      gfp | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
+	if (!rq)
+		return; /* Oops but nothing we can do */
+
+	if (cmpxchg(rsvd, NULL, rq))
+		kmem_cache_free(global.slab_requests, rq);
+}
+
 static noinline struct i915_request *
 request_alloc_slow(struct intel_timeline *tl,
 		   struct i915_request **rsvd,
@@ -937,6 +925,14 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp)
 	/* Check that the caller provided an already pinned context */
 	__intel_context_pin(ce);

+	/* Before we do anything, try to make sure we have at least one
+	 * request in the engine's cache.  If we get here with GFP_NOWAIT
+	 * (this can happen when switching to a kernel context), we want
+	 * to try very hard to not fail and we fall back to this cache.
+	 * Top it off with a fresh request whenever it's empty.
+	 */
+	ensure_cached_request(&ce->engine->request_pool, gfp);
+
 	/*
 	 * Beware: Dragons be flying overhead.
 	 *
On 09/06/2021 22:29, Jason Ekstrand wrote:
I suppose the "why?" is "simpler scheme" - but in what way is it simpler?
Linus has been known to rant passionately against this comment style so we actively try to never use it.
Regards,
Tvrtko
On Thu, Jun 10, 2021 at 5:08 AM Tvrtko Ursulin tvrtko.ursulin@linux.intel.com wrote:
Maybe it's not simpler? One way in which it's simpler is that it doesn't require funny logic to try and figure out whether or not it's on a virtual engine. Everyone gets a request pool. Done.
Back to the "why". First, in my tome of an e-mail I just sent about dma_fence_get_rcu_safe(), I mentioned that SLAB_TYPESAFE_BY_RCU isn't the only way you can end up with a recycled object where you don't want one. Any caching mechanism that isn't sufficiently careful can result in such recycled objects. In particular, this one can because we don't wait for an RCU grace period before stuffing the newly released fence into request_pool.
The other reason why I like this one better is that, if any request has been created for this engine since the last time request_pool was set to NULL, then we've attempted to re-fill request_pool. This is different from the current behavior where request_pool only gets refilled if something has retired since the last time it was set to NULL. AFAIUI, the fence pool is primarily used for switching to a kernel context for PM/MM stuff. That's only ever going to happen if a request has been submitted from userspace since the last time we did it and hence a fence is sitting there in the request_pool. While it's not 100% guaranteed, this should mean memory allocation failures on that path are less likely than with the fill-on-release scheme. No, I don't have numbers on this.
What comment style? It's a comment. You'll need to be more specific.
--Jason
Ever since 0eafec6d3244 ("drm/i915: Enable lockless lookup of request tracking via RCU"), the i915 driver has used SLAB_TYPESAFE_BY_RCU (it was called SLAB_DESTROY_BY_RCU at the time) in order to allow RCU on i915_request. As nifty as SLAB_TYPESAFE_BY_RCU may be, it comes with some serious disclaimers. In particular, objects can get recycled while RCU readers are still in-flight. This can be ok if everyone who touches these objects knows about the disclaimers and is careful. However, because we've chosen to use SLAB_TYPESAFE_BY_RCU for i915_request and because i915_request contains a dma_fence, we've leaked SLAB_TYPESAFE_BY_RCU and its whole pile of disclaimers to every driver in the kernel which may consume a dma_fence.
We've tried to keep it somewhat contained by doing most of the hard work to prevent access of recycled objects via dma_fence_get_rcu_safe(). However, a quick grep of kernel sources says that, of the 30 instances of dma_fence_get_rcu*, only 11 of them use dma_fence_get_rcu_safe(). It's likely there are bear traps in DRM and related subsystems just waiting for someone to accidentally step in them.
This commit stops us using SLAB_TYPESAFE_BY_RCU for i915_request and, instead, does an RCU-safe slab free via call_rcu(). This should let us keep most of the perf benefits of slab allocation while avoiding the bear traps inherent in SLAB_TYPESAFE_BY_RCU.
Signed-off-by: Jason Ekstrand jason@jlekstrand.net
Cc: Jon Bloomfield jon.bloomfield@intel.com
Cc: Daniel Vetter daniel.vetter@ffwll.ch
Cc: Christian König christian.koenig@amd.com
Cc: Dave Airlie airlied@redhat.com
Cc: Matthew Auld matthew.auld@intel.com
Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com
---
 drivers/gpu/drm/i915/i915_request.c | 76 ++++++++++++++++-------------
 1 file changed, 43 insertions(+), 33 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c index e531c74f0b0e2..55fa938126100 100644 --- a/drivers/gpu/drm/i915/i915_request.c +++ b/drivers/gpu/drm/i915/i915_request.c @@ -111,9 +111,44 @@ void intel_engine_free_request_pool(struct intel_engine_cs *engine) if (!engine->request_pool) return;
+ /* + * It's safe to free this right away because we always put a fresh + * i915_request in the cache that's never been touched by an RCU + * reader. + */ kmem_cache_free(global.slab_requests, engine->request_pool); }
+static void __i915_request_free(struct rcu_head *head) +{ + struct i915_request *rq = container_of(head, typeof(*rq), fence.rcu); + + kmem_cache_free(global.slab_requests, rq); +} + +static void i915_request_free_rcu(struct i915_request *rq) +{ + /* + * Because we're on a slab allocator, memory may be re-used the + * moment we free it. There is no kfree_rcu() equivalent for + * slabs. Instead, we hand-roll it here with call_rcu(). This + * gives us all the perf benefits to slab allocation while ensuring + * that we never release a request back to the slab until there are + * no more readers. + * + * We do have to be careful, though, when calling kmem_cache_destroy() + * as there may be outstanding free requests. This is solved by + * inserting an rcu_barrier() before kmem_cache_destroy(). An RCU + * barrier is sufficient and we don't need synchronize_rcu() + * because the call_rcu() here will wait on any outstanding RCU + * readers and the rcu_barrier() will wait on any outstanding + * call_rcu() callbacks. So, if there are any readers who once had + * valid references to a request, rcu_barrier() will end up waiting + * on them by transitivity. + */ + call_rcu(&rq->fence.rcu, __i915_request_free); +} + static void i915_fence_release(struct dma_fence *fence) { struct i915_request *rq = to_request(fence); @@ -127,8 +162,7 @@ static void i915_fence_release(struct dma_fence *fence) */ i915_sw_fence_fini(&rq->submit); i915_sw_fence_fini(&rq->semaphore); - - kmem_cache_free(global.slab_requests, rq); + i915_request_free_rcu(rq); }
const struct dma_fence_ops i915_fence_ops = { @@ -933,35 +967,6 @@ __i915_request_create(struct intel_context *ce, gfp_t gfp) */ ensure_cached_request(&ce->engine->request_pool, gfp);
- /* - * Beware: Dragons be flying overhead. - * - * We use RCU to look up requests in flight. The lookups may - * race with the request being allocated from the slab freelist. - * That is the request we are writing to here, may be in the process - * of being read by __i915_active_request_get_rcu(). As such, - * we have to be very careful when overwriting the contents. During - * the RCU lookup, we change chase the request->engine pointer, - * read the request->global_seqno and increment the reference count. - * - * The reference count is incremented atomically. If it is zero, - * the lookup knows the request is unallocated and complete. Otherwise, - * it is either still in use, or has been reallocated and reset - * with dma_fence_init(). This increment is safe for release as we - * check that the request we have a reference to and matches the active - * request. - * - * Before we increment the refcount, we chase the request->engine - * pointer. We must not call kmem_cache_zalloc() or else we set - * that pointer to NULL and cause a crash during the lookup. If - * we see the request is completed (based on the value of the - * old engine and seqno), the lookup is complete and reports NULL. - * If we decide the request is not completed (new engine or seqno), - * then we grab a reference and double check that it is still the - * active request - which it won't be and restart the lookup. - * - * Do not use kmem_cache_zalloc() here! - */ rq = kmem_cache_alloc(global.slab_requests, gfp | __GFP_RETRY_MAYFAIL | __GFP_NOWARN); if (unlikely(!rq)) { @@ -2116,6 +2121,12 @@ static void i915_global_request_shrink(void)
static void i915_global_request_exit(void) { + /* + * We need to rcu_barrier() before destroying slab_requests. See + * i915_request_free_rcu() for more details. + */ + rcu_barrier(); + kmem_cache_destroy(global.slab_execute_cbs); kmem_cache_destroy(global.slab_requests); } @@ -2132,8 +2143,7 @@ int __init i915_global_request_init(void) sizeof(struct i915_request), __alignof__(struct i915_request), SLAB_HWCACHE_ALIGN | - SLAB_RECLAIM_ACCOUNT | - SLAB_TYPESAFE_BY_RCU, + SLAB_RECLAIM_ACCOUNT, __i915_request_ctor); if (!global.slab_requests) return -ENOMEM;
The only real-world dma_fence user of SLAB_TYPESAFE_BY_RCU was i915 and it doesn't use it anymore, so there's no need to be testing it in the dma_fence selftests.
Signed-off-by: Jason Ekstrand jason@jlekstrand.net
Cc: Daniel Vetter daniel.vetter@ffwll.ch
Cc: Christian König christian.koenig@amd.com
Cc: Matthew Auld matthew.auld@intel.com
Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com
---
 drivers/dma-buf/st-dma-fence-chain.c | 24 ++++--------------------
 drivers/dma-buf/st-dma-fence.c       | 27 +++++----------------------
 2 files changed, 9 insertions(+), 42 deletions(-)
diff --git a/drivers/dma-buf/st-dma-fence-chain.c b/drivers/dma-buf/st-dma-fence-chain.c index 9525f7f561194..73010184559fe 100644 --- a/drivers/dma-buf/st-dma-fence-chain.c +++ b/drivers/dma-buf/st-dma-fence-chain.c @@ -19,36 +19,27 @@
#define CHAIN_SZ (4 << 10)
-static struct kmem_cache *slab_fences; - -static inline struct mock_fence { +struct mock_fence { struct dma_fence base; spinlock_t lock; -} *to_mock_fence(struct dma_fence *f) { - return container_of(f, struct mock_fence, base); -} +};
static const char *mock_name(struct dma_fence *f) { return "mock"; }
-static void mock_fence_release(struct dma_fence *f) -{ - kmem_cache_free(slab_fences, to_mock_fence(f)); -} - static const struct dma_fence_ops mock_ops = { .get_driver_name = mock_name, .get_timeline_name = mock_name, - .release = mock_fence_release, + .release = dma_fence_free, };
static struct dma_fence *mock_fence(void) { struct mock_fence *f;
- f = kmem_cache_alloc(slab_fences, GFP_KERNEL); + f = kmalloc(sizeof(*f), GFP_KERNEL); if (!f) return NULL;
@@ -701,14 +692,7 @@ int dma_fence_chain(void) pr_info("sizeof(dma_fence_chain)=%zu\n", sizeof(struct dma_fence_chain));
- slab_fences = KMEM_CACHE(mock_fence, - SLAB_TYPESAFE_BY_RCU | - SLAB_HWCACHE_ALIGN); - if (!slab_fences) - return -ENOMEM; - ret = subtests(tests, NULL);
- kmem_cache_destroy(slab_fences); return ret; } diff --git a/drivers/dma-buf/st-dma-fence.c b/drivers/dma-buf/st-dma-fence.c index c8a12d7ad71ab..ca98cb0b9525b 100644 --- a/drivers/dma-buf/st-dma-fence.c +++ b/drivers/dma-buf/st-dma-fence.c @@ -14,25 +14,16 @@
#include "selftest.h"
-static struct kmem_cache *slab_fences; - -static struct mock_fence { +struct mock_fence { struct dma_fence base; struct spinlock lock; -} *to_mock_fence(struct dma_fence *f) { - return container_of(f, struct mock_fence, base); -} +};
static const char *mock_name(struct dma_fence *f) { return "mock"; }
-static void mock_fence_release(struct dma_fence *f) -{ - kmem_cache_free(slab_fences, to_mock_fence(f)); -} - struct wait_cb { struct dma_fence_cb cb; struct task_struct *task; @@ -77,14 +68,14 @@ static const struct dma_fence_ops mock_ops = { .get_driver_name = mock_name, .get_timeline_name = mock_name, .wait = mock_wait, - .release = mock_fence_release, + .release = dma_fence_free, };
static struct dma_fence *mock_fence(void) { struct mock_fence *f;
- f = kmem_cache_alloc(slab_fences, GFP_KERNEL); + f = kmalloc(sizeof(*f), GFP_KERNEL); if (!f) return NULL;
@@ -463,7 +454,7 @@ static int thread_signal_callback(void *arg)
rcu_read_lock(); do { - f2 = dma_fence_get_rcu_safe(&t->fences[!t->id]); + f2 = dma_fence_get_rcu(t->fences[!t->id]); } while (!f2 && !kthread_should_stop()); rcu_read_unlock();
@@ -563,15 +554,7 @@ int dma_fence(void)
pr_info("sizeof(dma_fence)=%zu\n", sizeof(struct dma_fence));
- slab_fences = KMEM_CACHE(mock_fence, - SLAB_TYPESAFE_BY_RCU | - SLAB_HWCACHE_ALIGN); - if (!slab_fences) - return -ENOMEM; - ret = subtests(tests, NULL);
- kmem_cache_destroy(slab_fences); - return ret; }
Hi Jason,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on drm-tip/drm-tip drm-exynos/exynos-drm-next tegra-drm/drm/tegra/for-next linus/master v5.13-rc6 next-20210615]
[cannot apply to drm/drm-next]
[If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Jason-Ekstrand/dma-fence-i915-Stop-...
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: sparc-randconfig-s032-20210615 (attached as .config)
compiler: sparc-linux-gcc (GCC) 9.3.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.3-341-g8af24329-dirty
        # https://github.com/0day-ci/linux/commit/c889567ea79d1ce55ff8868bae789bbb3223...
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Jason-Ekstrand/dma-fence-i915-Stop-allowing-SLAB_TYPESAFE_BY_RCU-for-dma_fence/20210616-154432
        git checkout c889567ea79d1ce55ff8868bae789bbb3223503d
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' W=1 ARCH=sparc
If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot lkp@intel.com
sparse warnings: (new ones prefixed by >>)
>> drivers/dma-buf/st-dma-fence.c:457:57: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct dma_fence *fence @@ got struct dma_fence [noderef] __rcu * @@
   drivers/dma-buf/st-dma-fence.c:457:57: sparse:     expected struct dma_fence *fence
   drivers/dma-buf/st-dma-fence.c:457:57: sparse:     got struct dma_fence [noderef] __rcu *
vim +457 drivers/dma-buf/st-dma-fence.c
434 435 static int thread_signal_callback(void *arg) 436 { 437 const struct race_thread *t = arg; 438 unsigned long pass = 0; 439 unsigned long miss = 0; 440 int err = 0; 441 442 while (!err && !kthread_should_stop()) { 443 struct dma_fence *f1, *f2; 444 struct simple_cb cb; 445 446 f1 = mock_fence(); 447 if (!f1) { 448 err = -ENOMEM; 449 break; 450 } 451 452 rcu_assign_pointer(t->fences[t->id], f1); 453 smp_wmb(); 454 455 rcu_read_lock(); 456 do {
457 f2 = dma_fence_get_rcu(t->fences[!t->id]);
458 } while (!f2 && !kthread_should_stop()); 459 rcu_read_unlock(); 460 461 if (t->before) 462 dma_fence_signal(f1); 463 464 smp_store_mb(cb.seen, false); 465 if (!f2 || 466 dma_fence_add_callback(f2, &cb.cb, simple_callback)) { 467 miss++; 468 cb.seen = true; 469 } 470 471 if (!t->before) 472 dma_fence_signal(f1); 473 474 if (!cb.seen) { 475 dma_fence_wait(f2, false); 476 __wait_for_callbacks(f2); 477 } 478 479 if (!READ_ONCE(cb.seen)) { 480 pr_err("Callback not seen on thread %d, pass %lu (%lu misses), signaling %s add_callback; fence signaled? %s\n", 481 t->id, pass, miss, 482 t->before ? "before" : "after", 483 dma_fence_is_signaled(f2) ? "yes" : "no"); 484 err = -EINVAL; 485 } 486 487 dma_fence_put(f2); 488 489 rcu_assign_pointer(t->fences[t->id], NULL); 490 smp_wmb(); 491 492 dma_fence_put(f1); 493 494 pass++; 495 } 496 497 pr_info("%s[%d] completed %lu passes, %lu misses\n", 498 __func__, t->id, pass, miss); 499 return err; 500 } 501
--- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
This helper existed to handle the weird corner-cases caused by using SLAB_TYPESAFE_BY_RCU for backing dma_fence. Now that no one is using that anymore (i915 was the only real user), dma_fence_get_rcu is sufficient. The one slightly annoying thing we have to deal with here is that dma_fence_get_rcu_safe did an rcu_dereference as well as a SLAB_TYPESAFE_BY_RCU-safe dma_fence_get_rcu. This means each call site ends up being 3 lines instead of 1.
Signed-off-by: Jason Ekstrand jason@jlekstrand.net Cc: Daniel Vetter daniel.vetter@ffwll.ch Cc: Christian König christian.koenig@amd.com Cc: Matthew Auld matthew.auld@intel.com Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com --- drivers/dma-buf/dma-fence-chain.c | 8 ++-- drivers/dma-buf/dma-resv.c | 4 +- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 +- drivers/gpu/drm/i915/i915_active.h | 4 +- drivers/gpu/drm/i915/i915_vma.c | 4 +- include/drm/drm_syncobj.h | 4 +- include/linux/dma-fence.h | 50 ----------------------- include/linux/dma-resv.h | 4 +- 8 files changed, 23 insertions(+), 59 deletions(-)
diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c index 7d129e68ac701..46dfc7d94d8ed 100644 --- a/drivers/dma-buf/dma-fence-chain.c +++ b/drivers/dma-buf/dma-fence-chain.c @@ -15,15 +15,17 @@ static bool dma_fence_chain_enable_signaling(struct dma_fence *fence); * dma_fence_chain_get_prev - use RCU to get a reference to the previous fence * @chain: chain node to get the previous node from * - * Use dma_fence_get_rcu_safe to get a reference to the previous fence of the - * chain node. + * Use rcu_dereference and dma_fence_get_rcu to get a reference to the + * previous fence of the chain node. */ static struct dma_fence *dma_fence_chain_get_prev(struct dma_fence_chain *chain) { struct dma_fence *prev;
rcu_read_lock(); - prev = dma_fence_get_rcu_safe(&chain->prev); + prev = rcu_dereference(chain->prev); + if (prev) + prev = dma_fence_get_rcu(prev); rcu_read_unlock(); return prev; } diff --git a/drivers/dma-buf/dma-resv.c b/drivers/dma-buf/dma-resv.c index f26c71747d43a..cfe0db3cca292 100644 --- a/drivers/dma-buf/dma-resv.c +++ b/drivers/dma-buf/dma-resv.c @@ -376,7 +376,9 @@ int dma_resv_copy_fences(struct dma_resv *dst, struct dma_resv *src) dst_list = NULL; }
- new = dma_fence_get_rcu_safe(&src->fence_excl); + new = rcu_dereference(src->fence_excl); + if (new) + new = dma_fence_get_rcu(new); rcu_read_unlock();
src_list = dma_resv_shared_list(dst); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c index 72d9b92b17547..0aeb6117f3893 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c @@ -161,7 +161,9 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct dma_fence *old;
rcu_read_lock(); - old = dma_fence_get_rcu_safe(ptr); + old = rcu_dereference(*ptr); + if (old) + old = dma_fence_get_rcu(old); rcu_read_unlock();
if (old) { diff --git a/drivers/gpu/drm/i915/i915_active.h b/drivers/gpu/drm/i915/i915_active.h index d0feda68b874f..bd89cfc806ca5 100644 --- a/drivers/gpu/drm/i915/i915_active.h +++ b/drivers/gpu/drm/i915/i915_active.h @@ -103,7 +103,9 @@ i915_active_fence_get(struct i915_active_fence *active) struct dma_fence *fence;
rcu_read_lock(); - fence = dma_fence_get_rcu_safe(&active->fence); + fence = rcu_dereference(active->fence); + if (fence) + fence = dma_fence_get_rcu(fence); rcu_read_unlock();
return fence; diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c index 0f227f28b2802..ed0388d99197e 100644 --- a/drivers/gpu/drm/i915/i915_vma.c +++ b/drivers/gpu/drm/i915/i915_vma.c @@ -351,7 +351,9 @@ int i915_vma_wait_for_bind(struct i915_vma *vma) struct dma_fence *fence;
rcu_read_lock(); - fence = dma_fence_get_rcu_safe(&vma->active.excl.fence); + fence = rcu_dereference(vma->active.excl.fence); + if (fence) + fence = dma_fence_get_rcu(fence); rcu_read_unlock(); if (fence) { err = dma_fence_wait(fence, MAX_SCHEDULE_TIMEOUT); diff --git a/include/drm/drm_syncobj.h b/include/drm/drm_syncobj.h index 6cf7243a1dc5e..6c45d52988bcc 100644 --- a/include/drm/drm_syncobj.h +++ b/include/drm/drm_syncobj.h @@ -105,7 +105,9 @@ drm_syncobj_fence_get(struct drm_syncobj *syncobj) struct dma_fence *fence;
rcu_read_lock(); - fence = dma_fence_get_rcu_safe(&syncobj->fence); + fence = rcu_dereference(syncobj->fence); + if (fence) + fence = dma_fence_get_rcu(syncobj->fence); rcu_read_unlock();
return fence; diff --git a/include/linux/dma-fence.h b/include/linux/dma-fence.h index 6ffb4b2c63715..f4a2ab2b1ae46 100644 --- a/include/linux/dma-fence.h +++ b/include/linux/dma-fence.h @@ -307,56 +307,6 @@ static inline struct dma_fence *dma_fence_get_rcu(struct dma_fence *fence) return NULL; }
-/** - * dma_fence_get_rcu_safe - acquire a reference to an RCU tracked fence - * @fencep: pointer to fence to increase refcount of - * - * Function returns NULL if no refcount could be obtained, or the fence. - * This function handles acquiring a reference to a fence that may be - * reallocated within the RCU grace period (such as with SLAB_TYPESAFE_BY_RCU), - * so long as the caller is using RCU on the pointer to the fence. - * - * An alternative mechanism is to employ a seqlock to protect a bunch of - * fences, such as used by struct dma_resv. When using a seqlock, - * the seqlock must be taken before and checked after a reference to the - * fence is acquired (as shown here). - * - * The caller is required to hold the RCU read lock. - */ -static inline struct dma_fence * -dma_fence_get_rcu_safe(struct dma_fence __rcu **fencep) -{ - do { - struct dma_fence *fence; - - fence = rcu_dereference(*fencep); - if (!fence) - return NULL; - - if (!dma_fence_get_rcu(fence)) - continue; - - /* The atomic_inc_not_zero() inside dma_fence_get_rcu() - * provides a full memory barrier upon success (such as now). - * This is paired with the write barrier from assigning - * to the __rcu protected fence pointer so that if that - * pointer still matches the current fence, we know we - * have successfully acquire a reference to it. If it no - * longer matches, we are holding a reference to some other - * reallocated pointer. This is possible if the allocator - * is using a freelist like SLAB_TYPESAFE_BY_RCU where the - * fence remains valid for the RCU grace period, but it - * may be reallocated. When using such allocators, we are - * responsible for ensuring the reference we get is to - * the right fence, as below. - */ - if (fence == rcu_access_pointer(*fencep)) - return rcu_pointer_handoff(fence); - - dma_fence_put(fence); - } while (1); -} - #ifdef CONFIG_LOCKDEP bool dma_fence_begin_signalling(void); void dma_fence_end_signalling(bool cookie); diff --git a/include/linux/dma-resv.h b/include/linux/dma-resv.h index 562b885cf9c3d..a38c021f379af 100644 --- a/include/linux/dma-resv.h +++ b/include/linux/dma-resv.h @@ -248,7 +248,9 @@ dma_resv_get_excl_unlocked(struct dma_resv *obj) return NULL;
rcu_read_lock(); - fence = dma_fence_get_rcu_safe(&obj->fence_excl); + fence = rcu_dereference(obj->fence_excl); + if (fence) + fence = dma_fence_get_rcu(fence); rcu_read_unlock();
return fence;
On 09.06.21 at 23:29, Jason Ekstrand wrote:
That's an outright NAK.
The loop in dma_fence_get_rcu_safe is necessary because the underlying fence object can be replaced while taking the reference.
This is completely unrelated to SLAB_TYPESAFE_BY_RCU. See the dma_fence_chain usage for reference.
What you can remove is the sequence number handling in dma-buf. That should make adding fences quite a bit quicker.
Regards, Christian.
On Thu, Jun 10, 2021 at 1:51 AM Christian König christian.koenig@amd.com wrote:
Right. I had missed a bit of that when I first read through it. I see the need for the loop now. But there are some other tricky bits in there besides just the loop.
I'll look at that and try to understand what's going on there.
--Jason
On Thu, Jun 10, 2021 at 3:59 PM Jason Ekstrand jason@jlekstrand.net wrote:
I thought that's what the kref_get_unless_zero was for in dma_fence_get_rcu? Otherwise I guess I'm not seeing why we still have dma_fence_get_rcu around, since that should either be a kref_get or it's just unsafe to call it ...
Hm, I thought the seqlock was to make sure we have a consistent set of fences across the exclusive and all shared slots. Not to protect against the fence disappearing due to typesafe_by_rcu. -Daniel
On Thu, Jun 10, 2021 at 10:13 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
AFAICT, dma_fence_get_rcu is unsafe unless you somehow know that it's your fence and it's never recycled.
Where the loop comes in is if you have someone come along, under the RCU write lock or not, and swap out the pointer and unref it while you're trying to fetch it. In this case, if you just write the three lines I duplicated throughout this patch, you'll end up with NULL if you (partially) lose the race. The loop exists to ensure that you get either the old pointer or the new pointer and you only ever get NULL if somewhere during the mess, the pointer actually gets set to NULL.
I agree with Christian that that part of dma_fence_get_rcu_safe needs to stay. I was missing that until I did my giant "let's walk through the code" e-mail.
--Jason
On Thu, Jun 10, 2021 at 6:24 PM Jason Ekstrand jason@jlekstrand.net wrote:
It's not that easy. At least not for dma_resv.
The thing is, you can't just go in and replace the write fence with something else. There's supposed to be some ordering here (how much we actually still follow that is a bit of another question, one I'm trying to answer with an audit of lots of drivers), which means if you replace e.g. the exclusive fence, the previous fence will _not_ just get freed. Because the next exclusive fence needs to wait for that to finish first.
Conceptually the refcount will _only_ go to 0 once all later dependencies have seen it get signalled, and once the fence itself has been signalled. A signalled fence might as well not exist, so if that's what happened in that tiny window, then yes a legal scenario is the following:
thread A: - rcu_dereference(resv->exclusive_fence);
thread B:
- dma_fence signals, retires, drops refcount to 0
- sets the exclusive fence to NULL
- creates a new dma_fence
- sets the exclusive fence to that new fence
thread A: - kref_get_unless_zero fails, we report that the exclusive fence slot is NULL
Ofc normally we're fully pipelined, and we lazily clear slots, so no one ever writes the fence ptr to NULL. But conceptually it's totally fine, and an indistinguishable sequence of events from the point of view of thread A.
Ergo dma_fence_get_rcu is enough. If it's not, we've screwed up really big time. The only reason you need _unsafe is if you have typesafe_by_rcu, or maybe if you yolo your fence ordering a bit much and break the DAG property in a few cases.
Well if I'm wrong there's a _ton_ of broken code in upstream right now, even in dma-buf/dma-resv.c. We're using dma_fence_get_rcu a lot.
Also the timing is all backwards: get_rcu_safe was added as a fix for when i915 made its dma_fence typesafe_by_rcu. We didn't have any need for this beforehand. So I'm really not quite buying this story here that you're all trying to sell me on. -Daniel
On Thu, Jun 10, 2021 at 11:38 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
How is reporting that the exclusive fence is NULL ok in that scenario? If someone comes along and calls dma_resv_get_excl_fence(), we want them to get either the old fence or the new fence but never NULL. NULL would imply that the object is idle which it probably isn't in any sort of pipelined world.
Yup. 19 times. What I'm trying to understand is how much of that code depends on properly catching a pointer-switch race and how much is ok with a NULL failure mode. This trybot seems to imply that most things are ok with the NULL failure mode:
https://patchwork.freedesktop.org/series/91267/
Of course, as we discussed on IRC, I'm not sure how much I trust proof-by-trybot here. :-)
Yeah, that's really concerning. It's possible that many of the uses of get_rcu_safe were added because someone had recycle bugs and others were added because of pointer chase bugs and people weren't entirely clear on which.
--Jason
On Thu, Jun 10, 2021 at 11:52:23AM -0500, Jason Ekstrand wrote:
The thing is, the kref_get_unless_zero _only_ fails when the object could have been idle meanwhile with its exclusive fence slot NULL.
Maybe no one wrote that NULL, but from thread A's pov there's no difference between those. Therefore returning NULL in that case is totally fine.
It is _not_ possible for that kref_get_unless_zero to fail while the fence isn't signalled yet.
I think we might need to go through this on irc a bit ... -Daniel
On 10.06.21 at 18:37, Daniel Vetter wrote:
I think that's the point where it breaks.
See, IIRC radeon for example doesn't keep unsignaled fences around when nobody is interested in them. And I think nouveau does it that way as well.
So for example you can have the following:
1. Submission to the 3D ring, this creates fence A.
2. Fence A is put as an exclusive fence in a dma_resv object.
3. Submission to the 3D ring, this creates fence B.
4. Fence B replaces fence A as the exclusive fence in the dma_resv object.
Fence A is replaced and therefore destroyed while it is not even close to being signaled. But the replacement is perfectly ok, since fence B is submitted to the same ring.
If somebody were to use dma_fence_get_rcu on the exclusive fence and got NULL, they would fail to wait for the submissions. You don't really need SLAB_TYPESAFE_BY_RCU for this to blow up in your face.
We could change that rule of course, amdgpu for example is always keeping fences around until they are signaled. But IIRC that's how it was for radeon like forever.
Regards, Christian.
On Thu, Jun 10, 2021 at 06:54:13PM +0200, Christian König wrote:
Uh that's wild ...
I thought that was impossible, but in dma_fence_release() we only complain if there are both waiters and the fence isn't signalled yet. I had no idea.
Yeah I think we could, but then we need to do a few things:
- document that de facto only get_rcu_safe is ok to use
- delete get_rcu, it's not really a safe thing to do anywhere
-Daniel
Hi Jason,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on drm-tip/drm-tip drm-exynos/exynos-drm-next tegra-drm/drm/tegra/for-next linus/master v5.13-rc6 next-20210616]
[cannot apply to drm/drm-next]
[If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Jason-Ekstrand/dma-fence-i915-Stop-...
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
config: i386-randconfig-s001-20210615 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.3-341-g8af24329-dirty
        # https://github.com/0day-ci/linux/commit/d718e3dba487fc068d793f6220ac2508c98d...
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Jason-Ekstrand/dma-fence-i915-Stop-allowing-SLAB_TYPESAFE_BY_RCU-for-dma_fence/20210616-154432
        git checkout d718e3dba487fc068d793f6220ac2508c98d0eef
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' W=1 ARCH=i386
If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot lkp@intel.com
sparse warnings: (new ones prefixed by >>)

   drivers/gpu/drm/i915/gem/i915_gem_execbuffer.c: note: in included file:
>> include/drm/drm_syncobj.h:110:50: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct dma_fence *fence @@ got struct dma_fence [noderef] __rcu *fence @@
   include/drm/drm_syncobj.h:110:50: sparse:     expected struct dma_fence *fence
   include/drm/drm_syncobj.h:110:50: sparse:     got struct dma_fence [noderef] __rcu *fence
>> include/drm/drm_syncobj.h:110:50: sparse: sparse: incorrect type in argument 1 (different address spaces) @@ expected struct dma_fence *fence @@ got struct dma_fence [noderef] __rcu *fence @@
   include/drm/drm_syncobj.h:110:50: sparse:     expected struct dma_fence *fence
   include/drm/drm_syncobj.h:110:50: sparse:     got struct dma_fence [noderef] __rcu *fence
vim +110 include/drm/drm_syncobj.h
90 91 /** 92 * drm_syncobj_fence_get - get a reference to a fence in a sync object 93 * @syncobj: sync object. 94 * 95 * This acquires additional reference to &drm_syncobj.fence contained in @obj, 96 * if not NULL. It is illegal to call this without already holding a reference. 97 * No locks required. 98 * 99 * Returns: 100 * Either the fence of @obj or NULL if there's none. 101 */ 102 static inline struct dma_fence * 103 drm_syncobj_fence_get(struct drm_syncobj *syncobj) 104 { 105 struct dma_fence *fence; 106 107 rcu_read_lock(); 108 fence = rcu_dereference(syncobj->fence); 109 if (fence)
110 fence = dma_fence_get_rcu(syncobj->fence);
111 rcu_read_unlock(); 112 113 return fence; 114 } 115
--- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
On 09/06/2021 22:29, Jason Ekstrand wrote:
I don't think the part about leaking is true...
...because dma_fence_get_rcu_safe appears to be about whether the *pointer* to the fence itself is rcu protected, not about the fence object itself.
If one has a stable pointer to a fence, dma_fence_get_rcu is, I think, enough to deal with the SLAB_TYPESAFE_BY_RCU used by i915_request (as dma_fence is a base object there). Unless you found a bug in rq field recycling. But access to the dma fence is all tightly controlled, so I don't get what leaks.
According to the rationale behind SLAB_TYPESAFE_BY_RCU, traditional RCU freeing can be a lot more costly, so I think we need a clear justification for why this change is being considered.
Regards,
Tvrtko
On 10.06.21 at 11:29, Tvrtko Ursulin wrote:
Yes, exactly that.
The problem is that SLAB_TYPESAFE_BY_RCU requires that we use a sequence counter to make sure that we don't grab the reference to a reallocated dma_fence.
Updating the sequence counter every time we add a fence now means two additional writes and one additional barrier for an extremely hot path. The extra overhead of RCU freeing is completely negligible compared to that.
The good news is that I think if we are just a bit more clever about our handling we can both avoid the sequence counter and keep SLAB_TYPESAFE_BY_RCU around.
But this needs more code cleanup and abstracting the sequence counter usage in a macro.
Regards, Christian.
On Thu, Jun 10, 2021 at 11:39 AM Christian König christian.koenig@amd.com wrote:
We do leak, and badly. Any __rcu protected fence pointer where a shared fence could show up is affected. And the point of dma_fence is that they're shareable, and we're inventing ever more ways to do so (sync_file, drm_syncobj, implicit fencing maybe soon with import/export ioctl on top, in/out fences in CS ioctl, atomic ioctl, ...).
So without a full audit anything that uses the following pattern is probably busted:
rcu_read_lock();
fence = rcu_dereference();
fence = dma_fence_get_rcu();
rcu_read_unlock();
/* use the fence now that we acquired a full reference */
And I don't mean "you might wait a bit too much" busted, but "this can lead to loops in the dma_fence dependency chain, resulting in deadlocks" kind of busted. What's worse, the standard rcu lockless access pattern is also busted completely:
rcu_read_lock();
fence = rcu_dereference();
/* locklessly check the state of fence */
rcu_read_unlock();
because once you have TYPESAFE_BY_RCU rcu_read_lock doesn't prevent a use-after-free anymore. The only thing it guarantees is that your fence pointer keeps pointing at either freed memory, or a fence, but nothing else. You have to wrap your rcu_dereference and code into a seqlock of some kind, either a real one like dma_resv, or an open-coded one like dma_fence_get_rcu_safe uses. And yes the latter is a specialized seqlock, except it fails to properly document in comments where all the required barriers are.
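Something like the following is the shape of it (a hand-rolled sketch only, loosely modelled on dma_resv's read side; struct obj and its fields are invented for illustration):

	#include <linux/dma-fence.h>
	#include <linux/seqlock.h>

	struct obj {
		seqcount_t seq;			/* bumped by writers around every fence update */
		struct dma_fence __rcu *fence;
	};

	static struct dma_fence *obj_get_fence(struct obj *o)
	{
		struct dma_fence *fence;
		unsigned int seq;

	retry:
		seq = read_seqcount_begin(&o->seq);

		rcu_read_lock();
		fence = rcu_dereference(o->fence);
		if (fence && !dma_fence_get_rcu(fence))
			fence = NULL;
		rcu_read_unlock();

		if (read_seqcount_retry(&o->seq, seq)) {
			/* A writer raced with us: drop whatever we got and retry. */
			dma_fence_put(fence);
			goto retry;
		}

		return fence;
	}

Writers would wrap the rcu_assign_pointer() of the new fence in write_seqcount_begin()/write_seqcount_end() on the same seqcount.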
tldr; all the code using dma_fence_get_rcu needs to be assumed to be broken.
Heck this is fragile and tricky enough that i915 shot its own leg off routinely (there's a bugfix floating around just now), so not even internally we're very good at getting this right.
You still need a seqlock, or something else that's serving as your seqlock. A dma_fence_list behind a single __rcu protected pointer, with all subsequent fence pointers _not_ being rcu protected (i.e. full references, with a new list allocated on every change) might work. Which is a very funny way of implementing something like a seqlock.
And that only covers dma_resv, you _have_ to do this _everywhere_ in every driver. Except if you can proof that your __rcu fence pointer only ever points at your own driver's fences.
So unless you're volunteering to audit all the drivers, and constantly re-audit them (because rcu only guaranteeing type-safety but not actually preventing use-after-free is very unusual in the kernel) just fixing dma_resv doesn't solve the problem here at all.
But this needs more code cleanup and abstracting the sequence counter usage in a macro.
The other thing is that this doesn't even make sense for i915 anymore. The solution to the "userspace wants to submit bazillion requests" problem is direct userspace submit. Current hw doesn't have userspace ringbuffer, but we have a pretty clever trick in the works to make this possible with current hw, essentially by submitting a CS that loops on itself, and then inserting batches into this "ring" by latching a conditional branch in this CS. It's not pretty, but it gets the job done and outright removes the need for plaid mode throughput of i915_request dma fences. -Daniel
On Thu, Jun 10, 2021 at 1:29 PM Daniel Vetter daniel.vetter@ffwll.ch wrote:
To put it another way: I'm the guy who reviewed the patch which started this entire TYPESAFE_BY_RCU mess we got ourselves into:
commit 0eafec6d3244802d469712682b0f513963c23eff Author: Chris Wilson chris@chris-wilson.co.uk Date: Thu Aug 4 16:32:41 2016 +0100
drm/i915: Enable lockless lookup of request tracking via RCU
...
Signed-off-by: Chris Wilson chris@chris-wilson.co.uk
Cc: Maarten Lankhorst maarten.lankhorst@linux.intel.com
Cc: "Goel, Akash" akash.goel@intel.com
Cc: Josh Triplett josh@joshtriplett.org
Cc: Daniel Vetter daniel.vetter@ffwll.ch
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
Link: http://patchwork.freedesktop.org/patch/msgid/1470324762-2545-25-git-send-ema...
Looking back this was a mistake. The innocently labelled DESTROY_BY_RCU tricked me real bad, and we never had any real-world use-case to justify all the danger this brought not just to i915, but to any driver using __rcu protected dma_fence access. It's not worth it. -Daniel
On 10/06/2021 12:29, Daniel Vetter wrote:
What do you mean by _probably_ busted? This should either fail in kref_get_unless_zero for freed fences, or it grabs the wrong fence, which dma_fence_get_rcu_safe() is supposed to detect.
I don't have the story on dma_fence dependency chain deadlocks. Maybe put a few more words about that in the cover letter since it would be good to understand the real motivation behind the change.
Are there even bugs about deadlocks which should be mentioned?
Or why doesn't the act of a fence being signaled remove the fence from the RCU-protected containers, preventing the stale pointer problem?
My understanding is that lockless access, i.e. with no reference taken, can access a different fence or a freed fence, but won't cause a use-after-free when under the RCU lock.
As long as dma fence users are going through the API entry points individual drivers should be able to handle things correctly.
Again, I think it can't be freed memory inside the RCU lock section. It can only be a re-allocated or unused object.
Overall it looks like there is some complication involving the interaction between RCU-protected pointers and SLAB_TYPESAFE_BY_RCU, rather than a simple statement that i915 leaked/broke something for other drivers. And there is the challenge of auditing drivers to make them all use dma_fence_get_rcu_safe() when dealing with such storage.
To be clear I don't mind simplifications in principle as long as the problem statement is accurate.
And some benchmarks definitely need to be run here. At least that was the usual thing in the past when such large changes were being proposed.
Regards,
Tvrtko
On Thu, Jun 10, 2021 at 6:30 AM Daniel Vetter daniel.vetter@ffwll.ch wrote:
The fact that both of you think this either means that I've completely missed what's going on with RCUs here (possible but, in this case, I think unlikely) or RCUs on dma fences should scare us all. Yes, it protects against races on the dma_fence pointer itself. However, whether or not that dma_fence pointer lives in RCU-protected memory is immaterial AFAICT. It also does magic to deal with SLAB_TYPESAFE_BY_RCU. Let's walk through it. Please tell me if/where I go off the rails.
First, let's set the scenario: The race this is protecting us against (I think) is where someone else comes along and swaps out the pointer we're trying to fetch for NULL or a different one and then drops the last reference.
First, before we get to dma_fence_get_rcu_safe(), the caller has taken an RCU read lock. Then we get into the function
fence = rcu_dereference(*fencep);
if (!fence)
	return NULL;
First, we dereference fencep and grab the pointer. There's an rcu_dereference() here which does the usual RCU magic (which I don't fully understand yet) to turn an __rcu pointer into a "real" pointer. It's possible that the pointer is NULL, if so we bail. We may have lost the race or it could be that the pointer was NULL the whole time. Doesn't matter.
if (!dma_fence_get_rcu(fence))
	continue;
This attempts to get a reference and, if it fails, continues. More on the continue later. For now, let's dive into dma_fence_get_rcu()
if (kref_get_unless_zero(&fence->refcount))
	return fence;
else
	return NULL;
So we try to get a reference unless it's zero. This is a pretty standard pattern and, if the dma_fence was freed with kfree_rcu(), would be all we need. If the reference count on the dma_fence drops to 0 and then the dma_fence is freed with kfree_rcu, we're guaranteed that there is an RCU grace period between when the reference count hits 0 and the memory is reclaimed. Since all this happens inside the RCU read lock, if we raced with someone attempting to swap out the pointer and drop the reference count to zero, we have one of two cases:
1. We get the old pointer but successfully take a reference. In this case, it's the same as if we were called a few cycles earlier and straight-up won the race. We get the old pointer and, because we now have a reference, the object is never freed.
2. We get the old pointer but refcount is already zero by the time we get here. In this case, kref_get_unless_zero() returns false and dma_fence_get_rcu() returns NULL.
If these were the only two cases we cared about, all of dma_fence_get_rcu_safe() could be implemented as follows:
static inline struct dma_fence *
dma_fence_get_rcu_safe(struct dma_fence **fencep)
{
	struct dma_fence *fence;

	fence = rcu_dereference(*fencep);
	if (fence)
		fence = dma_fence_get_rcu(fence);

	return fence;
}
and we'd be done. The case the above code doesn't handle is if the thing we're racing with swaps it to a non-NULL pointer. To handle that case, we throw a loop around the whole thing as follows:
static inline struct dma_fence *
dma_fence_get_rcu_safe(struct dma_fence **fencep)
{
	struct dma_fence *fence;

	do {
		fence = rcu_dereference(*fencep);
		if (!fence)
			return NULL;

		fence = dma_fence_get_rcu(fence);
	} while (!fence);

	return fence;
}
Ok, great, we've got an implementation, right? Unfortunately, this is where SLAB_TYPESAFE_BY_RCU crashes the party. The giant disclaimer about SLAB_TYPESAFE_BY_RCU is that memory gets recycled immediately and doesn't wait for an RCU grace period. You're guaranteed that memory exists at that pointer so you won't get a nasty SEGFAULT and you're guaranteed that the memory is still a dma_fence, but you're not guaranteed anything else. In particular, there's a 3rd case:
3. We get an old pointer but it's been recycled and points to a totally different dma_fence whose reference count is non-zero. In this case, rcu_dereference returns non-null and kref_get_unless_zero() succeeds but we still managed to end up with the wrong fence.
To deal with 3, we do this:
        /* The atomic_inc_not_zero() inside dma_fence_get_rcu()
         * provides a full memory barrier upon success (such as now).
         * This is paired with the write barrier from assigning
         * to the __rcu protected fence pointer so that if that
         * pointer still matches the current fence, we know we
         * have successfully acquire a reference to it. If it no
         * longer matches, we are holding a reference to some other
         * reallocated pointer. This is possible if the allocator
         * is using a freelist like SLAB_TYPESAFE_BY_RCU where the
         * fence remains valid for the RCU grace period, but it
         * may be reallocated. When using such allocators, we are
         * responsible for ensuring the reference we get is to
         * the right fence, as below.
         */
        if (fence == rcu_access_pointer(*fencep))
                return rcu_pointer_handoff(fence);

        dma_fence_put(fence);
We dereference fencep one more time and check to ensure that the pointer we fetched at the start still matches. There are some serious memory barrier tricks going on here. In particular, we're depending on the fact that kref_get_unless_zero() does an atomic which means a memory barrier between when the other thread we're racing with swapped out the pointer and when the atomic happened. Assuming that the other thread swapped out the pointer BEFORE dropping the reference, we can detect the recycle race with this pointer check. If this last check succeeds, we return the fence. If it fails, then we ended up with the wrong dma_fence and we drop the reference we acquired above and try again.
Again, the important issue here that causes problems is that there's no RCU grace period between the kref hitting zero and the dma_fence being recycled. If a dma_fence is freed with kfree_rcu(), we have such a grace period and it's fine. If we recycle without one, we can end up in all sorts of weird corners if we're not careful to ensure that the fence we got is the fence we think we got.
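As a point of reference, here's a minimal sketch of the "safe" free path that paragraph assumes; struct my_fence and my_fence_release() are made-up names (the release hook would be wired up as .release in the driver's dma_fence_ops), but dma_fence_free() really is just kfree_rcu() on the fence:

struct my_fence {
        struct dma_fence base;
        /* driver-private state ... */
};

static void my_fence_release(struct dma_fence *fence)
{
        struct my_fence *f = container_of(fence, struct my_fence, base);

        /* driver-specific teardown of f goes here ... */

        /* dma_fence_free() is kfree_rcu(fence, rcu) under the hood, so a
         * full RCU grace period separates refcount == 0 from memory reuse. */
        dma_fence_free(&f->base);
}

With a free path like that, the kref_get_unless_zero() above really is all a reader under rcu_read_lock() needs.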
Before I move on, there's one more important point: This can happen without SLAB_TYPESAFE_BY_RCU. Really, any dma_fence recycling scheme which doesn't ensure an RCU grace period between the kref hitting zero and the recycle will run afoul of this. SLAB_TYPESAFE_BY_RCU just happens to be the way i915 gets into this mess.
Yup.
Yeah, this one's broken too. It depends on what you're doing with that state just how busted and what that breakage costs you but it's definitely busted.
We're already trying to handle this with cleverness as described above. But, as Daniel said and I put in some commit message, we're probably only doing it in about 1/3 of the places we need to be.
I'm not sure I'd go that far. Yes, we've got the ULLS hack but i915_request is going to stay around for a while. What's really overblown here is the bazillions of requests. GL drivers submit tens or maybe 100ish batches per frame. Media has to ping-pong a bit more but it should still be < 1000/second. If we're really dma_fence_release-bound, we're in a microbenchmark.
--Jason
On Thu, Jun 10, 2021 at 8:35 AM Jason Ekstrand jason@jlekstrand.net wrote:
Taking a step back for a second and ignoring SLAB_TYPESAFE_BY_RCU as such, I'd like to ask a slightly different question: What are the rules about what is allowed to be done under the RCU read lock and what guarantees does a driver need to provide?
I think so far that we've all agreed on the following:
1. Freeing an unsignaled fence is ok as long as it doesn't have any pending callbacks. (Callbacks should hold a reference anyway).
2. The pointer race solved by dma_fence_get_rcu_safe is real and requires the loop to sort out.
But let's say I have a dma_fence pointer that I got from, say, calling dma_resv_excl_fence() under rcu_read_lock(). What am I allowed to do with it under the RCU lock? What assumptions can I make? Is this code, for instance, ok?
        rcu_read_lock();
        fence = dma_resv_excl_fence(obj);
        idle = !fence || test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags);
        rcu_read_unlock();
This code very much looks correct under the following assumptions:
1. A valid fence pointer stays alive under the RCU read lock
2. SIGNALED_BIT is set-once (it's never unset after being set).
However, if it were correct, we wouldn't have dma_resv_test_signaled(), now would we? :-)
The moment you introduce ANY dma_fence recycling that recycles a dma_fence within a single RCU grace period, all your assumptions break down. SLAB_TYPESAFE_BY_RCU is just one way that i915 does this. We also have a little i915_request recycler to try and help with memory pressure scenarios in certain critical sections that also doesn't respect RCU grace periods. And, as mentioned multiple times, our recycling leaks into every other driver because, thanks to i915's choice, the above 4-line code snippet isn't valid ANYWHERE in the kernel.
So the question I'm raising isn't so much about the rules today. Today, we live in the wild wild west where everything is YOLO. But where do we want to go? Do we like this wild west world? Do we want more consistency under the RCU read lock? If so, what do we want the rules to be?
One option would be to accept the wild-west world we live in and say "The RCU read lock gains you nothing. If you want to touch the guts of a dma_fence, take a reference". But, at that point, we're eating two atomics for every time someone wants to look at a dma_fence. Do we want that?
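To make the cost concrete, that "always take a reference" option would turn the idle check above into roughly the following (a sketch only; fence_excl is dma_resv's exclusive-fence slot, and the get/put pair is the two atomics in question):

static bool obj_is_idle(struct dma_resv *resv)
{
        struct dma_fence *fence;
        bool idle = true;

        rcu_read_lock();
        fence = dma_fence_get_rcu_safe(&resv->fence_excl);
        rcu_read_unlock();

        if (fence) {
                idle = test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags);
                dma_fence_put(fence);   /* second atomic */
        }

        return idle;
}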
Alternatively, and this is what I think Daniel and I were trying to propose here, we place some constraints on dma_fence recycling. Specifically, under the RCU read lock, the fence doesn't suddenly become a new fence. All of the immutability and once-mutability guarantees of various bits of dma_fence hold as long as you have the RCU read lock.
--Jason
On Thu, Jun 10, 2021 at 10:10 PM Jason Ekstrand jason@jlekstrand.net wrote:
Yeah this is suboptimal. Too many potential bugs, not enough benefits.
This entire __rcu business started so that there would be a lockless way to get at fences, or at least the exclusive one. That did not really pan out. I think we have a few options:
- drop the idea of rcu/lockless dma-fence access outright. A quick sequence of grabbing the lock, acquiring the dma_fence and then dropping your lock again is probably plenty good. There's a lot of call_rcu and other stuff we could probably delete. I have no idea what the perf impact across all the drivers would be.
- try to make all drivers follow some stricter rules. The trouble is that at least with radeon dma_fence callbacks aren't even very reliable (that's why it has its own dma_fence_wait implementation), so things are wobbly anyway.
- live with the current situation, but radically delete all unsafe interfaces. I.e. nothing is allowed to directly deref an rcu fence pointer, everything goes through dma_fence_get_rcu_safe. The kref_get_unless_zero would become an internal implementation detail. Our "fast" and "lockless" dma_resv fence access stays a pile of seqlock, retry loop, and a conditional atomic inc + atomic dec. The only thing that's slightly faster would be dma_resv_test_signaled().
- I guess minimally we should rename dma_fence_get_rcu to dma_fence_tryget. It has nothing to do with rcu really, and the use is very, very limited.
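For concreteness, a sketch of what that last option might look like; the body is just today's dma_fence_get_rcu() under the rcu-free name, nothing more is implied:

static inline struct dma_fence *dma_fence_tryget(struct dma_fence *fence)
{
        /* try-get: succeeds only if the fence is still referenced */
        if (kref_get_unless_zero(&fence->refcount))
                return fence;

        return NULL;
}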
Not sure what's a good idea here tbh. -Daniel
On 10.06.21 at 22:42, Daniel Vetter wrote:
The question is maybe not the perf impact, but rather if that is possible over all.
IIRC we now have some cases in TTM where RCU is mandatory and we simply don't have any other choice than using it.
I think what we should do is to use RCU internally in the dma_resv object but disallow drivers/frameworks to mess with that directly.
In other words drivers should use one of the following:

1. dma_resv_wait_timeout()
2. dma_resv_test_signaled()
3. dma_resv_copy_fences()
4. dma_resv_get_fences()
5. dma_resv_for_each_fence() <- to be implemented
6. dma_resv_for_each_fence_unlocked() <- to be implemented
Inside those functions we then make sure that we only use safe ways of accessing the RCU protected data structures.
This way we only need to make sure that those accessor functions are sane and don't need to audit every driver individually.
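As an illustration of the driver-facing side of this (a sketch assuming the accessors keep their current two-argument shape; obj_is_busy is an invented name), the open-coded idle check from earlier in the thread would shrink to a single call:

static bool obj_is_busy(struct dma_resv *resv)
{
        /* all rcu/locking details stay inside dma_resv */
        return !dma_resv_test_signaled(resv, true /* test_all */);
}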
I can tackle implementing dma_resv_for_each_fence()/_unlocked(). Already got a large bunch of that coded out anyway.
Regards, Christian.
On Fri, Jun 11, 2021 at 8:55 AM Christian König christian.koenig@amd.com wrote:
Adding Thomas Hellstrom.
Where is that stuff? If we end up with all the dma_resv locking complexity just for an oddball, then I think that would be rather big bummer.
Yeah, better encapsulation for dma_resv sounds like a good thing, at least for all the other issues we've been discussing recently. I guess your list is also missing the various "add/replace some more fences" functions, but we have them already.
I can tackle implementing dma_resv_for_each_fence()/_unlocked(). Already got a large bunch of that coded out anyway.
When/where do we need to iterate over fences unlocked? Given how much pain it is to get a consistent snapshot of the fences or fence state (I've read the dma-buf poll implementation, and it looks a bit buggy in that regard, but not sure, just as an example), an unlocked iterator sounds very dangerous to me. -Daniel
On 11.06.21 at 09:20, Daniel Vetter wrote:
This is during buffer destruction. See the call to dma_resv_copy_fences().
But that is basically just using a dma_resv function which accesses the object without taking a lock.
This is to make implementation of the other functions easier. Currently they basically each roll their own loop implementation which at least for dma_resv_test_signaled() looks a bit questionable to me.
In addition to those, we have one more case in i915 and the unlocked polling implementation, which I agree is a bit questionable as well.
My idea is to have the problematic logic in the iterator and only give back fences which hold a reference and are 100% sure to be the right one.
Probably best if I send some code around to explain what I mean.
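Very roughly, something along these lines (the function name is invented and the real iterator would also need the seqcount retry for when the fence list is swapped mid-walk); the point is that all the rcu/retry/recycle handling stays inside the accessor and the caller only ever sees a referenced fence:

static struct dma_fence *
resv_shared_fence_get_unlocked(struct dma_resv *resv, unsigned int i)
{
        struct dma_resv_list *list;
        struct dma_fence *fence = NULL;

        rcu_read_lock();
        list = rcu_dereference(resv->fence);
        if (list && i < list->shared_count)
                /* NULL/retry/recycle handling happens in here */
                fence = dma_fence_get_rcu_safe(&list->shared[i]);
        rcu_read_unlock();

        /* NULL or a referenced fence; caller does dma_fence_put() */
        return fence;
}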
Regards, Christian.
On Fri, Jun 11, 2021 at 09:42:07AM +0200, Christian König wrote:
Ok yeah that's tricky.
The way we solved this in i915 is with a trylock and punting to a worker queue if the trylock fails. And the worker queue would also be flushed from the shrinker (once we get there at least).
So this looks fixable.
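A rough sketch of that trylock-plus-worker fallback, with invented object and helper names; dma_resv_trylock()/dma_resv_unlock() and queue_work() are the only real APIs used here:

struct my_obj {
        struct dma_resv *resv;
        struct work_struct free_work;   /* worker takes dma_resv_lock() properly */
};

static void obj_free(struct my_obj *obj)
{
        if (dma_resv_trylock(obj->resv)) {
                obj_free_locked(obj);           /* hypothetical helper */
                dma_resv_unlock(obj->resv);
        } else {
                /* the shrinker flushes this workqueue so reclaim still
                 * makes progress even when the trylock keeps failing */
                queue_work(system_unbound_wq, &obj->free_work);
        }
}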
But that is basically just using a dma_resv function which accesses the object without taking a lock.
The other one I've found is the ghost object, but that one is locked fully.
Yeah, the more I look at any of these lockless loop things the more I'm worried. 90% sure the one in dma_buf_poll is broken too.
My gut feeling is that we should just try and convert them all over to taking the dma_resv_lock. And if there is really a contention issue with that, then either try to shrink it, or make it an rwlock or similar. But the more I read through these implementations, the more bugs I see and the more questions I have.
Maybe at the end a few will be left over, and then we can look at these individually in detail. Like the ttm_bo_individualize_resv situation.
On 11.06.21 at 11:33, Daniel Vetter wrote:
That's what we already had done here as well, but the worker is exactly what we wanted to avoid by this.
So this looks fixable.
I'm not sure of that. We had really good reasons to remove the worker.
How about we abstract all that funny rcu dance inside the iterator instead?
I mean when we just have one walker function which is well documented and understood then the rest becomes relatively easy.
Christian.
On Fri, Jun 11, 2021 at 12:03:31PM +0200, Christian König wrote:
I've looked around, and I didn't see any huge changes around the delayed_delete work. There's lots of changes on how the lru is handled to optimize that.
And even today we still have the delayed deletion thing.
So essentially what I had in mind is that, instead of just calling ttm_bo_cleanup_refs(), you first check whether the resv is individualized already and, if not, do that first.
This means there's a slight delay when a bo is deleted between when the refcount drops, and when we actually individualize the fences.
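Very roughly, and with hypothetical naming since that check doesn't exist today, the delete path being sketched would look something like:

static void bo_delayed_delete(struct ttm_buffer_object *bo)
{
        /* hypothetical check; today individualization happens elsewhere */
        if (!bo_resv_is_individualized(bo))
                ttm_bo_individualize_resv(bo);  /* copy fences into the bo's own resv */

        /* ... then the existing ttm_bo_cleanup_refs() path runs as before,
         * just slightly later than the refcount actually hitting zero. */
}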
What was the commit that removed another worker here? -Daniel