Hi Andrew,
Daniel suggested that I ping you once more about this.
Basically we want to add a barrier function to make sure that our TTM pool shrinker is not freeing up pages from a device while the device is being unplugged.
Currently we use a global mutex to serialize all of this, but that caused contention when unmapping the freed pages from the IOMMU.
We just need your Acked-by, and I hope my explanation is clearer this time than last.
Cheers, Christian.
From: Christian König ckoenig.leichtzumerken@gmail.com
While unplugging a device the TTM shrinker implementation needs a barrier to make sure that all concurrent shrink operations are done and no other CPU is referring to a device-specific pool anymore.
Taking and releasing the shrinker semaphore on the write side after unmapping and freeing all pages from the device pool should make sure that no shrinker is running in parallel.
This allows us to avoid the contended mutex in the TTM pool implementation for every alloc/free operation.
v2: rework the commit message to make clear why we need this
Signed-off-by: Christian König christian.koenig@amd.com
Acked-by: Huang Rui ray.huang@amd.com
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
---
 include/linux/shrinker.h |  1 +
 mm/vmscan.c              | 10 ++++++++++
 2 files changed, 11 insertions(+)
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9814fff58a69..1de17f53cdbc 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -93,4 +93,5 @@ extern void register_shrinker_prepared(struct shrinker *shrinker);
 extern int register_shrinker(struct shrinker *shrinker);
 extern void unregister_shrinker(struct shrinker *shrinker);
 extern void free_prealloced_shrinker(struct shrinker *shrinker);
+extern void sync_shrinkers(void);
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4620df62f0ff..fde1aabcfa7f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -638,6 +638,16 @@ void unregister_shrinker(struct shrinker *shrinker)
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
+/**
+ * sync_shrinker - Wait for all running shrinkers to complete.
+ */
+void sync_shrinkers(void)
+{
+	down_write(&shrinker_rwsem);
+	up_write(&shrinker_rwsem);
+}
+EXPORT_SYMBOL(sync_shrinkers);
+
 #define SHRINK_BATCH 128
 
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
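[Editor's note: for context on how the barrier is meant to be used on the driver side, here is a minimal sketch of a per-device pool teardown. The struct and helper names (example_pool, example_pool_unlink, example_pool_free_all) are illustrative only and are not taken from the TTM patches.]

#include <linux/shrinker.h>

/* Illustrative only: a per-device page pool with its own teardown path. */
struct example_pool;

static void example_pool_unlink(struct example_pool *pool);   /* drop pool from the global list */
static void example_pool_free_all(struct example_pool *pool); /* unmap and free all cached pages */

static void example_pool_fini(struct example_pool *pool)
{
	/* Stop new shrink operations from finding this pool. */
	example_pool_unlink(pool);

	/* Unmap and free everything still cached for the device. */
	example_pool_free_all(pool);

	/*
	 * Barrier: a shrinker invocation that started before the unlink may
	 * still hold a pointer to this pool. sync_shrinkers() only returns
	 * once all such invocations have finished, so the device (and its
	 * IOMMU mappings) can go away safely.
	 */
	sync_shrinkers();
}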
On Fri, 20 Aug 2021 14:05:27 +0200 "Christian König" ckoenig.leichtzumerken@gmail.com wrote:
While unplugging a device the TTM shrinker implementation needs a barrier to make sure that all concurrent shrink operations are done and no other CPU is referring to a device-specific pool anymore.
Taking and releasing the shrinker semaphore on the write side after unmapping and freeing all pages from the device pool should make sure that no shrinker is running in parallel.
This allows us to avoid the contended mutex in the TTM pool implementation for every alloc/free operation.
v2: rework the commit message to make clear why we need this
Acked-by: Andrew Morton akpm@linux-foundation.org
On Fri, Aug 20, 2021 at 02:05:27PM +0200, Christian König wrote:
From: Christian König ckoenig.leichtzumerken@gmail.com
While unplugging a device the TTM shrinker implementation needs a barrier to make sure that all concurrent shrink operations are done and no other CPU is referring to a device-specific pool anymore.
Taking and releasing the shrinker semaphore on the write side after unmapping and freeing all pages from the device pool should make sure that no shrinker is running in parallel.
This allows us to avoid the contended mutex in the TTM pool implementation for every alloc/free operation.
v2: rework the commit message to make clear why we need this
Signed-off-by: Christian König christian.koenig@amd.com
Acked-by: Huang Rui ray.huang@amd.com
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch

 include/linux/shrinker.h |  1 +
 mm/vmscan.c              | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9814fff58a69..1de17f53cdbc 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -93,4 +93,5 @@ extern void register_shrinker_prepared(struct shrinker *shrinker);
 extern int register_shrinker(struct shrinker *shrinker);
 extern void unregister_shrinker(struct shrinker *shrinker);
 extern void free_prealloced_shrinker(struct shrinker *shrinker);
+extern void sync_shrinkers(void);
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4620df62f0ff..fde1aabcfa7f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -638,6 +638,16 @@ void unregister_shrinker(struct shrinker *shrinker)
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
+/**
+ * sync_shrinker - Wait for all running shrinkers to complete.
I think it would be good to add a bit more text here, maybe:
"This is equivalent to calling unregister_shrinker() and register_shrinker(), but atomically and with less overhead. This is useful to guarantee that all shrinker invocations have seen an update before freeing memory, similar to rcu."
Also a bit of a bikeshed, but the equivalent in irq land is called synchronize_irq() and synchronize_hardirq(). I think it'd be good to follow that naming for more conceptual consistency. -Daniel
+ */
+void sync_shrinkers(void)
+{
+	down_write(&shrinker_rwsem);
+	up_write(&shrinker_rwsem);
+}
+EXPORT_SYMBOL(sync_shrinkers);
+
 #define SHRINK_BATCH 128
 
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
2.25.1
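[Editor's note: to make the barrier argument explicit — shrinker invocations run with shrinker_rwsem held for read (the real shrink_slab() takes it with a read-side trylock), so cycling the write side cannot return before every in-flight invocation has dropped the read lock. The following is a simplified model of that interplay, not the actual vmscan code.]

#include <linux/rwsem.h>

/* Simplified model of why an empty write-side lock/unlock acts as a barrier. */
static DECLARE_RWSEM(shrinker_rwsem);

static void shrink_slab_model(void)
{
	if (!down_read_trylock(&shrinker_rwsem))
		return;
	/*
	 * ...walk shrinker_list and call each shrinker's scan callback,
	 * which may still be touching a device-specific pool...
	 */
	up_read(&shrinker_rwsem);
}

void sync_shrinkers(void)
{
	/*
	 * down_write() waits until all readers above have released the
	 * semaphore; releasing it right away makes this a pure barrier
	 * with no lasting exclusion.
	 */
	down_write(&shrinker_rwsem);
	up_write(&shrinker_rwsem);
}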
On Thu, Aug 26, 2021 at 03:27:30PM +0200, Daniel Vetter wrote:
On Fri, Aug 20, 2021 at 02:05:27PM +0200, Christian König wrote:
From: Christian König ckoenig.leichtzumerken@gmail.com
While unplugging a device the TTM shrinker implementation needs a barrier to make sure that all concurrent shrink operations are done and no other CPU is referring to a device-specific pool anymore.
Taking and releasing the shrinker semaphore on the write side after unmapping and freeing all pages from the device pool should make sure that no shrinker is running in parallel.
This allows us to avoid the contended mutex in the TTM pool implementation for every alloc/free operation.
v2: rework the commit message to make clear why we need this
Signed-off-by: Christian König christian.koenig@amd.com
Acked-by: Huang Rui ray.huang@amd.com
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch

 include/linux/shrinker.h |  1 +
 mm/vmscan.c              | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9814fff58a69..1de17f53cdbc 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -93,4 +93,5 @@ extern void register_shrinker_prepared(struct shrinker *shrinker);
 extern int register_shrinker(struct shrinker *shrinker);
 extern void unregister_shrinker(struct shrinker *shrinker);
 extern void free_prealloced_shrinker(struct shrinker *shrinker);
+extern void sync_shrinkers(void);
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4620df62f0ff..fde1aabcfa7f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -638,6 +638,16 @@ void unregister_shrinker(struct shrinker *shrinker)
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
+/**
+ * sync_shrinker - Wait for all running shrinkers to complete.
I think it would be good to add a bit more text here, maybe:
"This is equivalent to calling unregister_shrinker() and register_shrinker(), but atomically and with less overhead. This is useful to guarantee that all shrinker invocations have seen an update before freeing memory, similar to rcu."
Also a bit of a bikeshed, but the equivalent in irq land is called synchronize_irq() and synchronize_hardirq(). I think it'd be good to follow that naming for more conceptual consistency.
Oh, synchronize_*rcu* also spells it out in full, so even more reason to do the same. -Daniel
+ */
+void sync_shrinkers(void)
+{
+	down_write(&shrinker_rwsem);
+	up_write(&shrinker_rwsem);
+}
+EXPORT_SYMBOL(sync_shrinkers);
+
 #define SHRINK_BATCH 128
 
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
2.25.1
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Am 26.08.21 um 15:28 schrieb Daniel Vetter:
On Thu, Aug 26, 2021 at 03:27:30PM +0200, Daniel Vetter wrote:
On Fri, Aug 20, 2021 at 02:05:27PM +0200, Christian König wrote:
From: Christian König ckoenig.leichtzumerken@gmail.com
While unplugging a device the TTM shrinker implementation needs a barrier to make sure that all concurrent shrink operations are done and no other CPU is referring to a device-specific pool anymore.
Taking and releasing the shrinker semaphore on the write side after unmapping and freeing all pages from the device pool should make sure that no shrinker is running in parallel.
This allows us to avoid the contended mutex in the TTM pool implementation for every alloc/free operation.
v2: rework the commit message to make clear why we need this
Signed-off-by: Christian König christian.koenig@amd.com
Acked-by: Huang Rui ray.huang@amd.com
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch

 include/linux/shrinker.h |  1 +
 mm/vmscan.c              | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9814fff58a69..1de17f53cdbc 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -93,4 +93,5 @@ extern void register_shrinker_prepared(struct shrinker *shrinker);
 extern int register_shrinker(struct shrinker *shrinker);
 extern void unregister_shrinker(struct shrinker *shrinker);
 extern void free_prealloced_shrinker(struct shrinker *shrinker);
+extern void sync_shrinkers(void);
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4620df62f0ff..fde1aabcfa7f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -638,6 +638,16 @@ void unregister_shrinker(struct shrinker *shrinker)
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
+/**
+ * sync_shrinker - Wait for all running shrinkers to complete.
I think it would be good to add a bit more text here, maybe:
"This is equivalent to calling unregister_shrinker() and register_shrinker(), but atomically and with less overhead. This is useful to guarantee that all shrinker invocations have seen an update before freeing memory, similar to rcu."
Also a bit of a bikeshed, but the equivalent in irq land is called synchronize_irq() and synchronize_hardirq(). I think it'd be good to follow that naming for more conceptual consistency.
Oh, synchronize_*rcu* also spells it out in full, so even more reason to do the same.
I will just go with the explanation above.
The synchronize_rcu() explanation is so extensive that most people will probably stop reading after the first paragraph.
Thanks, Christian.
-Daniel
+ */
+void sync_shrinkers(void)
+{
+	down_write(&shrinker_rwsem);
+	up_write(&shrinker_rwsem);
+}
+EXPORT_SYMBOL(sync_shrinkers);
+
 #define SHRINK_BATCH 128
 
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
-- 2.25.1
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On Thu, Aug 26, 2021 at 04:58:06PM +0200, Christian König wrote:
Am 26.08.21 um 15:28 schrieb Daniel Vetter:
On Thu, Aug 26, 2021 at 03:27:30PM +0200, Daniel Vetter wrote:
On Fri, Aug 20, 2021 at 02:05:27PM +0200, Christian König wrote:
From: Christian König ckoenig.leichtzumerken@gmail.com
While unplugging a device the TTM shrinker implementation needs a barrier to make sure that all concurrent shrink operations are done and no other CPU is referring to a device-specific pool anymore.
Taking and releasing the shrinker semaphore on the write side after unmapping and freeing all pages from the device pool should make sure that no shrinker is running in parallel.
This allows us to avoid the contended mutex in the TTM pool implementation for every alloc/free operation.
v2: rework the commit message to make clear why we need this
Signed-off-by: Christian König christian.koenig@amd.com
Acked-by: Huang Rui ray.huang@amd.com
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch

 include/linux/shrinker.h |  1 +
 mm/vmscan.c              | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9814fff58a69..1de17f53cdbc 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -93,4 +93,5 @@ extern void register_shrinker_prepared(struct shrinker *shrinker);
 extern int register_shrinker(struct shrinker *shrinker);
 extern void unregister_shrinker(struct shrinker *shrinker);
 extern void free_prealloced_shrinker(struct shrinker *shrinker);
+extern void sync_shrinkers(void);
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4620df62f0ff..fde1aabcfa7f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -638,6 +638,16 @@ void unregister_shrinker(struct shrinker *shrinker)
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
+/**
+ * sync_shrinker - Wait for all running shrinkers to complete.
I think it would be good to add a bit more text here, maybe:
"This is equivalent to calling unregister_shrinker() and register_shrinker(), but atomically and with less overhead. This is useful to guarantee that all shrinker invocations have seen an update before freeing memory, similar to rcu."
Also a bit of a bikeshed, but the equivalent in irq land is called synchronize_irq() and synchronize_hardirq(). I think it'd be good to follow that naming for more conceptual consistency.
Oh, synchronize_*rcu* also spells it out in full, so even more reason to do the same.
I will just go with the explanation above.
The synchronize_rcu() explanation is so extensive that most people will probably stop reading after the first paragraph.
Ack, my comment was only about the function name (spelled out instead of abbreviated), not about pulling the entire kerneldoc in from these. -Daniel
Thanks, Christian.
-Daniel
+ */
+void sync_shrinkers(void)
+{
+	down_write(&shrinker_rwsem);
+	up_write(&shrinker_rwsem);
+}
+EXPORT_SYMBOL(sync_shrinkers);
+
 #define SHRINK_BATCH 128
 
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
-- 2.25.1
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Am 26.08.21 um 17:34 schrieb Daniel Vetter:
On Thu, Aug 26, 2021 at 04:58:06PM +0200, Christian König wrote:
Am 26.08.21 um 15:28 schrieb Daniel Vetter:
On Thu, Aug 26, 2021 at 03:27:30PM +0200, Daniel Vetter wrote:
On Fri, Aug 20, 2021 at 02:05:27PM +0200, Christian König wrote:
From: Christian König ckoenig.leichtzumerken@gmail.com
While unplugging a device the TTM shrinker implementation needs a barrier to make sure that all concurrent shrink operations are done and no other CPU is referring to a device-specific pool anymore.
Taking and releasing the shrinker semaphore on the write side after unmapping and freeing all pages from the device pool should make sure that no shrinker is running in parallel.
This allows us to avoid the contended mutex in the TTM pool implementation for every alloc/free operation.
v2: rework the commit message to make clear why we need this
Signed-off-by: Christian König christian.koenig@amd.com
Acked-by: Huang Rui ray.huang@amd.com
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch

 include/linux/shrinker.h |  1 +
 mm/vmscan.c              | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 9814fff58a69..1de17f53cdbc 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -93,4 +93,5 @@ extern void register_shrinker_prepared(struct shrinker *shrinker);
 extern int register_shrinker(struct shrinker *shrinker);
 extern void unregister_shrinker(struct shrinker *shrinker);
 extern void free_prealloced_shrinker(struct shrinker *shrinker);
+extern void sync_shrinkers(void);
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4620df62f0ff..fde1aabcfa7f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -638,6 +638,16 @@ void unregister_shrinker(struct shrinker *shrinker)
 }
 EXPORT_SYMBOL(unregister_shrinker);
 
+/**
+ * sync_shrinker - Wait for all running shrinkers to complete.
I think it would be good to add a bit more text here, maybe:
"This is equivalent to calling unregister_shrinker() and register_shrinker(), but atomically and with less overhead. This is useful to guarantee that all shrinker invocations have seen an update before freeing memory, similar to rcu."
Also a bit of a bikeshed, but the equivalent in irq land is called synchronize_irq() and synchronize_hardirq(). I think it'd be good to follow that naming for more conceptual consistency.
Oh, synchronize_*rcu* also spells it out in full, so even more reason to do the same.
I will just go with the explanation above.
The synchronize_rcu() explanation is so extensive that most people will probably stop reading after the first paragraph.
Ack, my comment was only about the function name (spelled out instead of abbreviated), not about pulling the entire kerneldoc in from these.
Ah, good point. Going to change that as well.
Christian.
-Daniel
Thanks, Christian.
-Daniel
+ */
+void sync_shrinkers(void)
+{
+	down_write(&shrinker_rwsem);
+	up_write(&shrinker_rwsem);
+}
+EXPORT_SYMBOL(sync_shrinkers);
+
 #define SHRINK_BATCH 128
 
 static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
-- 2.25.1
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
Switch back to using a spinlock by moving the IOMMU unmap outside of the locked region.
This avoids contention especially while freeing pages.
v2: Add a comment explaining why we need sync_shrinkers().
Signed-off-by: Christian König christian.koenig@amd.com
Acked-by: Huang Rui ray.huang@amd.com
Reviewed-by: Daniel Vetter daniel.vetter@ffwll.ch
---
 drivers/gpu/drm/ttm/ttm_pool.c | 40 +++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 18 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
index cb38b1a17b09..7d4f76d4141d 100644
--- a/drivers/gpu/drm/ttm/ttm_pool.c
+++ b/drivers/gpu/drm/ttm/ttm_pool.c
@@ -70,7 +70,7 @@ static struct ttm_pool_type global_uncached[MAX_ORDER];
 static struct ttm_pool_type global_dma32_write_combined[MAX_ORDER];
 static struct ttm_pool_type global_dma32_uncached[MAX_ORDER];
 
-static struct mutex shrinker_lock;
+static spinlock_t shrinker_lock;
 static struct list_head shrinker_list;
 static struct shrinker mm_shrinker;
 
@@ -263,9 +263,9 @@ static void ttm_pool_type_init(struct ttm_pool_type *pt, struct ttm_pool *pool,
 	spin_lock_init(&pt->lock);
 	INIT_LIST_HEAD(&pt->pages);
 
-	mutex_lock(&shrinker_lock);
+	spin_lock(&shrinker_lock);
 	list_add_tail(&pt->shrinker_list, &shrinker_list);
-	mutex_unlock(&shrinker_lock);
+	spin_unlock(&shrinker_lock);
 }
 
 /* Remove a pool_type from the global shrinker list and free all pages */
@@ -273,9 +273,9 @@ static void ttm_pool_type_fini(struct ttm_pool_type *pt)
 {
 	struct page *p;
 
-	mutex_lock(&shrinker_lock);
+	spin_lock(&shrinker_lock);
 	list_del(&pt->shrinker_list);
-	mutex_unlock(&shrinker_lock);
+	spin_unlock(&shrinker_lock);
 
 	while ((p = ttm_pool_type_take(pt)))
 		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p);
@@ -313,24 +313,23 @@ static struct ttm_pool_type *ttm_pool_select_type(struct ttm_pool *pool,
 static unsigned int ttm_pool_shrink(void)
 {
 	struct ttm_pool_type *pt;
-	unsigned int num_freed;
+	unsigned int num_pages;
 	struct page *p;
 
-	mutex_lock(&shrinker_lock);
+	spin_lock(&shrinker_lock);
 	pt = list_first_entry(&shrinker_list, typeof(*pt), shrinker_list);
+	list_move_tail(&pt->shrinker_list, &shrinker_list);
+	spin_unlock(&shrinker_lock);
 
 	p = ttm_pool_type_take(pt);
 	if (p) {
 		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p);
-		num_freed = 1 << pt->order;
+		num_pages = 1 << pt->order;
 	} else {
-		num_freed = 0;
+		num_pages = 0;
 	}
 
-	list_move_tail(&pt->shrinker_list, &shrinker_list);
-	mutex_unlock(&shrinker_lock);
-
-	return num_freed;
+	return num_pages;
 }
 
 /* Return the allocation order based for a page */
@@ -530,6 +529,11 @@ void ttm_pool_fini(struct ttm_pool *pool)
 		for (j = 0; j < MAX_ORDER; ++j)
 			ttm_pool_type_fini(&pool->caching[i].orders[j]);
 	}
+
+	/* We removed the pool types from the LRU, but we need to also make sure
+	 * that no shrinker is concurrently freeing pages from the pool.
+	 */
+	sync_shrinkers();
 }
 
 /* As long as pages are available make sure to release at least one */
@@ -604,7 +608,7 @@ static int ttm_pool_debugfs_globals_show(struct seq_file *m, void *data)
 {
 	ttm_pool_debugfs_header(m);
 
-	mutex_lock(&shrinker_lock);
+	spin_lock(&shrinker_lock);
 	seq_puts(m, "wc\t:");
 	ttm_pool_debugfs_orders(global_write_combined, m);
 	seq_puts(m, "uc\t:");
@@ -613,7 +617,7 @@ static int ttm_pool_debugfs_globals_show(struct seq_file *m, void *data)
 	ttm_pool_debugfs_orders(global_dma32_write_combined, m);
 	seq_puts(m, "uc 32\t:");
 	ttm_pool_debugfs_orders(global_dma32_uncached, m);
-	mutex_unlock(&shrinker_lock);
+	spin_unlock(&shrinker_lock);
 
 	ttm_pool_debugfs_footer(m);
 
@@ -640,7 +644,7 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
 
 	ttm_pool_debugfs_header(m);
 
-	mutex_lock(&shrinker_lock);
+	spin_lock(&shrinker_lock);
 	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i) {
 		seq_puts(m, "DMA ");
 		switch (i) {
@@ -656,7 +660,7 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
 		}
 		ttm_pool_debugfs_orders(pool->caching[i].orders, m);
 	}
-	mutex_unlock(&shrinker_lock);
+	spin_unlock(&shrinker_lock);
 
 	ttm_pool_debugfs_footer(m);
 	return 0;
@@ -693,7 +697,7 @@ int ttm_pool_mgr_init(unsigned long num_pages)
 	if (!page_pool_size)
 		page_pool_size = num_pages;
 
-	mutex_init(&shrinker_lock);
+	spin_lock_init(&shrinker_lock);
 	INIT_LIST_HEAD(&shrinker_list);
 
 	for (i = 0; i < MAX_ORDER; ++i) {
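[Editor's note: the locking change boils down to a standard pattern — hold the spinlock only long enough to pick and rotate a pool type on the shrinker list, then drop it before the expensive ttm_pool_free_page() call (which does the IOMMU unmap). Read as plain code, the shrink path after this patch looks roughly like this.]

/* ttm_pool_shrink() as it reads after the patch above: the spinlock only
 * covers the list manipulation; freeing happens with the lock dropped.
 */
static unsigned int ttm_pool_shrink(void)
{
	struct ttm_pool_type *pt;
	unsigned int num_pages;
	struct page *p;

	spin_lock(&shrinker_lock);
	pt = list_first_entry(&shrinker_list, typeof(*pt), shrinker_list);
	list_move_tail(&pt->shrinker_list, &shrinker_list);
	spin_unlock(&shrinker_lock);

	p = ttm_pool_type_take(pt);
	if (p) {
		/* ttm_pool_free_page() may unmap the page from the IOMMU,
		 * which is why it must run outside shrinker_lock.
		 */
		ttm_pool_free_page(pt->pool, pt->caching, pt->order, p);
		num_pages = 1 << pt->order;
	} else {
		num_pages = 0;
	}

	return num_pages;
}

Racing shrinkers that pick the same pool type after dropping shrinker_lock are still serialized by the per-type pt->lock inside ttm_pool_type_take(). The flip side is that a shrinker can keep a pointer to a pool type after dropping shrinker_lock, which is exactly the window ttm_pool_fini() closes with sync_shrinkers() above.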