A patch series for -next that removes a substantial number of atomic read-modify-write operations from TTM command submission, in particular when TTM base objects are used to export objects to user-space. The only per-object atomic r-m-w operations left during a typical execbuf call should be the refcount up and down.
Make the DRM hashtab RCU-safe, so that lookups can run locklessly under rcu_read_lock(). TTM base objects will be the first consumer.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 drivers/gpu/drm/drm_hashtab.c | 18 +++++++-----------
 1 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/drm_hashtab.c b/drivers/gpu/drm/drm_hashtab.c
index c3745c4..5729e39 100644
--- a/drivers/gpu/drm/drm_hashtab.c
+++ b/drivers/gpu/drm/drm_hashtab.c
@@ -67,10 +67,8 @@ void drm_ht_verbose_list(struct drm_open_hash *ht, unsigned long key)
 	hashed_key = hash_long(key, ht->order);
 	DRM_DEBUG("Key is 0x%08lx, Hashed key is 0x%08x\n", key, hashed_key);
 	h_list = &ht->table[hashed_key];
-	hlist_for_each(list, h_list) {
-		entry = hlist_entry(list, struct drm_hash_item, head);
+	hlist_for_each_entry_rcu(entry, list, h_list, head)
 		DRM_DEBUG("count %d, key: 0x%08lx\n", count++, entry->key);
-	}
 }

 static struct hlist_node *drm_ht_find_key(struct drm_open_hash *ht,
@@ -83,8 +81,7 @@ static struct hlist_node *drm_ht_find_key(struct drm_open_hash *ht,

 	hashed_key = hash_long(key, ht->order);
 	h_list = &ht->table[hashed_key];
-	hlist_for_each(list, h_list) {
-		entry = hlist_entry(list, struct drm_hash_item, head);
+	hlist_for_each_entry_rcu(entry, list, h_list, head) {
 		if (entry->key == key)
 			return list;
 		if (entry->key > key)
@@ -105,8 +102,7 @@ int drm_ht_insert_item(struct drm_open_hash *ht, struct drm_hash_item *item)
 	hashed_key = hash_long(key, ht->order);
 	h_list = &ht->table[hashed_key];
 	parent = NULL;
-	hlist_for_each(list, h_list) {
-		entry = hlist_entry(list, struct drm_hash_item, head);
+	hlist_for_each_entry_rcu(entry, list, h_list, head) {
 		if (entry->key == key)
 			return -EINVAL;
 		if (entry->key > key)
@@ -114,9 +110,9 @@ int drm_ht_insert_item(struct drm_open_hash *ht, struct drm_hash_item *item)
 		parent = list;
 	}
 	if (parent) {
-		hlist_add_after(parent, &item->head);
+		hlist_add_after_rcu(parent, &item->head);
 	} else {
-		hlist_add_head(&item->head, h_list);
+		hlist_add_head_rcu(&item->head, h_list);
 	}
 	return 0;
 }
@@ -171,7 +167,7 @@ int drm_ht_remove_key(struct drm_open_hash *ht, unsigned long key)

 	list = drm_ht_find_key(ht, key);
 	if (list) {
-		hlist_del_init(list);
+		hlist_del_init_rcu(list);
 		return 0;
 	}
 	return -EINVAL;
@@ -179,7 +175,7 @@ int drm_ht_remove_key(struct drm_open_hash *ht, unsigned long key)

 int drm_ht_remove_item(struct drm_open_hash *ht, struct drm_hash_item *item)
 {
-	hlist_del_init(&item->head);
+	hlist_del_init_rcu(&item->head);
 	return 0;
 }
 EXPORT_SYMBOL(drm_ht_remove_item);
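For readers unfamiliar with the pattern: the conversion above only makes the traversal and list-manipulation primitives RCU-safe. A lockless lookup additionally needs rcu_read_lock() on the reader side and a deferred free on the writer side. A minimal sketch of that usage follows (the struct and function names are illustrative, not part of this patch; it uses the 4-argument hlist_for_each_entry_rcu() of this kernel era):

#include <linux/rculist.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct obj {
	struct hlist_node head;
	unsigned long key;
	struct rcu_head rcu;
};

/* Reader: traversal is safe against concurrent add/del; no lock taken. */
static bool obj_exists(struct hlist_head *bucket, unsigned long key)
{
	struct obj *entry;
	struct hlist_node *pos;
	bool found = false;

	rcu_read_lock();
	hlist_for_each_entry_rcu(entry, pos, bucket, head) {
		if (entry->key == key) {
			found = true;
			break;
		}
	}
	rcu_read_unlock();
	return found;
}

/* Writer: updaters still serialize among themselves with a lock, and the
 * free is deferred until all pre-existing readers are done. */
static void obj_remove(spinlock_t *lock, struct obj *entry)
{
	spin_lock(lock);
	hlist_del_init_rcu(&entry->head);
	spin_unlock(lock);
	kfree_rcu(entry, rcu);	/* frees after a grace period */
}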
This function is intended to simplify locking around refcounting for objects that can be looked up from a lookup structure, and which are removed from that lookup structure in the object destructor. Operations on such objects require at least a read lock around lookup + kref_get, and a write lock around kref_put + remove from lookup structure. Furthermore, RCU implementations become extremely tricky. With a lookup followed by a kref_get_unless_zero *with return value check*, locking in the kref_put path can be deferred to the actual removal from the lookup structure, and RCU lookups become trivial.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 include/linux/kref.h | 21 +++++++++++++++++++++
 1 files changed, 21 insertions(+), 0 deletions(-)
diff --git a/include/linux/kref.h b/include/linux/kref.h
index 65af688..fd16042 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -111,4 +111,25 @@ static inline int kref_put_mutex(struct kref *kref,
 	}
 	return 0;
 }
+
+/**
+ * kref_get_unless_zero - Increment refcount for object unless it is zero.
+ * @kref: object.
+ *
+ * Return 0 if the increment succeeded. Otherwise return non-zero.
+
+ * This function is intended to simplify locking around refcounting for
+ * objects that can be looked up from a lookup structure, and which are
+ * removed from that lookup structure in the object destructor.
+ * Operations on such objects require at least a read lock around
+ * lookup + kref_get, and a write lock around kref_put + remove from lookup
+ * structure. Furthermore, RCU implementations become extremely tricky.
+ * With a lookup followed by a kref_get_unless_zero *with return value check*
+ * locking in the kref_put path can be deferred to the actual removal from
+ * the lookup structure and RCU lookups become trivial.
+ */
+static inline int __must_check kref_get_unless_zero(struct kref *kref)
+{
+	return !atomic_add_unless(&kref->refcount, 1, 0);
+}
 #endif /* _KREF_H_ */
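To illustrate how a caller is meant to combine this with an RCU lookup, here is a sketch with made-up names (obj, obj_find, table_lock — none of this is kernel API). Note that in this version of the patch, 0 means the get succeeded:

#include <linux/kref.h>
#include <linux/rculist.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct obj {
	struct hlist_node head;
	struct kref kref;
	struct rcu_head rcu;
};

static DEFINE_SPINLOCK(table_lock);	/* serializes writers only */

struct obj *obj_find(unsigned long key);	/* hypothetical RCU-safe hash search */

/* Lookup path: no lock; only trust the object if the conditional get succeeds. */
struct obj *obj_lookup_get(unsigned long key)
{
	struct obj *entry;
	int ret = -EINVAL;

	rcu_read_lock();
	entry = obj_find(key);
	if (entry)
		ret = kref_get_unless_zero(&entry->kref);
	rcu_read_unlock();

	/* In this series, 0 means the increment succeeded. */
	return (ret == 0) ? entry : NULL;
}

/* Put path: locking is deferred to the actual removal from the table. */
static void obj_release(struct kref *kref)
{
	struct obj *entry = container_of(kref, struct obj, kref);

	spin_lock(&table_lock);
	hlist_del_init_rcu(&entry->head);
	spin_unlock(&table_lock);
	kfree_rcu(entry, rcu);	/* free after all current RCU readers finish */
}

void obj_put(struct obj *entry)
{
	kref_put(&entry->kref, obj_release);
}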
The most frequently used lookup+get and put+potential_destroy paths of TTM base objects are converted to use RCU. This will substantially decrease the number of locked bus cycles during normal operation. Since we use kfree_rcu to free the objects, no RCU synchronization is needed at module unload time.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 drivers/gpu/drm/ttm/ttm_object.c         | 30 +++++++++++-------------------
 drivers/gpu/drm/vmwgfx/vmwgfx_resource.c |  8 ++++----
 include/drm/ttm/ttm_object.h             |  4 ++++
 include/linux/kref.h                     |  2 +-
 4 files changed, 20 insertions(+), 24 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_object.c b/drivers/gpu/drm/ttm/ttm_object.c
index c785787..9d7f674 100644
--- a/drivers/gpu/drm/ttm/ttm_object.c
+++ b/drivers/gpu/drm/ttm/ttm_object.c
@@ -80,7 +80,7 @@ struct ttm_object_file {
  */

 struct ttm_object_device {
-	rwlock_t object_lock;
+	spinlock_t object_lock;
 	struct drm_open_hash object_hash;
 	atomic_t object_count;
 	struct ttm_mem_global *mem_glob;
@@ -157,12 +157,12 @@ int ttm_base_object_init(struct ttm_object_file *tfile,
 	base->refcount_release = refcount_release;
 	base->ref_obj_release = ref_obj_release;
 	base->object_type = object_type;
-	write_lock(&tdev->object_lock);
+	spin_lock(&tdev->object_lock);
 	kref_init(&base->refcount);
 	ret = drm_ht_just_insert_please(&tdev->object_hash,
 					&base->hash,
 					(unsigned long)base, 31, 0, 0);
-	write_unlock(&tdev->object_lock);
+	spin_unlock(&tdev->object_lock);
 	if (unlikely(ret != 0))
 		goto out_err0;

@@ -186,30 +186,22 @@ static void ttm_release_base(struct kref *kref)
 	    container_of(kref, struct ttm_base_object, refcount);
 	struct ttm_object_device *tdev = base->tfile->tdev;

+	spin_lock(&tdev->object_lock);
 	(void)drm_ht_remove_item(&tdev->object_hash, &base->hash);
-	write_unlock(&tdev->object_lock);
+	spin_unlock(&tdev->object_lock);
 	if (base->refcount_release) {
 		ttm_object_file_unref(&base->tfile);
 		base->refcount_release(&base);
 	}
-	write_lock(&tdev->object_lock);
 }

 void ttm_base_object_unref(struct ttm_base_object **p_base)
 {
 	struct ttm_base_object *base = *p_base;
-	struct ttm_object_device *tdev = base->tfile->tdev;

 	*p_base = NULL;

-	/*
-	 * Need to take the lock here to avoid racing with
-	 * users trying to look up the object.
-	 */
-
-	write_lock(&tdev->object_lock);
 	kref_put(&base->refcount, ttm_release_base);
-	write_unlock(&tdev->object_lock);
 }
 EXPORT_SYMBOL(ttm_base_object_unref);

@@ -221,14 +213,14 @@ struct ttm_base_object *ttm_base_object_lookup(struct ttm_object_file *tfile,
 	struct drm_hash_item *hash;
 	int ret;

-	read_lock(&tdev->object_lock);
+	rcu_read_lock();
 	ret = drm_ht_find_item(&tdev->object_hash, key, &hash);

 	if (likely(ret == 0)) {
 		base = drm_hash_entry(hash, struct ttm_base_object, hash);
-		kref_get(&base->refcount);
+		ret = kref_get_unless_zero(&base->refcount);
 	}
-	read_unlock(&tdev->object_lock);
+	rcu_read_unlock();

 	if (unlikely(ret != 0))
 		return NULL;
@@ -426,7 +418,7 @@ struct ttm_object_device *ttm_object_device_init(struct ttm_mem_global
 		return NULL;

 	tdev->mem_glob = mem_glob;
-	rwlock_init(&tdev->object_lock);
+	spin_lock_init(&tdev->object_lock);
 	atomic_set(&tdev->object_count, 0);
 	ret = drm_ht_create(&tdev->object_hash, hash_order);

@@ -444,9 +436,9 @@ void ttm_object_device_release(struct ttm_object_device **p_tdev)

 	*p_tdev = NULL;

-	write_lock(&tdev->object_lock);
+	spin_lock(&tdev->object_lock);
 	drm_ht_remove(&tdev->object_hash);
-	write_unlock(&tdev->object_lock);
+	spin_unlock(&tdev->object_lock);

 	kfree(tdev);
 }
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
index da3c6b5..ae675c6 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_resource.c
@@ -351,7 +351,7 @@ static void vmw_user_context_free(struct vmw_resource *res)
 	    container_of(res, struct vmw_user_context, res);
 	struct vmw_private *dev_priv = res->dev_priv;

-	kfree(ctx);
+	ttm_base_object_kfree(ctx, base);
 	ttm_mem_global_free(vmw_mem_glob(dev_priv),
 			    vmw_user_context_size);
 }
@@ -1147,7 +1147,7 @@ static void vmw_user_surface_free(struct vmw_resource *res)
 	kfree(srf->offsets);
 	kfree(srf->sizes);
 	kfree(srf->snooper.image);
-	kfree(user_srf);
+	ttm_base_object_kfree(user_srf, base);
 	ttm_mem_global_free(vmw_mem_glob(dev_priv), size);
 }

@@ -1575,7 +1575,7 @@ static void vmw_user_dmabuf_destroy(struct ttm_buffer_object *bo)
 {
 	struct vmw_user_dma_buffer *vmw_user_bo = vmw_user_dma_buffer(bo);

-	kfree(vmw_user_bo);
+	ttm_base_object_kfree(vmw_user_bo, base);
 }

 static void vmw_user_dmabuf_release(struct ttm_base_object **p_base)
@@ -1763,7 +1763,7 @@ static void vmw_user_stream_free(struct vmw_resource *res)
 	    container_of(res, struct vmw_user_stream, stream.res);
 	struct vmw_private *dev_priv = res->dev_priv;

-	kfree(stream);
+	ttm_base_object_kfree(stream, base);
 	ttm_mem_global_free(vmw_mem_glob(dev_priv),
 			    vmw_user_stream_size);
 }
diff --git a/include/drm/ttm/ttm_object.h b/include/drm/ttm/ttm_object.h
index b01c563..fc0cf06 100644
--- a/include/drm/ttm/ttm_object.h
+++ b/include/drm/ttm/ttm_object.h
@@ -40,6 +40,7 @@
 #include <linux/list.h>
 #include <drm/drm_hashtab.h>
 #include <linux/kref.h>
+#include <linux/rcupdate.h>
 #include <ttm/ttm_memory.h>

 /**
@@ -120,6 +121,7 @@ struct ttm_object_device;
  */

 struct ttm_base_object {
+	struct rcu_head rhead;
 	struct drm_hash_item hash;
 	enum ttm_object_type object_type;
 	bool shareable;
@@ -268,4 +270,6 @@ extern struct ttm_object_device *ttm_object_device_init

 extern void ttm_object_device_release(struct ttm_object_device **p_tdev);

+#define ttm_base_object_kfree(__object, __base)\
+	kfree_rcu(__object, __base.rhead)
 #endif
diff --git a/include/linux/kref.h b/include/linux/kref.h
index fd16042..bae91d0 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -117,7 +117,7 @@ static inline int kref_put_mutex(struct kref *kref,
  * @kref: object.
  *
  * Return 0 if the increment succeeded. Otherwise return non-zero.
-
+ *
  * This function is intended to simplify locking around refcounting for
  * objects that can be looked up from a lookup structure, and which are
  * removed from that lookup structure in the object destructor.
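The new ttm_base_object_kfree() macro simply routes the containing object's free through kfree_rcu(), via the rcu_head embedded in its base object. Roughly, for a hypothetical wrapper (vmw_thing is a made-up name, not from the patch):

struct vmw_thing {
	struct ttm_base_object base;
	/* driver-private state ... */
};

static void vmw_thing_free(struct vmw_thing *thing)
{
	/* ttm_base_object_kfree(thing, base) expands to: */
	kfree_rcu(thing, base.rhead);
	/* so the memory is only returned once concurrent RCU lookups that
	 * may still dereference the hash entry have finished. */
}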
On 11/05/2012 02:31 PM, Thomas Hellstrom wrote:
Hmm, this patch looks bad. I'll respin it.

/Thomas
Reservation locking currently always takes place under the LRU spinlock. Hence, strictly speaking, there is no need for an atomic_cmpxchg call; we can use atomic_read followed by atomic_set, since nobody else will ever reserve without the LRU spinlock held. At least on Intel, this should remove a locked bus cycle on a successful reserve.
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 drivers/gpu/drm/ttm/ttm_bo.c | 3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bf6e4b5..46008ea 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -220,7 +220,7 @@ int ttm_bo_reserve_locked(struct ttm_buffer_object *bo,
 	struct ttm_bo_global *glob = bo->glob;
 	int ret;

-	while (unlikely(atomic_cmpxchg(&bo->reserved, 0, 1) != 0)) {
+	while (unlikely(atomic_read(&bo->reserved) != 0)) {
 		/**
 		 * Deadlock avoidance for multi-bo reserving.
 		 */
@@ -249,6 +249,7 @@ int ttm_bo_reserve_locked(struct ttm_buffer_object *bo,
 		return ret;
 	}

+	atomic_set(&bo->reserved, 1);
 	if (use_sequence) {
 		/**
 		 * Wake up waiters that may need to recheck for deadlock,
 		 */
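In other words, the LRU spinlock already serializes all reservers, so a plain read-then-set cannot race. A simplified sketch of the fastpath after this patch (illustrative only, not the actual TTM code; the wait and deadlock-avoidance paths are omitted):

static int reserve_sketch(struct ttm_buffer_object *bo,
			  struct ttm_bo_global *glob)
{
	spin_lock(&glob->lru_lock);
	if (atomic_read(&bo->reserved) != 0) {
		/* Real code waits or runs deadlock avoidance here. */
		spin_unlock(&glob->lru_lock);
		return -EBUSY;
	}
	/* No cmpxchg needed: every reserver holds lru_lock, so nobody can
	 * flip ->reserved between the read above and this store. */
	atomic_set(&bo->reserved, 1);
	spin_unlock(&glob->lru_lock);
	return 0;
}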
Hey,
On 05-11-12 14:31, Thomas Hellstrom wrote:
Reservation locking currently always takes place under the LRU spinlock. Hence, strictly speaking, there is no need for an atomic_cmpxchg call; we can use atomic_read followed by atomic_set, since nobody else will ever reserve without the LRU spinlock held. At least on Intel, this should remove a locked bus cycle on a successful reserve.
Is that really a good thing to submit when I am busy killing the LRU lock around reserve? :-)
-	while (unlikely(atomic_cmpxchg(&bo->reserved, 0, 1) != 0)) {
+	while (unlikely(atomic_xchg(&bo->reserved, 1) != 0)) {
Works without the LRU lock too!
In fact, mutexes are done in a similar way [1], except with some more magic, and the unlocked state is 1, not 0. However, I do think that to get this right (it saves an irq disable in the unlock path, and causes fewer wakeups in the contended case), I should really just post the mutex extension patches for reservations and ride the flames. It's getting too close to real mutexes, so I really want it to be a mutex in that case. So let's convert it.. Soon! :-)
~Maarten
[1] See linux/include/asm-generic/mutex-xchg.h and linux/include/asm-generic/mutex-dec.h for how archs generally implement mutex fastpaths.
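For context, the xchg-style fastpath referenced in [1], boiled down (an illustrative sketch, not the asm-generic code itself; real mutexes additionally track a "contended" state so the unlock path can wake waiters):

/* Trylock: unconditionally swap in "locked"; we own the lock iff the old
 * value was "unlocked". Unlike a read-then-set, this is safe without any
 * external serialization, which is why it works without the LRU lock. */
static inline bool trylock_xchg(atomic_t *lock)
{
	return atomic_xchg(lock, 1) == 0;
}

static inline void unlock_xchg(atomic_t *lock)
{
	/* xchg rather than a plain store, so a fuller implementation could
	 * observe a contended value here and wake waiters. */
	(void)atomic_xchg(lock, 0);
}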
On 11/05/2012 03:01 PM, Maarten Lankhorst wrote:
Is that really a good thing to submit when I am busy killing the LRU lock around reserve? :-)
If your patch series makes it into the same kernel, let's kill this patch. Otherwise it may live for at least a kernel release. It's not a big thing to rebase against, and I won't complain if your patch adds another atomic read-modify-write op here. :)
/Thomas