So as a followup, here are two patches. The first one just stops trying to move objects at each CS ioctl; I believe it could be included in 3.7 as it improves performance (especially with VRAM changes from userspace).
The second one implements a VRAM eviction policy. It's a simple one: buffers used for write operations are more important than buffers used for read operations. A buffer gets evicted from VRAM only if it hasn't been used in the last 50ms (so in the last few frames) and only if there are buffers that have been recently used and that could be moved into VRAM. This is mostly where I believe discussion should happen: what kind of heuristic would work better than that.
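To make the discussion concrete, the core decision reduces to something like this self-contained sketch (names and the millisecond clock are simplified stand-ins; the real code in patch 2 below works on jiffies and keeps separate read/write LRU lists, scanning the read list first because written buffers are considered more valuable):

#include <stdbool.h>

/* Simplified stand-in for struct radeon_bo: all we need here is the
 * last-use timestamp, in milliseconds for clarity. */
struct bo {
	unsigned long last_use_ms;	/* last time the bo appeared in a CS */
};

/* Evict 'out' of VRAM in favor of 'in' only if 'out' has been idle at
 * least 50ms longer than the buffer waiting to move in. */
static bool should_evict(const struct bo *out, const struct bo *in,
			 unsigned long now_ms)
{
	return (now_ms - out->last_use_ms) > (now_ms - in->last_use_ms) + 50;
}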
Without the first patch and with Mesa master, Xonotic on high is at 17fps; with the first patch it goes to 40fps, and with the second patch it goes to 48fps.
Cheers, Jerome
From: Jerome Glisse <jglisse@redhat.com>
The bo creation placement is where the bo will stay. Instead of trying to move the bo at each command stream, leave this work to another worker thread that will use a more advanced heuristic.
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/radeon/radeon.h        |  1 +
 drivers/gpu/drm/radeon/radeon_object.c | 17 ++++++++---------
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 8c42d54..0a2664c 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -313,6 +313,7 @@ struct radeon_bo {
 	struct list_head		list;
 	/* Protected by tbo.reserved */
 	u32				placements[3];
+	u32				busy_placements[3];
 	struct ttm_placement		placement;
 	struct ttm_buffer_object	tbo;
 	struct ttm_bo_kmap_obj		kmap;
diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
index 3f9f3bb..e25ae20 100644
--- a/drivers/gpu/drm/radeon/radeon_object.c
+++ b/drivers/gpu/drm/radeon/radeon_object.c
@@ -84,7 +84,6 @@ void radeon_ttm_placement_from_domain(struct radeon_bo *rbo, u32 domain)
 	rbo->placement.fpfn = 0;
 	rbo->placement.lpfn = 0;
 	rbo->placement.placement = rbo->placements;
-	rbo->placement.busy_placement = rbo->placements;
 	if (domain & RADEON_GEM_DOMAIN_VRAM)
 		rbo->placements[c++] = TTM_PL_FLAG_WC | TTM_PL_FLAG_UNCACHED |
 					TTM_PL_FLAG_VRAM;
@@ -105,6 +104,14 @@ void radeon_ttm_placement_from_domain(struct radeon_bo *rbo, u32 domain)
 	if (!c)
 		rbo->placements[c++] = TTM_PL_MASK_CACHING | TTM_PL_FLAG_SYSTEM;
 	rbo->placement.num_placement = c;
+
+	c = 0;
+	rbo->placement.busy_placement = rbo->busy_placements;
+	if (rbo->rdev->flags & RADEON_IS_AGP) {
+		rbo->busy_placements[c++] = TTM_PL_FLAG_WC | TTM_PL_FLAG_TT;
+	} else {
+		rbo->busy_placements[c++] = TTM_PL_FLAG_CACHED | TTM_PL_FLAG_TT;
+	}
 	rbo->placement.num_busy_placement = c;
 }
@@ -360,17 +367,9 @@ int radeon_bo_list_validate(struct list_head *head)
 	list_for_each_entry(lobj, head, tv.head) {
 		bo = lobj->bo;
 		if (!bo->pin_count) {
-			domain = lobj->wdomain ? lobj->wdomain : lobj->rdomain;
-
-		retry:
-			radeon_ttm_placement_from_domain(bo, domain);
 			r = ttm_bo_validate(&bo->tbo, &bo->placement,
 					true, false, false);
 			if (unlikely(r)) {
-				if (r != -ERESTARTSYS && domain == RADEON_GEM_DOMAIN_VRAM) {
-					domain |= RADEON_GEM_DOMAIN_GTT;
-					goto retry;
-				}
 				return r;
 			}
 		}
On Thu, Nov 29, 2012 at 10:35 AM, <j.glisse@gmail.com> wrote:
From: Jerome Glisse <jglisse@redhat.com>
The bo creation placement is where the bo will stay. Instead of trying to move the bo at each command stream, leave this work to another worker thread that will use a more advanced heuristic.
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
What about including this for 3.8? It will mostly fix all the performance regressions and is a first valid step toward proper bo placement.
Cheers, Jerome
On Mon, Dec 10, 2012 at 3:16 PM, Jerome Glisse <j.glisse@gmail.com> wrote:
On Thu, Nov 29, 2012 at 10:35 AM, <j.glisse@gmail.com> wrote:
From: Jerome Glisse <jglisse@redhat.com>
The bo creation placement is where the bo will stay. Instead of trying to move the bo at each command stream, leave this work to another worker thread that will use a more advanced heuristic.
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
What about including this for 3.8? It will mostly fix all the performance regressions and is a first valid step toward proper bo placement.
Looks good to me. I'll add it to my 3.8 tree unless there are any objections.
Alex
From: Jerome Glisse <jglisse@redhat.com>
Use a delayed work thread to move buffers out of VRAM if they haven't been used over some period of time. This allows making room for buffers that are actively used.
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/radeon/radeon.h        |  13 ++
 drivers/gpu/drm/radeon/radeon_cs.c     |   2 +-
 drivers/gpu/drm/radeon/radeon_device.c |   8 ++
 drivers/gpu/drm/radeon/radeon_object.c | 241 ++++++++++++++++++++++++++++++++-
 drivers/gpu/drm/radeon/radeon_object.h |   3 +-
 5 files changed, 262 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 0a2664c..a2e92da 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -102,6 +102,8 @@ extern int radeon_lockup_timeout;
  */
 #define RADEON_MAX_USEC_TIMEOUT			100000 /* 100 ms */
 #define RADEON_FENCE_JIFFIES_TIMEOUT		(HZ / 2)
+#define RADEON_PLACEMENT_WORK_MS		500
+#define RADEON_PLACEMENT_MAX_EVICTION		8
 /* RADEON_IB_POOL_SIZE must be a power of 2 */
 #define RADEON_IB_POOL_SIZE			16
 #define RADEON_DEBUGFS_MAX_COMPONENTS		32
@@ -311,6 +313,10 @@ struct radeon_bo_va {
 struct radeon_bo {
 	/* Protected by gem.mutex */
 	struct list_head		list;
+	/* Protected by rdev->placement_mutex */
+	struct list_head		plist;
+	struct list_head		*head;
+	unsigned long			last_use_jiffies;
 	/* Protected by tbo.reserved */
 	u32				placements[3];
 	u32				busy_placements[3];
@@ -1523,6 +1529,13 @@ struct radeon_device {
 	struct drm_device		*ddev;
 	struct pci_dev			*pdev;
 	struct rw_semaphore		exclusive_lock;
+	struct mutex			placement_mutex;
+	struct list_head		wvram_in_list;
+	struct list_head		rvram_in_list;
+	struct list_head		wvram_out_list;
+	struct list_head		rvram_out_list;
+	struct delayed_work		placement_work;
+	unsigned long			vram_in_size;
 	/* ASIC */
 	union radeon_asic_config	config;
 	enum radeon_family		family;
diff --git a/drivers/gpu/drm/radeon/radeon_cs.c b/drivers/gpu/drm/radeon/radeon_cs.c
index 41672cc..e9e90bc 100644
--- a/drivers/gpu/drm/radeon/radeon_cs.c
+++ b/drivers/gpu/drm/radeon/radeon_cs.c
@@ -88,7 +88,7 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
 		} else
 			p->relocs[i].handle = 0;
 	}
-	return radeon_bo_list_validate(&p->validated);
+	return radeon_bo_list_validate(p->rdev, &p->validated);
 }
 
 static int radeon_cs_get_ring(struct radeon_cs_parser *p, u32 ring, s32 priority)
diff --git a/drivers/gpu/drm/radeon/radeon_device.c b/drivers/gpu/drm/radeon/radeon_device.c
index e2f5f88..0c4c874 100644
--- a/drivers/gpu/drm/radeon/radeon_device.c
+++ b/drivers/gpu/drm/radeon/radeon_device.c
@@ -1001,6 +1001,14 @@ int radeon_device_init(struct radeon_device *rdev,
 	init_rwsem(&rdev->pm.mclk_lock);
 	init_rwsem(&rdev->exclusive_lock);
 	init_waitqueue_head(&rdev->irq.vblank_queue);
+
+	mutex_init(&rdev->placement_mutex);
+	INIT_LIST_HEAD(&rdev->wvram_in_list);
+	INIT_LIST_HEAD(&rdev->rvram_in_list);
+	INIT_LIST_HEAD(&rdev->wvram_out_list);
+	INIT_LIST_HEAD(&rdev->rvram_out_list);
+	INIT_DELAYED_WORK(&rdev->placement_work, radeon_placement_work_handler);
+
 	r = radeon_gem_init(rdev);
 	if (r)
 		return r;
diff --git a/drivers/gpu/drm/radeon/radeon_object.c b/drivers/gpu/drm/radeon/radeon_object.c
index e25ae20..f2bcc5f 100644
--- a/drivers/gpu/drm/radeon/radeon_object.c
+++ b/drivers/gpu/drm/radeon/radeon_object.c
@@ -64,6 +64,10 @@ static void radeon_ttm_bo_destroy(struct ttm_buffer_object *tbo)
 	mutex_lock(&bo->rdev->gem.mutex);
 	list_del_init(&bo->list);
 	mutex_unlock(&bo->rdev->gem.mutex);
+	mutex_lock(&bo->rdev->placement_mutex);
+	list_del_init(&bo->plist);
+	bo->head = NULL;
+	mutex_unlock(&bo->rdev->placement_mutex);
 	radeon_bo_clear_surface_reg(bo);
 	radeon_bo_clear_va(bo);
 	drm_gem_object_release(&bo->gem_base);
@@ -153,6 +157,8 @@ int radeon_bo_create(struct radeon_device *rdev,
 	bo->surface_reg = -1;
 	INIT_LIST_HEAD(&bo->list);
 	INIT_LIST_HEAD(&bo->va);
+	INIT_LIST_HEAD(&bo->plist);
+	bo->head = NULL;
 	radeon_ttm_placement_from_domain(bo, domain);
 	/* Kernel allocation are uninterruptible */
 	down_read(&rdev->pm.mclk_lock);
@@ -263,8 +269,14 @@ int radeon_bo_pin_restricted(struct radeon_bo *bo, u32 domain, u64 max_offset,
 		if (gpu_addr != NULL)
 			*gpu_addr = radeon_bo_gpu_offset(bo);
 	}
-	if (unlikely(r != 0))
+	if (unlikely(r != 0)) {
 		dev_err(bo->rdev->dev, "%p pin failed\n", bo);
+	} else {
+		mutex_lock(&bo->rdev->placement_mutex);
+		list_del_init(&bo->plist);
+		bo->head = NULL;
+		mutex_unlock(&bo->rdev->placement_mutex);
+	}
 	return r;
 }
 
@@ -353,11 +365,200 @@ void radeon_bo_list_add_object(struct radeon_bo_list *lobj,
 	}
 }
 
-int radeon_bo_list_validate(struct list_head *head)
+static inline int list_is_first(const struct list_head *list,
+				const struct list_head *head)
+{
+	return list->prev == head;
+}
+
+static inline void list_exchange(struct list_head *list1,
+				 struct list_head *list2)
+{
+	struct list_head *tmp;
+
+	tmp = list1->next;
+	list1->next = list2->next;
+	list1->next->prev = list1;
+	list2->next = tmp;
+	list2->next->prev = list2;
+
+	tmp = list1->prev;
+	list1->prev = list2->prev;
+	list1->prev->next = list1;
+	list2->prev = tmp;
+	list2->prev->next = list2;
+}
+
+void radeon_placement_work_handler(struct work_struct *work)
+{
+	struct radeon_device *rdev;
+	struct radeon_bo *rbo, *movein = NULL;
+	struct radeon_bo *moveout[RADEON_PLACEMENT_MAX_EVICTION];
+	unsigned ceviction = 0;
+	unsigned long cjiffies = jiffies, size = 0;
+	unsigned long elapsed_ms, eelapsed_ms;
+	int r, i;
+
+	rdev = container_of(work, struct radeon_device, placement_work.work);
+	mutex_lock(&rdev->placement_mutex);
+	if (!list_empty(&rdev->wvram_in_list)) {
+		movein = list_first_entry(&rdev->wvram_in_list, struct radeon_bo, plist);
+	}
+	if (movein == NULL && !list_empty(&rdev->rvram_in_list)) {
+		movein = list_first_entry(&rdev->rvram_in_list, struct radeon_bo, plist);
+	}
+	if (movein == NULL) {
+		/* nothing is waiting to move in so do nothing */
+		goto out;
+	}
+	if (time_after(movein->last_use_jiffies, cjiffies)) {
+		/* wrap around */
+		movein->last_use_jiffies = 0;
+	}
+	elapsed_ms = jiffies_to_msecs(cjiffies - movein->last_use_jiffies);
+	/* try to evict read buffer first */
+	list_for_each_entry(rbo, &rdev->rvram_out_list, plist) {
+		if (time_after(rbo->last_use_jiffies, cjiffies)) {
+			/* wrap around */
+			rbo->last_use_jiffies = 0;
+		}
+		eelapsed_ms = jiffies_to_msecs(cjiffies - rbo->last_use_jiffies);
+		if (eelapsed_ms > (elapsed_ms + 50)) {
+			/* haven't been use in at least the last 50ms compared to
+			 * the move in one
+			 */
+			r = radeon_bo_reserve(rbo, false);
+			if (!r) {
+				moveout[ceviction++] = rbo;
+			}
+		}
+		if (ceviction >= RADEON_PLACEMENT_MAX_EVICTION) {
+			goto out;
+		}
+	}
+	if (ceviction >= RADEON_PLACEMENT_MAX_EVICTION) {
+		goto out;
+	}
+	list_for_each_entry(rbo, &rdev->wvram_out_list, plist) {
+		if (time_after(rbo->last_use_jiffies, cjiffies)) {
+			/* wrap around */
+			rbo->last_use_jiffies = 0;
+		}
+		eelapsed_ms = jiffies_to_msecs(cjiffies - rbo->last_use_jiffies);
+		if (eelapsed_ms > (elapsed_ms + 50)) {
+			/* haven't been use in at least the last 50ms compared to
+			 * the move in one
+			 */
+			r = radeon_bo_reserve(rbo, false);
+			if (!r) {
+				moveout[ceviction++] = rbo;
+			}
+		}
+		if (ceviction >= RADEON_PLACEMENT_MAX_EVICTION) {
+			goto out;
+		}
+	}
+out:
+	mutex_unlock(&rdev->placement_mutex);
+	for (i = 0; i < ceviction; i++) {
+		if (!moveout[i]->pin_count) {
+			radeon_ttm_placement_from_domain(moveout[i], RADEON_GEM_DOMAIN_GTT);
+			r = ttm_bo_validate(&moveout[i]->tbo, &moveout[i]->placement,
+					    true, true, true);
+			if (!r) {
+				size += moveout[i]->tbo.mem.num_pages << PAGE_SHIFT;
+			}
+		}
+		radeon_bo_unreserve(moveout[i]);
+	}
+	DRM_INFO("vram out (%8ldMB %8ldKB) vram in (%8ldMB %8ldKB)\n",
+		 size >> 20, size >> 10,
+		 rdev->vram_in_size >> 20, rdev->vram_in_size >> 10);
+	rdev->vram_in_size = 0;
+}
+
+static void radeon_bo_placement_promote_locked(struct radeon_bo *rbo, unsigned wdomain, unsigned rdomain)
+{
+	struct radeon_device *rdev = rbo->rdev;
+	unsigned long cjiffies, elapsed_ms;
+
+	cjiffies = jiffies;
+	if (wdomain & RADEON_GEM_DOMAIN_VRAM) {
+		if (time_after(rbo->last_use_jiffies, cjiffies)) {
+			/* wrap around */
+			rbo->last_use_jiffies = 0;
+		}
+		elapsed_ms = jiffies_to_msecs(cjiffies - rbo->last_use_jiffies);
+
+		if (list_empty(&rbo->plist) || rbo->head != &rdev->wvram_in_list) {
+			list_del_init(&rbo->plist);
+			list_add_tail(&rbo->plist, &rdev->wvram_in_list);
+			rbo->head = &rdev->wvram_in_list;
+		} else {
+			if (!list_is_first(&rbo->plist, &rdev->wvram_in_list)) {
+				struct radeon_bo *pbo;
+				unsigned long pelapsed_ms;
+
+				/* move up the list */
+				pbo = list_entry(rbo->plist.prev, struct radeon_bo, plist);
+				if (time_after(pbo->last_use_jiffies, cjiffies)) {
+					/* wrap around */
+					pbo->last_use_jiffies = 0;
+				}
+				pelapsed_ms = jiffies_to_msecs(cjiffies - pbo->last_use_jiffies);
+				if (pelapsed_ms > elapsed_ms) {
+					list_exchange(&rbo->plist, &pbo->plist);
+				}
+			}
+		}
+		rbo->last_use_jiffies = cjiffies;
+	} else if (rdomain & RADEON_GEM_DOMAIN_VRAM) {
+		if (time_after(rbo->last_use_jiffies, cjiffies)) {
+			/* wrap around */
+			rbo->last_use_jiffies = 0;
+		}
+		elapsed_ms = jiffies_to_msecs(cjiffies - rbo->last_use_jiffies);
+
+		if (list_empty(&rbo->plist) || rbo->head != &rdev->rvram_in_list) {
+			list_del_init(&rbo->plist);
+			list_add_tail(&rbo->plist, &rdev->rvram_in_list);
+			rbo->head = &rdev->rvram_in_list;
+		} else {
+			if (!list_is_first(&rbo->plist, &rdev->rvram_in_list)) {
+				struct radeon_bo *pbo;
+				unsigned long pelapsed_ms;
+
+				/* move up the list */
+				pbo = list_entry(rbo->plist.prev, struct radeon_bo, plist);
+				if (time_after(pbo->last_use_jiffies, cjiffies)) {
+					/* wrap around */
+					pbo->last_use_jiffies = 0;
+				}
+				pelapsed_ms = jiffies_to_msecs(cjiffies - pbo->last_use_jiffies);
+				if (pelapsed_ms > elapsed_ms) {
+					list_exchange(&rbo->plist, &pbo->plist);
+				}
+			}
+		}
+		rbo->last_use_jiffies = cjiffies;
+	}
+}
+
+static void radeon_bo_placement_update_locked(struct radeon_bo *rbo)
+{
+	if (rbo->head) {
+		list_move_tail(&rbo->plist, rbo->head);
+	} else {
+		list_del_init(&rbo->plist);
+		list_add_tail(&rbo->plist, &rbo->rdev->rvram_out_list);
+		rbo->head = &rbo->rdev->rvram_out_list;
+	}
+	rbo->last_use_jiffies = jiffies;
+}
+
+int radeon_bo_list_validate(struct radeon_device *rdev, struct list_head *head)
 {
 	struct radeon_bo_list *lobj;
 	struct radeon_bo *bo;
-	u32 domain;
 	int r;
 
 	r = ttm_eu_reserve_buffers(head);
@@ -367,6 +568,15 @@ int radeon_bo_list_validate(struct list_head *head)
 	list_for_each_entry(lobj, head, tv.head) {
 		bo = lobj->bo;
 		if (!bo->pin_count) {
+			if (bo->tbo.mem.mem_type != TTM_PL_VRAM) {
+				mutex_lock(&rdev->placement_mutex);
+				radeon_bo_placement_promote_locked(bo, lobj->wdomain, lobj->rdomain);
+				mutex_unlock(&rdev->placement_mutex);
+			} else {
+				mutex_lock(&rdev->placement_mutex);
+				radeon_bo_placement_update_locked(bo);
+				mutex_unlock(&rdev->placement_mutex);
+			}
 			r = ttm_bo_validate(&bo->tbo, &bo->placement,
 					true, false, false);
 			if (unlikely(r)) {
@@ -376,6 +586,8 @@ int radeon_bo_list_validate(struct list_head *head)
 		lobj->gpu_offset = radeon_bo_gpu_offset(bo);
 		lobj->tiling_flags = bo->tiling_flags;
 	}
+
+	schedule_delayed_work(&rdev->placement_work, msecs_to_jiffies(RADEON_PLACEMENT_WORK_MS));
 	return 0;
 }
 
@@ -558,11 +770,34 @@ void radeon_bo_move_notify(struct ttm_buffer_object *bo,
 			   struct ttm_mem_reg *mem)
 {
 	struct radeon_bo *rbo;
+	struct radeon_device *rdev;
+
 	if (!radeon_ttm_bo_is_radeon_bo(bo))
 		return;
 	rbo = container_of(bo, struct radeon_bo, tbo);
 	radeon_bo_check_tiling(rbo, 0, 1);
 	radeon_vm_bo_invalidate(rbo->rdev, rbo);
+	if (mem && mem->mem_type == TTM_PL_VRAM &&
+	    !(mem->placement & TTM_PL_FLAG_NO_EVICT)) {
+		rdev = rbo->rdev;
+		mutex_lock(&rdev->placement_mutex);
+		if (rbo->head == &rdev->wvram_in_list) {
+			list_del_init(&rbo->plist);
+			list_add_tail(&rbo->plist, &rdev->wvram_out_list);
+			rbo->head = &rdev->wvram_out_list;
+		} else {
+			list_del_init(&rbo->plist);
+			list_add_tail(&rbo->plist, &rdev->rvram_out_list);
+			rbo->head = &rdev->rvram_out_list;
+		}
+		mutex_unlock(&rdev->placement_mutex);
+		rdev->vram_in_size += rbo->tbo.mem.num_pages << PAGE_SHIFT;
+	} else {
+		rdev = rbo->rdev;
+		mutex_lock(&rdev->placement_mutex);
+		list_del_init(&rbo->plist);
+		rbo->head = NULL;
+		mutex_unlock(&rdev->placement_mutex);
+	}
 }
 
 int radeon_bo_fault_reserve_notify(struct ttm_buffer_object *bo)
diff --git a/drivers/gpu/drm/radeon/radeon_object.h b/drivers/gpu/drm/radeon/radeon_object.h
index 93cd491..7babfc9 100644
--- a/drivers/gpu/drm/radeon/radeon_object.h
+++ b/drivers/gpu/drm/radeon/radeon_object.h
@@ -128,7 +128,7 @@ extern int radeon_bo_init(struct radeon_device *rdev);
 extern void radeon_bo_fini(struct radeon_device *rdev);
 extern void radeon_bo_list_add_object(struct radeon_bo_list *lobj,
 				      struct list_head *head);
-extern int radeon_bo_list_validate(struct list_head *head);
+extern int radeon_bo_list_validate(struct radeon_device *rdev, struct list_head *head);
 extern int radeon_bo_fbdev_mmap(struct radeon_bo *bo,
 				struct vm_area_struct *vma);
 extern int radeon_bo_set_tiling_flags(struct radeon_bo *bo,
@@ -141,6 +141,7 @@ extern void radeon_bo_move_notify(struct ttm_buffer_object *bo,
 				  struct ttm_mem_reg *mem);
 extern int radeon_bo_fault_reserve_notify(struct ttm_buffer_object *bo);
 extern int radeon_bo_get_surface_reg(struct radeon_bo *bo);
+extern void radeon_placement_work_handler(struct work_struct *work);
 
 /*
  * sub allocation
On Thu, Nov 29, 2012 at 4:35 PM, <j.glisse@gmail.com> wrote:
So as a followup, here are two patches. The first one just stops trying to move objects at each CS ioctl; I believe it could be included in 3.7 as it improves performance (especially with VRAM changes from userspace).
The second one implements a VRAM eviction policy. It's a simple one: buffers used for write operations are more important than buffers used for read operations. A buffer gets evicted from VRAM only if it hasn't been used in the last 50ms (so in the last few frames) and only if there are buffers that have been recently used and that could be moved into VRAM. This is mostly where I believe discussion should happen: what kind of heuristic would work better than that.
First off, I didn't review the patches, because I'm not as familiar with the DRM. Just some comments.
Isn't 50ms too little? A frame already takes 50ms at 20 fps, so you always need at least 20 fps for that to make sense. I think 200ms would be more reasonable, but still that isn't perfect either.
Another option would be to use something other than elapsed time, something based on actual usage.
This weekend, I'll try to make a Mesa patch that sends an end-of-frame flag to the kernel through the CS ioctl. I think there would be at most 2 end-of-frame flags received by the kernel each frame: one from the GL app and the other one from the compositor.
Marek
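For reference, a rough sketch of what such a flag could look like on the kernel side; this is purely hypothetical, since no end-of-frame flag exists in the radeon CS ioctl at this point, and both RADEON_CS_END_OF_FRAME and the frame counter below are invented names for illustration:

/* Hypothetical sketch only: an end-of-frame notification via the CS ioctl.
 * Neither RADEON_CS_END_OF_FRAME nor frame_count exists in the driver; they
 * illustrate Marek's idea of aging buffers in frames rather than in ms. */
#define RADEON_CS_END_OF_FRAME	(1u << 31)	/* invented flag bit */

struct frame_clock {
	unsigned long frame_count;	/* bumped once per end-of-frame CS */
};

static void note_cs_flags(struct frame_clock *fc, unsigned int cs_flags)
{
	if (cs_flags & RADEON_CS_END_OF_FRAME)
		fc->frame_count++;	/* eviction becomes "unused for N frames" */
}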
On Thu, Nov 29, 2012 at 8:57 PM, Marek Olšák <maraeo@gmail.com> wrote:
On Thu, Nov 29, 2012 at 4:35 PM, <j.glisse@gmail.com> wrote:
So as a followup, here are two patches. The first one just stops trying to move objects at each CS ioctl; I believe it could be included in 3.7 as it improves performance (especially with VRAM changes from userspace).
The second one implements a VRAM eviction policy. It's a simple one: buffers used for write operations are more important than buffers used for read operations. A buffer gets evicted from VRAM only if it hasn't been used in the last 50ms (so in the last few frames) and only if there are buffers that have been recently used and that could be moved into VRAM. This is mostly where I believe discussion should happen: what kind of heuristic would work better than that.
First off, I didn't review the patches, because I'm not as familiar with the DRM. Just some comments.
Isn't 50ms too little? A frame already takes 50ms at 20 fps, so you always need at least 20 fps for that to make sense. I think 200ms would be more reasonable, but still that isn't perfect either.
Well, it's a tunable. I did try several values between 100ms and 500ms, and 500ms gave the best result. I think that's mostly because there is still ping-pong otherwise: one of the bos is big and gets evicted because it isn't used for a few frames, while the other bos moving in are not that big and it doesn't make much difference whether they are in VRAM or GTT; then the big one is suddenly used again, but this time it's in GTT. This is mostly guessing from what my printk shows for vram in and vram out.
Also, there are quite a few short-lived bos, and migrating any long-lived bo out of VRAM hurts.
Another option would be to use something other than elapsed time, something based on actual usage.
This weekend, I'll try to make a Mesa patch that sends an end-of-frame flag to the kernel through the CS ioctl. I think there would be at most 2 end-of-frame flags received by the kernel each frame: one from the GL app and the other one from the compositor.
Marek
I am not sure how much value that adds; if you take a time like 15ms and assume the program runs at 60Hz, you get a good approximation. I think that rather than using a time-since-last-use metric for eviction, I should instead compute use frequency. It would smooth things out: things might stay in VRAM a tiny bit longer than strictly necessary, but at the same time buffers with high-frequency usage would move in first. With my current algorithm a buffer might get lucky and move in just because it was used last.
But anyway, we could probably also play with a metric involving end of frame.
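For concreteness, a use-frequency metric could look something like the following; this is illustrative only, not part of either patch, and the decay constant is arbitrary:

/* Illustrative sketch: an exponentially decayed usage score as an
 * alternative to a raw last-use timestamp. Buffers with a high score
 * (used often and recently) would be moved into VRAM first. */
struct bo_usage {
	unsigned long score;		/* fixed point, 8 fractional bits */
	unsigned long last_seen_ms;
};

static void bo_usage_touch(struct bo_usage *u, unsigned long now_ms)
{
	unsigned long halvings = (now_ms - u->last_seen_ms) / 256;

	/* Decay: halve the score for every ~256ms the buffer sat idle,
	 * then credit this use. */
	u->score = (halvings < 8 * sizeof(u->score)) ? u->score >> halvings : 0;
	u->score += 1 << 8;
	u->last_seen_ms = now_ms;
}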
Cheers, Jerome