Hi list, Thomas,
I will be investigating the use of a hotness score for each bo, to replace the ping-pong causing LRU eviction in radeon*.
The goal is to put all bos that fit in VRAM there, in order of hotness; a new bo should only be placed there if its hotness score is greater than the lowest VRAM bo's. Then the lowest-hotness-bos in VRAM should be evicted until the new bo fits. This should result in a more stable set with less ping-pong.
Jerome advised that the bo placement should be done entirely outside TTM. As I'm not (yet) too familiar with that side of the kernel, what is the opinion of TTM folks?
- Lauri
* github.com/clbr/jamkthesis
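A minimal sketch of the placement policy described above, assuming each bo carries a single numeric hotness score. The `Bo` class, `place_in_vram` and the choice to evict only bos colder than the newcomer are illustrative assumptions, not radeon or TTM code.

```python
from dataclasses import dataclass


@dataclass
class Bo:
    name: str
    size: int        # any consistent unit (pages, bytes)
    hotness: float   # hypothetical score, higher means hotter


def place_in_vram(vram, capacity, new_bo):
    """Try to place new_bo in VRAM.

    vram is the list of resident bos and is modified in place.
    Returns the list of evicted bos, or None if new_bo is refused.
    """
    free = capacity - sum(b.size for b in vram)
    victims = []
    # Walk residents coldest first, stopping once the new bo fits or
    # the next resident is at least as hot as the newcomer.
    for b in sorted(vram, key=lambda b: b.hotness):
        if free >= new_bo.size or b.hotness >= new_bo.hotness:
            break
        victims.append(b)
        free += b.size
    if free < new_bo.size:
        return None  # not hot enough to make room: VRAM stays untouched
    for b in victims:
        vram.remove(b)
    vram.append(new_bo)
    return victims
```

Because a refused bo leaves the resident set unchanged, a cold newcomer cannot push out a hotter bo, which is the ping-pong the proposal wants to avoid.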
Hi!
On 12/05/2013 10:36 AM, Lauri Kasanen wrote:
[...]
There are a couple of things to be considered:
1) You need to decide where a bo to be validated should be placed. The driver can give a list of possible placements to TTM and let TTM decide, trying each placement in turn. A driver that thinks this isn't sufficient can come up with its own strategy and give only a single placement to TTM. If TTM can't satisfy that, it will give you an error back, and the driver will need to validate with an alternative placement. I think Radeon already does this? vmwgfx does it to some extent.
2) As you say, TTM is evicting strictly on an lru basis, and is maintaining one LRU list per memory type, and also a global swap lru list for buffers that are backed by system pages (not VRAM). I guess what you would want to do is to replace the VRAM lru list with a priority queue where bos are continuously sorted based on hotness. As long as you obey the locking rules:
*) Locking order is bo::reserve -> lru-lock
*) When walking the queue with the lru-lock held, you must therefore tryreserve if you want to reserve an object on the queue
*) bo:s need to be removed from the queue as soon as they are reserved
*) Don't remove a bo from the queue unless it is reserved
Nothing stops you from doing this in the driver, but OTOH if this ends up being useful for other drivers I'd prefer we put it into TTM.
Thanks, Thomas
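The four locking rules above can be modelled roughly as follows. This is a user-space sketch under stated assumptions: `threading.Lock` stands in for both the bo reservation and the lru-lock, and `EvictionQueue` with its method names is hypothetical, not a TTM interface.

```python
import threading


class Bo:
    def __init__(self, name, hotness):
        self.name = name
        self.hotness = hotness
        self.reserve = threading.Lock()  # stand-in for the bo reservation


class EvictionQueue:
    def __init__(self):
        self.lru_lock = threading.Lock()
        self.bos = []  # unreserved bos only, coldest first

    def add(self, bo):
        with self.lru_lock:
            self.bos.append(bo)
            self.bos.sort(key=lambda b: b.hotness)

    def reserve(self, bo):
        # Normal path: reservation first, then lru-lock (the allowed order).
        bo.reserve.acquire()
        with self.lru_lock:
            if bo in self.bos:
                self.bos.remove(bo)  # reserved bos leave the queue at once

    def unreserve(self, bo):
        # Still holding the reservation while taking lru-lock: allowed order.
        self.add(bo)
        bo.reserve.release()

    def pick_victim(self):
        # Walking with lru-lock held: blocking on a reservation here would
        # invert the lock order, so only a non-blocking attempt is made.
        with self.lru_lock:
            for bo in self.bos:
                if bo.reserve.acquire(blocking=False):
                    self.bos.remove(bo)  # removed only once it is reserved
                    return bo
        return None
```

`pick_victim` returns the coldest bo it could reserve without blocking, already reserved and off the queue, so a bo that another thread is busy with is skipped rather than waited on.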
On Thu, Dec 05, 2013 at 11:26:46AM +0100, Thomas Hellstrom wrote:
[...]
It will be useful to others; the point I am making is that others might not use ttm either, and there is nothing about bo placement that needs to be ttm specific.
Avoiding bo eviction from the lru list is just a matter of the driver never overcommitting bos on a pool of memory and doing eviction by itself, i.e. deciding on a new placement for a bo and moving that bo before moving in another bo, which can be done outside ttm.
The only thing that will need modification in ttm is the work done to control memory fragmentation, but this should not be enforced on all ttm users and should be a runtime decision. GPUs with a virtual address space can scatter a bo through vram by using vram pages, making memory fragmentation pretty much a non-issue (some GPUs still need contiguous memory for the scanout buffer or other specific buffers).
Cheers, Jerome
On 05-12-13 16:49, Jerome Glisse wrote:
[...]
You're correct it COULD be done like that, but that's a nasty workaround. Simply assign a priority to each buffer, then modify ttm_bo_add_to_lru, ttm_bo_swapout, ttm_mem_evict_first and be done with it.
Memory management is exactly the kind of thing that should be done in TTM, so why have something 'generic' for something that's little more than a renamed priority queue?
~Maarten
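A rough sketch of what "assign a priority to each buffer" could look like: one LRU list per priority level, with eviction draining the lowest level first. The Python method names only loosely echo ttm_bo_add_to_lru and ttm_mem_evict_first from the message above; the `PriorityLru` class and the number of levels are assumptions made for illustration.

```python
from collections import OrderedDict


class PriorityLru:
    """One LRU list per priority level; level 0 is evicted first."""

    def __init__(self, num_priorities):
        self.lists = [OrderedDict() for _ in range(num_priorities)]

    def remove(self, bo):
        for lru in self.lists:
            lru.pop(bo, None)

    def add_to_lru(self, bo, priority):
        # Re-adding a bo moves it to the most-recently-used end of its list.
        self.remove(bo)
        self.lists[priority][bo] = None

    def evict_first(self):
        # Lowest priority first, least recently used first within a level.
        for lru in self.lists:
            if lru:
                bo, _ = lru.popitem(last=False)
                return bo
        return None
```

Within a level the behaviour is plain LRU, so clients that never set a priority would see the current eviction order.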
On Thu, Dec 05, 2013 at 05:22:54PM +0100, Maarten Lankhorst wrote:
[...]
The end score and the use of the score for placement decisions can be done in ttm, but the whole score computation and the heuristics related to it should not.
Cheers, Jerome
On Thu, Dec 05, 2013 at 11:45:03AM -0500, Jerome Glisse wrote:
[...]
Btw, another thing to look at is the eviction roster in drm_mm. It's completely standalone; the only thing it requires is that you have a deterministic order to add objects to it and unroll them (but that can always be solved by putting objects on a temporary list).
That way if you have some big objects and a highly fragmented vram you don't end up evicting a big load of data, but just a perfectly-sized hole. All the scanning is linear, but ime with the implementation in i915.ko that's not really a real-world issue. The drm_mm roster supports all the same features as the normal block allocator, so range-restricted allocations (and everything else) also work. See evict_something in i915_gem_eviction.c for how it all works (yeah, no docs, but writing those for drm_mm.c is on my todo somewhere).
-Daniel
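The idea of evicting "just a perfectly-sized hole" can be illustrated with a toy model. This is not the drm_mm interface: `find_eviction_set` and `_first_hole` are hypothetical helpers, blocks are plain (start, length) pairs, and the add/unroll bookkeeping of the real scanner is not modelled. Blocks are scanned in LRU order, and after each one a linear walk checks whether treating all scanned blocks as free opens a large enough hole.

```python
def _first_hole(blocks, total, size, scanned):
    """Linear walk in address order; scanned blocks count as free space."""
    pos = 0  # start of the current free run
    for start, length in sorted(v for n, v in blocks.items() if n not in scanned):
        if start - pos >= size:
            return (pos, pos + size)
        pos = start + length
    if total - pos >= size:
        return (pos, pos + size)
    return None


def find_eviction_set(blocks, total, size, lru_order):
    """blocks: name -> (start, length); lru_order: least recently used first.

    Returns the names to evict to open a hole of `size`, or None.
    """
    if _first_hole(blocks, total, size, set()) is not None:
        return []  # a hole already exists, nothing to evict
    scanned = set()
    for name in lru_order:
        scanned.add(name)
        hole = _first_hole(blocks, total, size, scanned)
        if hole is not None:
            lo, hi = hole
            # Evict only the scanned blocks that overlap the chosen hole.
            return [n for n in lru_order if n in scanned
                    and blocks[n][0] < hi and blocks[n][0] + blocks[n][1] > lo]
    return None
```

With blocks a=(0,3), b=(4,5), c=(10,6) in a range of 16 and LRU order a, c, b, a request of size 6 evicts only c, whereas strict LRU eviction would have thrown out a first without gaining a usable hole.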
On 12/09/2013 06:28 PM, Daniel Vetter wrote:
[...]
The problem with combining this with TTM is that eviction by default doesn't take place under a mutex, so multiple threads may be traversing the LRU list more or less at the same time, evicting stuff.
However, when it comes to eviction, that's not really a behaviour we need to preserve. It would, IMO, be OK to take a "big" per-memory-type mutex around eviction, but then one would have to sort out how / whether swapping and delayed destruction would need to wait on that mutex as well....
/Thomas
Hi Lauri,
FYI, since the userspace driver sends end-of-frame markers to the kernel, the radeon kernel driver knows the current frame number and it can also save the frame number of the last use of each buffer. We should definitely use that to measure the buffer hotness, or just prevent eviction if the buffer was used recently (the last 2 or 3 frames) and you can drop the hotness calculations entirely.
Also, MSAA buffers and depth buffers should have higher probability of being placed in VRAM than other buffers, because their placement has higher impact on performance. They also tend to contain auxiliary data which significantly improve performance, like fast clear data, MSAA fragment coverage data, and hierarchical depth and stencil data. We can add a new ioctl which sets buffer usage flags.
One can say the same thing about colorbuffers too, but there's no easy way to distinguish between a colorbuffer and an ordinary texture which isn't used as a colorbuffer but is blitted from time to time.
Marek
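A small sketch of the frame-number idea above, assuming the kernel records the last frame in which each buffer was used. The `Bo` fields, the `USAGE_BIAS` table and the three-frame window are illustrative assumptions; the usage flags stand in for what the proposed ioctl might carry.

```python
from dataclasses import dataclass


@dataclass
class Bo:
    name: str
    last_used_frame: int
    usage: str = "texture"  # hypothetical usage flag


# Hypothetical bias, in frames: MSAA and depth buffers are treated as if
# they had been used more recently than they really were.
USAGE_BIAS = {"msaa": 2, "depth": 2, "texture": 0}


def eviction_order(bos, current_frame, protect_frames=3):
    """Return evictable bos, best eviction candidate first."""
    # Anything used within the last `protect_frames` frames is not evictable.
    candidates = [b for b in bos
                  if current_frame - b.last_used_frame >= protect_frames]
    return sorted(candidates,
                  key=lambda b: b.last_used_frame + USAGE_BIAS.get(b.usage, 0))
```

In this model a depth buffer last used in frame 1 is still evicted after an ordinary texture last used in frame 2, because the bias outweighs the one-frame difference.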
dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
On Mon, 9 Dec 2013 20:28:21 +0100 Marek Olšák maraeo@gmail.com wrote:
> Hi,
> FYI, since the userspace driver sends end-of-frame markers to the kernel, the radeon kernel driver knows the current frame number and it can also save the frame number of the last use of each buffer. We should definitely use that to measure the buffer hotness, or just prevent eviction if the buffer was used recently (the last 2 or 3 frames) and you can drop the hotness calculations entirely.
I think this would result in sub-optimal behavior with one client, but a workload larger than VRAM. If everything is needed in one frame, then this logic would almost randomly decide what gets to stay.
> Also, MSAA buffers and depth buffers should have higher probability of being placed in VRAM than other buffers, because their placement has higher impact on performance. They also tend to contain auxiliary data which significantly improve performance, like fast clear data, MSAA fragment coverage data, and hierarchical depth and stencil data. We can add a new ioctl which sets buffer usage flags.
Thanks, this info will be useful.
Note that the hotness calculation will be in userspace, as only there are the necessary counters available. So the finished hotness score will be passed to the kernel, instead of sending all the necessary data there. Ought to be less context switches that way.
- Lauri
On Mon, Dec 9, 2013 at 9:30 PM, Lauri Kasanen cand@gmx.com wrote:
[...]
> Note that the hotness calculation will be in userspace, as only there are the necessary counters available. So the finished hotness score will be passed to the kernel, instead of sending all the necessary data there. Ought to be less context switches that way.
This sounds good, but you will also need to update the DDX for everything up to and including Cayman. Hopefully the DDX doesn't emit IBs outside of glamor on Southern Islands and later chips.
Marek
On Mon, 2013-12-09 at 23:45 +0100, Marek Olšák wrote:
> On Mon, Dec 9, 2013 at 9:30 PM, Lauri Kasanen cand@gmx.com wrote:
>> Note that the hotness calculation will be in userspace, as only there are the necessary counters available. So the finished hotness score will be passed to the kernel, instead of sending all the necessary data there. Ought to be less context switches that way.
Sounds like this could be abused by userspace though...
> This sounds good, but you will also need to update the DDX for everything up to and including Cayman. Hopefully the DDX doesn't emit IBs outside of glamor on Southern Islands and later chips.
It doesn't.
On 10-12-13 01:49, Michel Dänzer wrote:
[...]
> Sounds like this could be abused by userspace though...
Of all the worries that exist, this is a non-issue. Userspace can simply queue a lot of draw calls that take 1 second each through the normal command submission methods, why would it need to tweak some obscure number to cause some eviction?
On Tue, 2013-12-10 at 12:03 +0100, Maarten Lankhorst wrote:
[...]
> Of all the worries that exist, this is a non-issue. Userspace can simply queue a lot of draw calls that take 1 second each through the normal command submission methods, why would it need to tweak some obscure number to cause some eviction?
That's not what I'm concerned about.
Consider e.g. a multiseat environment: Some users could patch their userspace drivers such that their buffers are more likely to stay in VRAM than those of other users.
I agree it's not a huge issue, I'm just saying we should try to make the score calculation as much as possible based on the actual usage of the buffers instead of on meta data provided by userspace.
On 11-12-13 04:04, Michel Dänzer wrote:
[...]
> That's not what I'm concerned about.
> Consider e.g. a multiseat environment: Some users could patch their userspace drivers such that their buffers are more likely to stay in VRAM than those of other users.
> I agree it's not a huge issue, I'm just saying we should try to make the score calculation as much as possible based on the actual usage of the buffers instead of on meta data provided by userspace.
Well, the easiest solution is to make the score only count as penalty, and set buffers that don't have the meta-data to maximum score. This preserves current behavior for clients that aren't score aware.
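One way to read the "penalty only" suggestion, as a minimal sketch: buffers without metadata get the maximum score, and a userspace-supplied value is clamped so it can only lower a buffer's standing. `MAX_SCORE` and `effective_score` are hypothetical names and values.

```python
MAX_SCORE = 255  # hypothetical ceiling, also the default for unaware clients


def effective_score(user_score=None):
    """Map an optional userspace score to the score used for eviction."""
    if user_score is None:
        return MAX_SCORE  # no metadata: keep today's behaviour
    # Clamp into [0, MAX_SCORE]: a client can volunteer a buffer for
    # eviction, but cannot rank it above a client that sent nothing.
    return max(0, min(int(user_score), MAX_SCORE))
```

Under this scheme a patched userspace driver gains nothing by inflating its scores, since the best it can reach is the value every score-unaware client already has.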
On 12/11/2013 08:57 AM, Maarten Lankhorst wrote:
[...]
I agree with Michel that some mechanism needs to be in place to stop user-space clients from effectively pinning buffers by giving them a certain score. Two other things:
1) A good memory manager should be able to guarantee that a certain amount of GPU-visible memory (albeit not necessarily VRAM) is available for an execbuf call, so that user-space knows when to flush. If, due to fragmentation or something else, this is hard to achieve with the normal (score-based or LRU) eviction mechanism, a typical implementation would lock out other execbuf processes, release all reservations, evict what's necessary and restart execbuf. In this "panic" case, I think the score-based eviction needs to be relaxed to allow new buffers in regardless of score.
2) If score is calculated in user-space, how are shared buffers handled?
Thanks, Thomas
dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel
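Thomas's point 1) above can be modelled in a few lines. What follows is a user-space Python sketch, not TTM or radeon code: `Bo`, `place` and the `panic` flag are hypothetical names invented for illustration, and it assumes that a higher score means a hotter buffer. In the normal path a new buffer may only displace colder buffers; in the "panic" path the score comparison is dropped so that the execbuf can make progress.

```python
from dataclasses import dataclass


@dataclass
class Bo:
    name: str
    size: int
    score: int  # hotness reported by userspace (assumed: higher = hotter)


def place(vram, capacity, new, panic=False):
    """Try to place `new` in `vram` (a list of Bo); return the evicted Bos.

    Returns None and leaves `vram` untouched if `new` cannot be placed.
    """
    free = capacity - sum(bo.size for bo in vram)
    victims = []
    # Consider the coldest resident buffers first.
    for bo in sorted(vram, key=lambda b: b.score):
        if free >= new.size:
            break
        # Normal path: never evict a buffer at least as hot as the new one.
        if not panic and bo.score >= new.score:
            break
        victims.append(bo)
        free += bo.size
    if free < new.size:
        return None
    for bo in victims:
        vram.remove(bo)
    vram.append(new)
    return victims
```

With two resident buffers scored 10 and 20 filling VRAM, a new buffer scored 5 is refused in the normal path but admitted with `panic=True`, at the cost of evicting the buffer scored 10. The locking out of other execbuf processes that Thomas describes is deliberately not modelled here.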
On Wed, 11 Dec 2013 12:04:05 +0900 Michel Dänzer michel@daenzer.net wrote:
Of all the worries that exist, this is a non-issue. Userspace can simply queue a lot of draw calls that take 1 second each through the normal command submission methods, why would it need to tweak some obscure number to cause some eviction?
That's not what I'm concerned about.
Consider e.g. a multiseat environment: Some users could patch their userspace drivers such that their buffers are more likely to stay in VRAM than those of other users.
I agree it's not a huge issue, I'm just saying we should try to make the score calculation as much as possible based on the actual usage of the buffers instead of on meta data provided by userspace.
We don't have that in the kernel. Only userspace has the accurate stats on usage. If we instead modified userspace to pass these stats, it would have the exact same issue of "what if somebody passes false data"?
Maarten:
Well, the easiest solution is to make the score only count as penalty, and set buffers that don't have the meta-data to maximum score. This preserves current behavior for clients that aren't score aware.
No, this would be the exact opposite: it would pin the old-userspace buffers, at the cost of possibly not letting proper-scored buffers in VRAM.
Thomas:
I agree with Michel that some mechanism needs to be in place to stop user-space clients from effectively pinning buffers by giving them a certain score.
I think the kernel just has to trust userspace on this. I can't think of any way of not involving userspace, so if somebody really wants to hack mesa to gain some fps advantage on a multiseat system, let them ;)
Basically, they can already hack mesa to pass invalid buffers that hang or crash the kernel. So we already trust userspace more than this new functionality would.
2) If score is calculated in user-space, how are shared buffers handled?
Good question, I don't know yet.
- Lauri
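The disagreement between Maarten and Lauri above comes down to eviction order. A minimal Python sketch of Maarten's "penalty only" proposal, assuming the lowest score is evicted first; `MAX_SCORE`, `effective_score` and `eviction_order` are hypothetical names, and the thread does not specify a score range:

```python
MAX_SCORE = 255  # hypothetical ceiling; the thread does not specify a range


def effective_score(reported):
    """Maarten's proposal: buffers without metadata (None) get the maximum."""
    return MAX_SCORE if reported is None else min(reported, MAX_SCORE)


def eviction_order(buffers):
    """Return buffer names, coldest (first to be evicted) first.

    `buffers` maps a name to its reported score, or None for legacy clients.
    """
    return sorted(buffers, key=lambda name: effective_score(buffers[name]))


buffers = {"legacy_pixmap": None, "hot_texture": 200, "cold_texture": 10}
print(eviction_order(buffers))
```

Under these assumptions the legacy buffer sorts last, so it is evicted only after every scored buffer. That is the pinning effect Lauri objects to.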
On 12/11/2013 12:35 PM, Lauri Kasanen wrote:
On Wed, 11 Dec 2013 12:04:05 +0900 Michel Dänzer michel@daenzer.net wrote:
Of all the worries that exist, this is a non-issue. Userspace can simply queue a lot of draw calls that take 1 second each through the normal command submission methods, why would it need to tweak some obscure number to cause some eviction?
That's not what I'm concerned about.
Consider e.g. a multiseat environment: Some users could patch their userspace drivers such that their buffers are more likely to stay in VRAM than those of other users.
I agree it's not a huge issue, I'm just saying we should try to make the score calculation as much as possible based on the actual usage of the buffers instead of on meta data provided by userspace.
We don't have that in the kernel. Only userspace has the accurate stats on usage. If we instead modified userspace to pass these stats, it would have the exact same issue of "what if somebody passes false data"?
Maarten:
Well, the easiest solution is to make the score only count as penalty, and set buffers that don't have the meta-data to maximum score. This preserves current behavior for clients that aren't score aware.
No, this would be the exact opposite: it would pin the old-userspace buffers, at the cost of possibly not letting proper-scored buffers in VRAM.
Thomas:
I agree with Michel that some mechanism needs to be in place to stop user-space clients from effectively pinning buffers by giving them a certain score.
I think the kernel just has to trust userspace on this. I can't think of any way of not involving userspace, so if somebody really wants to hack mesa to gain some fps advantage on a multiseat system, let them ;)
Basically, they can already hack mesa to pass invalid buffers that hang or crash the kernel. So we already trust userspace more than this new functionality would.
Yes, but these are two different things. Letting user-space pin buffers by design is building in a software DOS in the kernel. I don't think even Microsoft is allowing this, and AFAICT we've avoided that since the very dawn of kernel buffer management.
Not having a perfect command stream parser or proper GPU hang recovery mechanism is something else, and something we wish to have but don't at the moment.
Allowing a new type of DOS just because we have other flaws isn't moving things forward, but I guess in the end it's your choice.
Thanks, Thomas
On Wed, 11 Dec 2013 15:46:53 +0100 Thomas Hellstrom thellstrom@vmware.com wrote:
I think the kernel just has to trust userspace on this. I can't think of any way of not involving userspace, so if somebody really wants to hack mesa to gain some fps advantage on a multiseat system, let them ;)
Basically, they can already hack mesa to pass invalid buffers that hang or crash the kernel. So we already trust userspace more than this new functionality would.
Yes, but these are two different things. Letting user-space pin buffers by design is building in a software DOS in the kernel. I don't think even Microsoft is allowing this, and AFAICT we've avoided that since the very dawn of kernel buffer management.
Not having a perfect command stream parser or proper GPU hang recovery mechanism is something else, and something we wish to have but don't at the moment.
Allowing a new type of DOS just because we have other flaws isn't moving things forward, but I guess in the end it's your choice.
The worst case with the scoring is that a new client will work somewhat slower than it otherwise would. I wouldn't call this a DOS.
Instead I would compare it to nice levels. Still, I agree with your concern that a user could disturb another user. This wouldn't be an issue within a single user environment, as the user obviously wanted it if he went that far.
Perhaps we could solve that by taking the process's UID into account inside the kernel. If there are multiple UIDs with 3d processes running, reserve a chunk of VRAM for each?
- Lauri
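Lauri's per-UID idea above could look roughly like the following Python sketch. It assumes an equal split among the UIDs that currently have 3D processes running and treats each chunk as a hard partition; the thread says neither how a chunk would be sized nor whether it would be a cap or only a guaranteed minimum. `vram_budgets` and `may_allocate` are hypothetical names.

```python
def vram_budgets(vram_bytes, active_uids):
    """Split VRAM evenly among the distinct UIDs that have 3D clients running.

    Assumption: equal shares. Any remainder from the integer division is
    left unassigned.
    """
    uids = sorted(set(active_uids))
    if not uids:
        return {}
    share = vram_bytes // len(uids)
    return {uid: share for uid in uids}


def may_allocate(budgets, usage, uid, size):
    """True if `uid` stays within its own reserved chunk after allocating."""
    return usage.get(uid, 0) + size <= budgets.get(uid, 0)
```

With this partitioning a user who inflates scores can only displace their own buffers, since score-based eviction would run within each UID's chunk.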
On Tue, Dec 10, 2013 at 1:49 AM, Michel Dänzer michel@daenzer.net wrote:
On Mon, 2013-12-09 at 23:45 +0100, Marek Olšák wrote:
On Mon, Dec 9, 2013 at 9:30 PM, Lauri Kasanen cand@gmx.com wrote:
Note that the hotness calculation will be in userspace, as the necessary counters are only available there. So the finished hotness score will be passed to the kernel, instead of sending all the necessary data there. That ought to mean fewer context switches.
Sounds like this could be abused by userspace though...
Anything can be abused by userspace, but is there any security risk? I don't think so. The CS ioctl is way more dangerous than this.
Marek
On Mon, 9 Dec 2013 23:45:12 +0100 Marek Olšák maraeo@gmail.com wrote:
Note that the hotness calculation will be in userspace, as the necessary counters are only available there. So the finished hotness score will be passed to the kernel, instead of sending all the necessary data there. That ought to mean fewer context switches.
This sounds good, but you will also need to update the DDX for everything up to and including Cayman. Hopefully the DDX doesn't emit IBs outside of glamor on Southern Islands and later chips.
Do you mean to pass an empty score (0) for 2d buffers, or that 2d buffers should also get the calculation? I suppose this depends on which ioctl is used to pass it.
IMHO for 2d use the current behavior is ok, so passing a dummy value and falling back should be enough.
- Lauri
On Tue, Dec 10, 2013 at 12:59 PM, Lauri Kasanen cand@gmx.com wrote:
On Mon, 9 Dec 2013 23:45:12 +0100 Marek Olšák maraeo@gmail.com wrote:
Note that the hotness calculation will be in userspace, as the necessary counters are only available there. So the finished hotness score will be passed to the kernel, instead of sending all the necessary data there. That ought to mean fewer context switches.
This sounds good, but you will also need to update the DDX for everything up to and including Cayman. Hopefully the DDX doesn't emit IBs outside of glamor on Southern Islands and later chips.
Do you mean to pass an empty score (0) for 2d buffers, or that 2d buffers should also get the calculation? I suppose this depends on which ioctl is used to pass it.
IMHO for 2d use the current behavior is ok, so passing a dummy value and falling back should be enough.
It should be robust enough to handle 3D and 2D at the same time. Note that the DDX is responsible for putting OpenGL framebuffers on the screen, so even though it's not an OpenGL operation, it affects OpenGL performance.
Marek