Hi,
So I went over the whole fence and sync point stuff as it's becoming a pressing issue. I think we first need to agree on what problem we want to solve and what the requirements are to solve it.
Problem: explicit synchronization between different hardware blocks over a buffer object.
Requirements:
- Share common infrastructure.
- Allow optimal hardware command stream scheduling across hardware blocks.
- Allow Android sync points to be implemented on top of it.
- Handle/acknowledge exceptions (like the good old GPU lockup).
- Minimize driver changes.
Glossary:
- hardware timeline: timeline bound to a specific hardware block.
- pipeline timeline: timeline bound to a userspace rendering pipeline; each point on that timeline can be a composite of several different hardware timeline points.
- pipeline: abstract object representing a userspace application's graphic pipeline, i.e. each of the application's graphic operations.
- fence: specific point in a timeline where synchronization needs to happen.
So now, the current include/linux/fence.h implementation is, I believe, missing the objective by confusing hardware and pipeline timelines and by bolting fences to buffer objects, while what is really needed is a true and proper timeline for both hardware and pipelines. But before going further down that road, let me look at things and explain how I see them.
The current TTM fence has one and only one purpose: allow synchronization for buffer object moves, even though some drivers like radeon slightly abuse it and use it for things like lockup detection.
The new fence wants to expose an API that would allow some implementation of a timeline. For that it introduces callbacks and some hard requirements on what the driver has to expose: enable_signaling, [signaled], wait.
Each of those has to do work inside the driver to which the fence belongs, and each of those can be called from more or less unexpected contexts (with restrictions, like outside irq). So we end up with things like:
A possible interleaving across three processes (Process 1, 2, 3):

    I_A_schedule(fence0)
    CI_A_F_B_signaled(fence0)
    I_A_signal(fence0)
    CI_B_F_A_callback(fence0)
    CI_A_F_B_wait(fence0)

Lexicon:
    I_x      : in driver x (I_A == in driver A)
    CI_x_F_y : call in driver x from driver y (CI_A_F_B == call in driver A from driver B)
So this is a happy mess: everyone calls everyone, and this is bound to get messy. Yes, I know there are all kinds of requirements on what happens once a fence is signaled. But those requirements only look like they are trying to mitigate whatever mess can happen from the whole callback dance.
While I too was seduced by the whole callback idea a long time ago, I think it is a highly dangerous path to take, where the combinatorics of what could happen are bound to explode with the increase in the number of players.
So now back to how to solve the problem we are trying to address. First I want to make an observation: almost all GPUs that exist today have a command ring on which userspace command buffers are executed, and inside the command ring you can do something like:
if (condition) execute_command_buffer else skip_command_buffer
where condition is a simple expression (memory_address cop value) with cop one of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption that any GPU that slightly matters can do that. Those that can not should fix their command ring processor.
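Purely as an illustration of how cheap the fallback is for hardware that lacks such conditional ring execution, the same check can be evaluated on the CPU before submitting the command buffer. A minimal sketch (the enum and helper below are made-up names, not an existing kernel API):

enum seq_cop { SEQ_EQ, SEQ_LT, SEQ_GT, SEQ_LE, SEQ_GE };

/* Evaluate (memory_address cop value) exactly like the ring would. */
static bool seq_condition_met(const volatile u64 *memory_address,
                              enum seq_cop cop, u64 value)
{
    u64 cur = *memory_address;

    switch (cop) {
    case SEQ_EQ: return cur == value;
    case SEQ_LT: return cur <  value;
    case SEQ_GT: return cur >  value;
    case SEQ_LE: return cur <= value;
    case SEQ_GE: return cur >= value;
    }
    return false;
}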
With that in mind, I think the proper solution is implementing timelines and having fences be timeline objects with a way simpler API. For each hardware timeline, the driver provides a system memory address at which the latest signaled fence sequence number can be read. Each fence object is uniquely associated with both a hardware and a pipeline timeline. Each pipeline timeline has a wait queue.
When scheduling something that requires synchronization on a hardware timeline, a fence is created and associated with the pipeline timeline and the hardware timeline. Other hardware blocks that need to wait on a fence can use their command ring conditional execution to directly check the fence sequence number from the other hw block, so you do optimistic scheduling. If optimistic scheduling fails (which would be reported by a hw block specific solution and hidden), then things can fall back to a software CPU wait inside what could be considered the kernel thread of the pipeline timeline.
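For explicitness, a rough sketch of what such a hardware timeline could look like (field names are illustrative only):

struct hw_timeline {
    const char          *name;
    /* System memory location where the hardware block writes the
     * latest signaled fence sequence number. */
    volatile u64        *signaled_seq;
    /* Last sequence number handed out on this timeline. */
    u64                 emitted_seq;
    /* List of pending fences on this timeline. */
    struct list_head    fences;
};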
From an API point of view there is no inter-driver call. All the driver needs to do is wake up the pipeline timeline wait_queue when things are signaled or when things go sideways (gpu lockup).
So how to implement that with current drivers? Well, easy. Currently we assume implicit synchronization, so all we need is an implicit pipeline timeline per userspace process (note this does not prevent inter-process synchronization). Every time a command buffer is submitted it is added to the implicit timeline with the simple fence object:
struct fence {
    struct list_head    list_hwtimeline;
    struct list_head    list_pipetimeline;
    struct hw_timeline  *hw_timeline;
    uint64_t            seq_num;
    work_t              timedout_work;
    void                *csdata;
};
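The pipeline timeline side would be little more than a wait queue and a list of pending fences; something like this (again only a sketch with made-up field names), with a fence considered signaled once the hw timeline memory has passed its sequence number:

struct pipe_timeline {
    /* Woken by drivers from timeline_signal() or on gpu lockup. */
    wait_queue_head_t   wait_queue;
    /* Pending fences, linked through fence->list_pipetimeline. */
    struct list_head    fences;
};

static bool fence_is_signaled(struct fence *fence)
{
    return *fence->hw_timeline->signaled_seq >= fence->seq_num;
}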
So with a set of helper functions called by each driver's command execution ioctl, you have the implicit timeline properly populated, and each driver command execution gets its dependencies from the implicit timeline.
Of course, to take full advantage of all the flexibility this could offer, we would need to allow userspace to create pipeline timelines and to schedule against the pipeline timeline of its choice. We could create a file for each pipeline timeline and have file operations to wait on / query progress.
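Such a per-timeline file could look roughly like the following (purely illustrative; the handlers named here do not exist and would have to be written):

/* poll() would block on pipetimeline->wait_queue, read() would return the
 * last completed sequence number on the pipeline timeline. */
static const struct file_operations pipe_timeline_fops = {
    .owner      = THIS_MODULE,
    .poll       = pipe_timeline_poll,
    .read       = pipe_timeline_read,
    .release    = pipe_timeline_release,
};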
Note that gpu lockups are considered exceptional events; the implicit timeline will probably want to continue with other jobs on other hardware blocks, but the explicit one will probably want to decide whether to continue, abort, or retry without the faulty hw block.
I realize I am late to the party and that I should have taken a serious look at all this a long time ago. I apologize for that, and if you consider this is too late then just ignore me, modulo the big warning about the craziness that callbacks will introduce and how bad things are bound to happen. I am not saying that bad things can not happen with what I propose, just that, because everything happens inside the process context that is asking/requiring synchronization, there will be no inter-process kernel callback (a callback that was registered by one process and that is called inside another process' time slice because fence signaling is happening inside that other process' time slice).
Pseudo-code for explicitness:

drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
{
    struct fence *dependency[16], *fence;
    int m;

    m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
                          default_timeout, dependency, 16, &fence);
    if (m < 0)
        return m;
    if (m >= 16) {
        // alloc m and recall;
    }
    dev->cs_ioctl(dev, data, filp, filp->implicit_pipeline, dependency, fence);
}
int timeline_schedule(ptimeline, hwtimeline, timeout, dependency, mdep, **fence)
{
    // allocate fence, set hw_timeline and init work
    // build up list of dependency by looking at list of pending fence in
    // timeline
}
// If the device driver schedules a job hoping for all dependencies to be
// signaled, then it must also call this function with csdata being a copy of
// what needs to be executed once all dependencies are signaled.
void timeline_missed_schedule(timeline, fence, void *csdata)
{
    INIT_WORK(&fence->timedout_work, timeline_missed_schedule_worker);
    fence->csdata = csdata;
    schedule_delayed_work(&fence->timedout_work, default_timeout);
}
void timeline_missed_schedule_worker(work)
{
    fence = container_of(work, struct fence, timedout_work);
    driver = driver_from_fence_hwtimeline(fence);

    // Make sure that each of the hwtimeline dependencies will fire an irq by
    // calling a driver function.
    timeline_wait_for_fence_dependency(fence);
    driver->execute_cs(driver, fence);
}
// This function is called by the driver code that signals fences (it could be
// called from interrupt context). It is the responsibility of the device
// driver to call this function.
void timeline_signal(hwtimeline)
{
    for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
        wakeup(fence->pipetimeline->wait_queue);
    }
}
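As a usage sketch (driver and field names invented for the example), a driver's fence interrupt handler then has nothing else to do than wake the waiters, since the hardware already wrote the new sequence number into the hw timeline memory:

static void foo_fence_irq_handler(struct foo_device *fdev, int ring)
{
    struct hw_timeline *hwtimeline = &fdev->ring_timeline[ring];

    /* *hwtimeline->signaled_seq was already updated by the hardware;
     * just wake every pipeline timeline that has a fence pending on
     * this hw timeline. */
    timeline_signal(hwtimeline);
}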
Cheers, Jérôme
On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
Btw, as an extra note: because of the implicit timeline, any shared object scheduled on a hw timeline must add a fence to all the implicit timelines where this object exists.
Also there is no need to have a fence pointer per object.
Cheers, Jérôme
Hi Jerome,
first of all, that finally sounds like somebody is starting to draw the whole picture for me.

So far all I have seen has been a bunch of specialized requirements and some not so obvious design decisions based on those requirements.

So thanks a lot for finally summarizing the requirements from a top-level view, and I perfectly agree with your analysis of the current fence design and the downsides of that API.
Apart from that I also have some comments / requirements that hopefully can be taken into account as well:
> pipeline timeline: timeline bound to a userspace rendering pipeline; each point on that timeline can be a composite of several different hardware timeline points.
> pipeline: abstract object representing a userspace application's graphic pipeline, i.e. each of the application's graphic operations.
In the long term a requirement for the driver for AMD GFX hardware is that instead of a fixed pipeline timeline we need a bit more flexible model where concurrent execution on different hardware engines is possible as well.
So the requirement is that you can do things like submitting a 3D job A, a DMA job B, a VCE job C and another 3D job D that are executed like this:

      A
     / \
    B   C
     \ /
      D
(Let's just hope that looks as good on your mail client as it looked for me).
My current thinking is that we avoid having a pipeline object in the kernel and instead let userspace specify which fence we want to synchronize to explicitly, as long as everything stays within the same client. As soon as any buffer is shared between clients the kernel would need to fall back to implicit synchronization to allow backward compatibility with DRI2/3.
> if (condition) execute_command_buffer else skip_command_buffer
>
> where condition is a simple expression (memory_address cop value) with cop one of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption that any GPU that slightly matters can do that. Those that can not should fix their command ring processor.
At least for some engines on AMD hardware that isn't possible (UVD, VCE and to some extent DMA as well), but I don't see any reason why we shouldn't be able to use software based scheduling on those engines by default. So this isn't really a problem, just an additional comment to keep in mind.
Regards, Christian.
On Wed, Aug 13, 2014 at 09:59:26AM +0200, Christian König wrote:
> In the long term a requirement for the driver for AMD GFX hardware is that instead of a fixed pipeline timeline we need a bit more flexible model where concurrent execution on different hardware engines is possible as well.
>
> So the requirement is that you can do things like submitting a 3D job A, a DMA job B, a VCE job C and another 3D job D that are executed like this:
>
>       A
>      / \
>     B   C
>      \ /
>       D
>
> (Let's just hope that looks as good on your mail client as it looked for me).
My thinking on hw timelines is that a GPU like AMD or NVIDIA would have several different hw timelines. They are per block/engine, so one for the DMA ring, one for gfx, one for VCE, ...
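In other words, something along these lines (names are illustrative only):

/* One hardware timeline per engine/ring of the device. */
enum foo_ring {
    FOO_RING_GFX,
    FOO_RING_DMA,
    FOO_RING_UVD,
    FOO_RING_VCE,
    FOO_RING_NUM,
};

struct foo_device {
    /* ... */
    struct hw_timeline ring_timeline[FOO_RING_NUM];
};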
> My current thinking is that we avoid having a pipeline object in the kernel and instead let userspace specify which fence we want to synchronize to explicitly, as long as everything stays within the same client. As soon as any buffer is shared between clients the kernel would need to fall back to implicit synchronization to allow backward compatibility with DRI2/3.
The whole issue is that today the cs ioctl assumes implied synchronization. So this can not change; for now, anything that goes through the cs ioctl would need to use an implied timeline and have all rings that use a common buffer synchronize on it. As long as those rings use different buffers there is no need for sync.
Buffer objects are what link hw timelines.
Of course there might be a way to be more flexible if timelines are exposed to userspace and userspace can create several of them for a single process.
>> if (condition) execute_command_buffer else skip_command_buffer
>>
>> where condition is a simple expression (memory_address cop value) with cop one of the generic comparisons (==, <, >, <=, >=). I think it is a safe assumption that any GPU that slightly matters can do that. Those that can not should fix their command ring processor.
>
> At least for some engines on AMD hardware that isn't possible (UVD, VCE and to some extent DMA as well), but I don't see any reason why we shouldn't be able to use software based scheduling on those engines by default. So this isn't really a problem, just an additional comment to keep in mind.
Yes, not everything can do that, but as it's a simple memory access with a simple comparison, it's easy to do on the CPU for limited hardware. But this really sounds like something so easy to add to hw ring execution that it is a shame hw designers have not already added such a thing.
On 13.08.2014 at 15:41, Jerome Glisse wrote:
> The whole issue is that today the cs ioctl assumes implied synchronization. So this can not change; for now, anything that goes through the cs ioctl would need to use an implied timeline and have all rings that use a common buffer synchronize on it. As long as those rings use different buffers there is no need for sync.
Exactly my thoughts.
> Buffer objects are what link hw timelines.
A couple of people at AMD have a problem with that and I'm currently working full time on a solution. But solving this and keeping 100% backward compatibility at the same time is not an easy task.
> Of course there might be a way to be more flexible if timelines are exposed to userspace and userspace can create several of them for a single process.
Concurrent execution is mostly used for temporary things, e.g. copying a result to a userspace buffer while VCE is decoding into the ring buffer at a different location. Creating an extra timeline just to tell the kernel that two commands are allowed to run in parallel sounds like too much overhead to me.
Cheers, Christian.
On Wed, Aug 13, 2014 at 04:08:14PM +0200, Christian König wrote:
>> The whole issue is that today the cs ioctl assumes implied synchronization. So this can not change; for now, anything that goes through the cs ioctl would need to use an implied timeline and have all rings that use a common buffer synchronize on it. As long as those rings use different buffers there is no need for sync.
>
> Exactly my thoughts.
>
>> Buffer objects are what link hw timelines.
>
> A couple of people at AMD have a problem with that and I'm currently working full time on a solution. But solving this and keeping 100% backward compatibility at the same time is not an easy task.
Let me rephrase: with the current cs ioctl, forcing synchronization for every buffer that appears on different hw rings is mandatory, and there is no way to fix that.
That being said, one can imagine a single buffer where one engine works on a region of it and another hw block on another non-overlapping region, in which case there is no need for synchronization between those different hw blocks (like multi-GPU, each rendering one half of the screen). But to properly do such a thing you need to expose timelines or something like that to userspace and have userspace emit sync points on this timeline. So something like:
cs_ioctl(timeline, cs) { return csno + hwtimeline_id; }
timeline_sync(nsync, seqno[], hwtimeline_id[])
When you schedule something using a new ioctl that just takes a timeline as an extra parameter, you add no synchronization to the timeline; you assume that userspace will call timeline_sync, which will insert a synchronization point on the timeline. So you can schedule a bunch of cs on different hw blocks, userspace keeps track of the last emitted cs seqno and its associated hw timeline, and when it wants to synchronize it calls the timeline sync, and any new cs ioctl on that timeline will have to wait before being able to schedule.
So really I do not see how to fix that properly without a new cs ioctl that just takes an extra timeline as a parameter (well, in the case of radeon we can add a timeline chunk to the cs ioctl).
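To make it concrete, the uapi could look vaguely like the following; this is only a sketch, none of these structures exist and the field layout is invented:

/* Timeline chunk attached to the cs ioctl: which pipeline timeline to
 * schedule this command stream on. */
struct drm_timeline_cs_chunk {
    __u32   timeline_handle;
    __u32   pad;
};

/* Returned by the cs ioctl: the point this submission occupies. */
struct drm_timeline_cs_out {
    __u64   seqno;
    __u32   hwtimeline_id;
    __u32   pad;
};

/* Explicit sync point: any later cs on this timeline waits for the
 * nsync (seqno, hwtimeline_id) pairs pointed to below. */
struct drm_timeline_sync {
    __u32   timeline_handle;
    __u32   nsync;
    __u64   seqno_ptr;          /* u64 array in userspace */
    __u64   hwtimeline_id_ptr;  /* u32 array in userspace */
};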
>> Of course there might be a way to be more flexible if timelines are exposed to userspace and userspace can create several of them for a single process.
>
> Concurrent execution is mostly used for temporary things, e.g. copying a result to a userspace buffer while VCE is decoding into the ring buffer at a different location. Creating an extra timeline just to tell the kernel that two commands are allowed to run in parallel sounds like too much overhead to me.
It was never my intention to create different timelines; as said above, when scheduling with an explicit timeline there is no synchronization whatsoever each time you schedule. The only time an engine has to wait is when userspace emits an explicit sync point, as said above.
Allowing multiple timelines per process is more for the case where you have a process working on two distinct problems with no interdependency. Hence one timeline for each of those tasks, but inside each task you can schedule things concurrently as said above.
Cheers, Christian.
Am 13.08.2014 um 15:41 schrieb Jerome Glisse:
On Wed, Aug 13, 2014 at 09:59:26AM +0200, Christian König wrote:
Hi Jerome,
first of all that finally sounds like somebody starts to draw the whole picture for me.
So far all I have seen was a bunch of specialized requirements and some not so obvious design decisions based on those requirements.
So thanks a lot for finally summarizing the requirements from a top above view and I perfectly agree with your analysis of the current fence design and the downsides of that API.
Apart from that I also have some comments / requirements that hopefully can be taken into account as well:
pipeline timeline: timeline bound to a userspace rendering pipeline, each point on that timeline can be a composite of several different hardware pipeline point. pipeline: abstract object representing userspace application graphic pipeline of each of the application graphic operations.
In the long term a requirement for the driver for AMD GFX hardware is that instead of a fixed pipeline timeline we need a bit more flexible model where concurrent execution on different hardware engines is possible as well.
So the requirement is that you can do things like submitting a 3D job A, a DMA job B, a VCE job C and another 3D job D that are executed like this: A / \ B C \ / D
(Let's just hope that looks as good on your mail client as it looked for me).
My thinking of hw timeline is that a gpu like amd or nvidia would have several different hw timeline. They are per block/engine so one for dma ring, one for gfx, one for vce, ....
My current thinking is that we avoid having a pipeline object in the kernel and instead letting userspace specify which fence we want to synchronize to explicitly as long as everything stays withing the same client. As soon as any buffer is shared between clients the kernel we would need to fall back to implicitly synchronization to allow backward compatibility with DRI2/3.
The whole issue is that today cs ioctl assume implied synchronization. So this can not change, so for now anything that goes through cs ioctl would need to use an implied timeline and have all ring that use common buffer synchronize on it. As long as those ring use different buffer there is no need for sync.
Buffer object are what links hw timeline.
Of course there might be way to be more flexible if timeline are expose to userspace and userspace can create several of them for a single process.
if (condition) execute_command_buffer else skip_command_buffer
where condition is a simple expression (memory_address cop value)) with cop one of the generic comparison (==, <, >, <=, >=). I think it is a safe assumption that any gpu that slightly matter can do that. Those who can not should fix there command ring processor.
At least for some engines on AMD hardware that isn't possible (UVD, VCE and in some extends DMA as well), but I don't see any reason why we shouldn't be able to use software based scheduling on those engines by default. So this isn't really a problem, but just an additional comment to keep in mind.
Yes, not every engine can do that, but since the condition is a simple memory access plus a simple comparison, it is easy to evaluate on the CPU for limited hardware. It really sounds like something so easy to add to hw ring execution that it is a shame hw designers have not already added it.
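To make that CPU-side evaluation concrete, here is a minimal sketch of the check; the seq_page pointer and the helper name are assumptions for illustration, the only real requirement being the shared page into which a hw timeline's last signaled sequence number is written:

    #include <linux/compiler.h>  /* READ_ONCE() */
    #include <linux/types.h>

    /*
     * Sketch: evaluate the same "memory_address cop value" condition on the
     * CPU for engines that cannot evaluate it from their command ring.
     * seq_page is assumed to be the kernel mapping of the page into which
     * the hardware (or its irq handler) writes the last signaled seqno.
     */
    static bool hw_seqno_passed(const u32 *seq_page, u32 seqno)
    {
        /* Lockless read: a single writer produces monotonically
         * increasing values, so an unsynchronized read is enough. */
        u32 last = READ_ONCE(*seq_page);

        /* Signed distance comparison handles sequence number wrap. */
        return (s32)(last - seqno) >= 0;
    }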
Regards, Christian.
On 13.08.2014 at 00:13, Jerome Glisse wrote:
When scheduling something that requires synchronization on a hardware timeline, a fence is created and associated with both the pipeline timeline and the hardware timeline. Other hardware blocks that need to wait on a fence can use their command ring's conditional execution to directly check the fence sequence number of the other hw block, so you get optimistic scheduling. If optimistic scheduling fails (which would be reported by a hw-block-specific mechanism and hidden), things can fall back to a software cpu wait inside what could be considered the kernel thread of the pipeline timeline.
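A minimal sketch of that software fallback, assuming the pipeline timeline wait queue from the proposal and a lockless check like the hw_seqno_passed() helper sketched earlier; names and the timeout policy are illustrative only:

    #include <linux/errno.h>
    #include <linux/types.h>
    #include <linux/wait.h>

    /*
     * Sketch: when optimistic (in-ring) scheduling is not possible or has
     * failed, block in the pipeline timeline's own context until the other
     * hw block has signaled the sequence number, or until we suspect a
     * lockup and hand over to the exception handling path.
     */
    static int pipeline_cpu_wait(wait_queue_head_t *wq,
                                 const u32 *seq_page, u32 seqno,
                                 unsigned long timeout_jiffies)
    {
        long r;

        /* Woken by timeline_signal() when the driver signals the hw
         * timeline, or when things go sideways (gpu lockup). */
        r = wait_event_interruptible_timeout(*wq,
                                             hw_seqno_passed(seq_page, seqno),
                                             timeout_jiffies);
        if (r < 0)
            return r;               /* interrupted by a signal */
        if (r == 0)
            return -ETIMEDOUT;      /* lockup candidate */
        return 0;
    }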
From an api point of view there is no inter-driver call. All a driver needs to do is wake up the pipeline timeline wait_queue when things are signaled or when things go sideways (gpu lockup).
So how to implement that with current drivers? Well, easy. Currently we assume implicit synchronization, so all we need is an implicit pipeline timeline per userspace process (note this does not prevent inter-process synchronization). Every time a command buffer is submitted it is added to the implicit timeline with the simple fence object:
struct fence {
    struct list_head list_hwtimeline;
    struct list_head list_pipetimeline;
    struct hw_timeline *hw_timeline;
    uint64_t seq_num;
    work_t timedout_work;
    void *csdata;
};
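For context, one possible shape of the two timeline objects this fence points into; this is purely a sketch built on assumptions (all field names invented), not part of the proposal itself:

    #include <linux/list.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>
    #include <linux/wait.h>

    /* One per hardware block (dma ring, gfx, vce, ...). */
    struct hw_timeline {
        const char *name;
        u32 *seq_page;           /* kernel mapping of the page the hw (or its
                                  * irq handler) writes the last signaled seqno to */
        u64 last_emitted;        /* last sequence number handed out */
        struct list_head fences; /* pending fences, ordered by seq_num */
        spinlock_t lock;
    };

    /* One per userspace pipeline; implicitly one per process for now. */
    struct pipeline_timeline {
        wait_queue_head_t wait_queue; /* woken by timeline_signal() */
        struct list_head fences;      /* pending fences on this pipeline */
        spinlock_t lock;
    };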
So with a set of helper functions called by each driver's command execution ioctl you have an implicit timeline that is properly populated, and each driver command execution gets its dependencies from the implicit timeline.
Of course, to take full advantage of all the flexibility this could offer, we would need to allow userspace to create pipeline timelines and to schedule against the pipeline timeline of its choice. We could create a file for each pipeline timeline and have file operations to wait on and query progress.
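As a hedged sketch only, such a per-timeline file could look roughly like this; the poll semantics, the private_data layout and the pipeline_timeline_signaled() helper are all assumptions, not an agreed interface:

    #include <linux/fs.h>
    #include <linux/module.h>
    #include <linux/poll.h>

    /* Assumption: filp->private_data points at the pipeline timeline, and
     * pipeline_timeline_signaled() (hypothetical) reports whether every
     * fence scheduled so far on that timeline has signaled. */
    static unsigned int pipeline_timeline_poll(struct file *filp,
                                               struct poll_table_struct *wait)
    {
        struct pipeline_timeline *pt = filp->private_data;

        poll_wait(filp, &pt->wait_queue, wait);
        return pipeline_timeline_signaled(pt) ? POLLIN | POLLRDNORM : 0;
    }

    static const struct file_operations pipeline_timeline_fops = {
        .owner = THIS_MODULE,
        .poll  = pipeline_timeline_poll,
        /* an unlocked_ioctl could expose explicit wait / query operations */
    };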
Note that gpu lockups are considered exceptional events: the implicit timeline will probably want to continue with other jobs on other hardware blocks, while an explicit one will probably want to decide whether to continue, abort, or retry without the faulty hw block.
I realize I am late to the party and that I should have taken a serious look at all this a long time ago. I apologize for that, and if you consider it too late then just ignore me, modulo the big warning about the craziness that callbacks will introduce and how bad things are bound to happen. I am not saying that bad things cannot happen with what I propose, just that because everything happens inside the process context that is asking/requiring the synchronization, there will be no inter-process kernel callback (a callback that was registered by one process and that is called inside another process' time slice because fence signaling happens inside that other process' time slice).
Pseudo code for explicitness :
int drm_cs_ioctl_wrapper(struct drm_device *dev, void *data, struct file *filp)
{
    struct fence *dependency[16], *fence;
    int m;

    m = timeline_schedule(filp->implicit_pipeline, dev->hw_pipeline,
                          default_timeout, dependency, 16, &fence);
    if (m < 0)
        return m;
    if (m >= 16) {
        /* allocate an array of size m and call timeline_schedule() again */
    }
    return dev->cs_ioctl(dev, data, filp, filp->implicit_pipeline,
                         dependency, fence);
}
int timeline_schedule(struct pipeline_timeline *ptimeline,
                      struct hw_timeline *hwtimeline,
                      unsigned long timeout,
                      struct fence **dependency, int mdep,
                      struct fence **fence)
{
    /* Allocate the fence, set its hw_timeline and init its work item.
     * Build up the dependency array by looking at the list of pending
     * fences in the pipeline timeline; return how many there are. */
}
/* If the device driver schedules the job hoping for all dependencies to be
 * signaled in time, it must also call this function, with csdata being a copy
 * of what needs to be executed once all dependencies are signaled. */
void timeline_missed_schedule(struct pipeline_timeline *timeline,
                              struct fence *fence, void *csdata)
{
    INIT_DELAYED_WORK(&fence->timedout_work, timeline_missed_schedule_worker);
    fence->csdata = csdata;
    schedule_delayed_work(&fence->timedout_work, default_timeout);
}
void timeline_missed_schedule_worker(struct work_struct *work)
{
    struct fence *fence = container_of(work, struct fence, timedout_work);
    driver = driver_from_fence_hwtimeline(fence);

    /* Make sure each of the hw timeline dependencies will fire an irq by
     * calling into the driver, then wait for them and execute the saved
     * command stream. */
    timeline_wait_for_fence_dependency(fence);
    driver->execute_cs(driver, fence);
}
/* This function is called by the driver code that signals fences (it can be
 * called from interrupt context).  It is the device driver's responsibility
 * to call it. */
void timeline_signal(struct hw_timeline *hwtimeline)
{
    struct fence *fence;

    for_each_fence(fence, hwtimeline->fences, list_hwtimeline) {
        wakeup(fence->pipetimeline->wait_queue);
    }
}
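For completeness, a hedged sketch of how a driver would feed this from its fence interrupt; mydrv_fence_irq, the device structure and its hw_timeline array are invented for illustration:

    /* Called from the driver's fence interrupt handler.  The hardware has
     * already written the new sequence number into the hw timeline's shared
     * page, so all that is left is waking every pipeline timeline that still
     * has fences queued on this hw timeline. */
    static void mydrv_fence_irq(struct mydrv_device *mdev, int ring)
    {
        timeline_signal(&mdev->hw_timeline[ring]);
    }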
Cheers, Jérôme
On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
So now, current include/linux/fence.h implementation is i believe missing the objective by confusing hardware and pipeline timeline and by bolting fence to buffer object while what is really needed is true and proper timeline for both hardware and pipeline. But before going further down that road let me look at things and explain how i see them.
fences can be used free-standing and no one forces you to integrate them with buffers. We actually plan to go this way with the intel svm stuff. Ofc for dma-buf the plan is to synchronize using such fences, but that's somewhat orthogonal I think. At least you only talk about fences and timeline and not dma-buf here.
So I read through it all and presuming I understand it correctly your proposal and what we currently have is about the same. The big difference is that you make a timeline a first-class object and move the callback queue from the fence to the timeline, which requires callers to check the fence/seqno/whatever themselves instead of pushing that responsibility onto the fence implementation's callbacks.
If you actually mandate that the fence is just a seqno or similar which can be read lockless then I could register my own special callback into that waitqueue (other stuff than waking up threads is allowed) and from hard-irq context check the seqno and readd my own callback if that's not yet happened (that needs to go through some other context for hilarity).
So from that pov (presuming I didn't miss anything) your proposal is identical to what we have, minus some different color choices (like where to place the callback queue).
I guess that leaves the topic of how you wait upon other fences, and the current stuff allows you to do that all from process context like you propose. And wrt the hardirq context callbacks I'm not worried at all since lockdep will yell at any driver writer who gets this wrong, so debugging this stuff won't be tricky. And you can always reject any such patches if they concern your own driver.
So the only substantial proposal I can distill here is to make timelines first-class citizens, and we can always do that later on. We actually discussed this, but didn't really see a point in having them around.
And yeah you're fairly late to the party, this stuff has been floating around and been discussed since years ;-)
Cheers, Daniel
On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
So I read through it all and presuming I understand it correctly your proposal and what we currently have is about the same. The big difference is that you make a timeline a first-class object and move the callback queue from the fence to the timeline, which requires callers to check the fence/seqno/whatever themselves instead of pushing that responsibility onto the fence implementation's callbacks.
No, the big difference is that there is no callback. Thus when waiting for a fence you are either inside the process context that needs to wait for it, or inside a kernel thread's process context, which means in both cases you can do whatever you want. What I hate about the fence code as it is, is the callback stuff, because you never know in which context fences are signaled and so you never know in which context the callbacks are executed.
If you actually mandate that the fence is just a seqno or similar which can be read lockless then I could register my own special callback into that waitqueue (other stuff than waking up threads is allowed) and from hard-irq context check the seqno and readd my own callback if that's not yet happened (that needs to go through some other context for hilarity).
Yes, mandating a simple number that can be read from anywhere without a lock. I am pretty sure all hw can write to a system page and can write a value alongside their command buffer. So either your hw supports reading and testing the value, or you can do it in atomic context right before scheduling.
So from that pov (presuming I didn't miss anything) your proposal is identical to what we have, minus some different color choices (like where to place the callback queue).
No callback is the mantra here, and instead of bolting free-living fences to buffer objects, they are associated with a timeline, which means you do not need to go over all buffer objects to know what you need to wait for.
I guess that leaves the topic of how you wait upon other fences, and the current stuff allows you to do that all from process context like you propose. And wrt the hardirq context callbacks I'm not worried at all since lockdep will yell at any driver writer who gets this wrong, so debugging this stuff won't be tricky. And you can always reject any such patches if they concern your own driver.
So the only substantial proposal I can distill here is to make timelines first-class citizens, and we can always do that later on. We actually discussed this, but didn't really see a point in having them around.
Again, the big difference is no callbacks and no fences bolted to buffer objects.
On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
No, the big difference is that there is no callback. Thus when waiting for a fence you are either inside the process context that needs to wait for it, or inside a kernel thread's process context, which means in both cases you can do whatever you want. What I hate about the fence code as it is, is the callback stuff, because you never know in which context fences are signaled and so you never know in which context the callbacks are executed.
Look at waitqueues a bit closer. They're implemented with callbacks ;-) The only difference is that you're allowed to have spurious wakeups and need to handle that somehow, so need a separate check function.
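To illustrate what Daniel is alluding to, here is a minimal sketch of a wait queue entry with a custom wake function; the surrounding my_waiter structure and its work item are invented for illustration:

    #include <linux/kernel.h>
    #include <linux/wait.h>
    #include <linux/workqueue.h>

    struct my_waiter {
        wait_queue_t wait;          /* renamed wait_queue_entry_t in newer kernels */
        struct work_struct work;    /* assumed set up with INIT_WORK() by the caller */
    };

    /* The custom wake function runs in the context of whoever calls
     * wake_up() on the queue, possibly hard irq context, so it only
     * queues follow-up work instead of doing anything heavy. */
    static int my_wake_func(wait_queue_t *wait, unsigned mode, int flags,
                            void *key)
    {
        struct my_waiter *w = container_of(wait, struct my_waiter, wait);

        schedule_work(&w->work);
        return 0;
    }

    static void my_waiter_arm(struct my_waiter *w, wait_queue_head_t *wq)
    {
        init_waitqueue_func_entry(&w->wait, my_wake_func);
        add_wait_queue(wq, &w->wait);
    }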
If you actually mandate that the fence is just a seqno or similar which can be read lockless then I could register my own special callback into that waitqueue (other stuff than waking up threads is allowed) and from hard-irq context check the seqno and readd my own callback if that's not yet happened (that needs to go through some other context for hilarity).
Yes, mandating a simple number that can be read from anywhere without a lock. I am pretty sure all hw can write to a system page and can write a value alongside their command buffer. So either your hw supports reading and testing the value, or you can do it in atomic context right before scheduling.
Imo that's a step in the wrong direction since reading a bit of system memory or checking a bit of irq-safe spinlock protected data or a register shouldn't matter. You just arbitrarily disallow that. And allowing random other kernel subsystems to read mmio or page mappings not under their control is an idea that freaks /me/ out.
So from that pov (presuming I didn't miss anything) your proposal is identical to what we have, minus some different color choices (like where to place the callback queue).
No callback is the mantra here, and instead of bolting free-living fences to buffer objects, they are associated with a timeline, which means you do not need to go over all buffer objects to know what you need to wait for.
Ok, then I guess I didn't understand that part of your proposal. Can you please elaborate a bit more on how you want to synchronize multiple drivers accessing a dma-buf object, and what piece of state we need to associate to the dma-buf to make this happen?
Thanks, Daniel
On Wed, Aug 13, 2014 at 05:54:20PM +0200, Daniel Vetter wrote:
On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
Look at waitqueues a bit closer. They're implemented with callbacks ;-) The only difference is that you're allowed to have spurious wakeups and need to handle that somehow, so need a separate check function.
No, this is not how wait queues are implemented. A wait queue does not call back a random function from a random driver; it calls back a limited set of functions from the core linux kernel scheduler, so that the process thread that was waiting and removed from the scheduler list is added back and marked as runnable. Unless this part of the kernel drastically changed for the worse recently.
So this is fundamentally different: fences as they are now allow random driver callbacks, and this is bound to get ugly. It is bound to lead to one driver doing something that seems innocuous but turns out to wreak havoc when called from some other driver's function.
If you actually mandate that the fence is just a seqno or similar which can be read lockless then I could register my own special callback into that waitqueue (other stuff than waking up threads is allowed) and from hard-irq context check the seqno and readd my own callback if that's not yet happened (that needs to go through some other context for hilarity).
Yes, mandating a simple number that can be read from anywhere without a lock. I am pretty sure all hw can write to a system page and can write a value alongside their command buffer. So either your hw supports reading and testing the value, or you can do it in atomic context right before scheduling.
Imo that's a step in the wrong direction since reading a bit of system memory or checking a bit of irq-safe spinlock protected data or a register shouldn't matter. You just arbitrarily disallow that. And allowing random other kernel subsystems to read mmio or page mappings not under their control is an idea that freaks /me/ out.
Let me make this crystal clear: this must be a valid kernel page that has a valid kernel mapping for the lifetime of the device. Hence there is no access to mmio space or anything, just a regular kernel page. If we can not rely on that, this is a sad world.
That being said, yes, I am aware that some devices are incapable of writing to such a page. For that dumb hardware, what you need to do is have the irq handler write to this page on behalf of the hardware. But I would like to know of any hardware that can not write a single dword from its ring buffer.
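A hedged sketch of that irq-handler fallback; the dumb_engine_read_completed_seqno() helper is invented, standing in for however a given engine reports how far it got, and the seq_page field follows the earlier struct sketch:

    #include <linux/compiler.h>  /* WRITE_ONCE() */
    #include <linux/types.h>

    /* Sketch: fence interrupt handler for an engine that cannot write a
     * dword to system memory from its ring.  Read back how far the engine
     * got (engine specific, hypothetical helper) and store it into the
     * shared seqno page on the hardware's behalf, then wake waiters. */
    static void dumb_engine_fence_irq(struct hw_timeline *hwt)
    {
        u32 seqno = dumb_engine_read_completed_seqno(hwt);

        WRITE_ONCE(*hwt->seq_page, seqno);  /* single writer, plain kernel page */
        timeline_signal(hwt);
    }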
The only tricky part in this is when the device is unloaded and the driver is removing itself: it obviously needs to synchronize with anyone possibly waiting on it and possibly reading the page. But truly this is not that hard to solve.
So tell me, once the above is clear, what kind of scary thing can happen when a cpu or a device _reads_ a kernel page?
So from that pov (presuming I didn't miss anything) your proposal is identical to what we have, minus some different color choices (like where to place the callback queue).
No callback is the mantra here, and instead of bolting free-living fences to buffer objects, they are associated with a timeline, which means you do not need to go over all buffer objects to know what you need to wait for.
Ok, then I guess I didn't understand that part of your proposal. Can you please elaborate a bit more on how you want to synchronize multiple drivers accessing a dma-buf object, and what piece of state we need to associate to the dma-buf to make this happen?
The beauty of it is that you associate zilch to the buffer. So for the existing cs ioctl, where implicit synchronization is the rule, it enforces mandatory synchronization across all hw timelines on which a buffer shows up:

    for_each_buffer_in_cmdbuffer(buffer, cmdbuf) {
        if (!cmdbuf_write_to_buffer(buffer, cmdbuf))
            continue;
        for_each_process_sharing_buffer(buffer, process) {
            schedule_fence(process->implicit_timeline, cmdbuf->fence);
        }
    }
Next time another process uses the current ioctl with implicit sync, it will synchronize with the last fence for any shared resource. This might sound bad, but truly, as it is right now, this is already how it happens (at least for radeon).
So now, why is this better than include/linux/fence.h? Because there you can only link a single fence to a buffer, and even if you are ready to waste memory and allow linking several fences to each buffer, you would need userspace to start tracking, for each buffer, which fence it cares about. Basically, implementing android sync points (at least as I understand them) is hard and error prone, if not outright impossible, with the fence code.
On the contrary, with explicit timelines (i.e. once you expose timelines to userspace and allow scheduling against a user timeline) you can do crazy stuff. For a single process, for instance:

    (htimeline1, seqno1) = dev_A_hwblock_X_cmdbuf_ioctl(cmdbuf1, utimeline_p0, buf0)
    (htimeline2, seqno2) = dev_A_hwblock_Y_cmdbuf_ioctl(cmdbuf2, utimeline_p0, buf0)
    (htimeline3, seqno3) = dev_B_hwblock_X_cmdbuf_ioctl(cmdbuf3, utimeline_p0, buf0)
    (htimeline4, seqno4) = dev_C_hwblock_Z_cmdbuf_ioctl(cmdbuf4, utimeline_p0, buf0)
    (htimeline1, seqno5) = dev_A_hwblock_X_cmdbuf_ioctl(cmdbuf5, utimeline_p0, buf0)
    timeline_sync(utimeline_p0, (htimeline1, seqno5), (htimeline3, seqno3))
    (htimeline1, seqno6) = dev_A_hwblock_X_cmdbuf_ioctl(cmdbuf6, utimeline_p0, buf0)
What this allows is for cmdbuf6 to wait on cmdbuf1, cmdbuf3 and cmdbuf5, and to completely ignore whatever cmdbuf4 is doing, as a don't-care. Truly, only userspace knows the kind of interdependencies that can exist. And cmdbuf1, cmdbuf2, cmdbuf3, cmdbuf4 and cmdbuf5 will/can all execute concurrently.
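If this were ever turned into an actual uapi, the sync request might be shaped roughly like the following; every type and field name here is invented purely to illustrate the data involved, nothing more:

    #include <linux/types.h>

    /* Hypothetical uapi sketch: one (hw timeline, seqno) point, e.g.
     * (htimeline1, seqno5) or (htimeline3, seqno3) from the example above. */
    struct drm_timeline_point {
        __u32 hw_timeline_id;
        __u32 pad;
        __u64 seqno;
    };

    /* Hypothetical uapi sketch: "make this user timeline wait for these
     * points before anything scheduled against it afterwards runs". */
    struct drm_timeline_sync {
        __u32 utimeline_handle;   /* the user/pipeline timeline */
        __u32 num_points;
        __u64 points_ptr;         /* userspace pointer to drm_timeline_point[] */
    };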
Now for inter-process sync you need the processes to communicate with each other, and one process can do:

    timeline_inter_sync(utimeline1_p1, utimeline1_p0, (htimeline4, seqno4))
    (htimeline1, seqno1_p1) = dev_A_hwblock_X_cmdbuf_ioctl(cmdbuf1_p1, utimeline1_p1, buf0)
Then cmdbuf1_p1 will not execute until cmdbuf4, scheduled on htimeline4 by the user timeline of process 0, is done. It will not care at all about any of the other command buffers using buf0 in process 0.
So now, truly, I do not see how you can claim to allow such things, which is what android sync points seem to allow, with a fence associated to a buffer.
Hope this makes it clear.
Cheers, Jérôme
On Wed, Aug 13, 2014 at 01:07:20PM -0400, Jerome Glisse wrote:
Let me make this crystal clear: this must be a valid kernel page that has a valid kernel mapping for the lifetime of the device. Hence there is no access to mmio space or anything, just a regular kernel page. If we can not rely on that, this is a sad world.
That being said, yes, I am aware that some devices are incapable of writing to such a page. For that dumb hardware, what you need to do is have the irq handler write to this page on behalf of the hardware. But I would like to know of any hardware that can not write a single dword from its ring buffer.
The only tricky part in this is when the device is unloaded and the driver is removing itself: it obviously needs to synchronize with anyone possibly waiting on it and possibly reading the page. But truly this is not that hard to solve.
So tell me, once the above is clear, what kind of scary thing can happen when a cpu or a device _reads_ a kernel page?
It's not reading it, it's making sense of what you read. In i915 we had exactly the (timeline, seqno) value pair design for fences for a long time, and we're switching away from it since it stops working when you have preemption and scheduler. Or at least it gets really interesting to interpret the data written into the page.
So I don't want to expose that to other drivers if we decided that exposing this internally is a stupid idea.
Next time another process uses the current ioctl with implicit sync, it will synchronize with the last fence for any shared resource. This might sound bad, but truly, as it is right now, this is already how it happens (at least for radeon).
Well i915 is a lot better than that. And I'm not going to implement some special-case for dma-buf shared buffers just because radeon sucks and wants to enforce that suckage on everyone else.
So let's cut this short: If you absolutely insist I guess we could ditch the callback stuff from fences, but I really don't see the problem with radeon just not using that and then being happy. We can easily implement a bit of insulation code _just_ for radeon so that the only thing radeon does is wake up a process (which then calls the callback if it's something special).
Otoh I don't care about what ttm and radeon do, for i915 the important stuff is integration with android syncpts and being able to do explicit fencing for e.g. svm stuff. We can do that with what's merged in 3.17 and I expect that those patches will land in 3.18, at least the internal integration.
It would be cool if we could get tear-free optimus working on desktop linux, but that flat out doesn't pay my bills here. So I think I'll let you guys figure this out yourself.
Cheers, Daniel
On Thu, Aug 14, 2014 at 11:08:34AM +0200, Daniel Vetter wrote:
On Wed, Aug 13, 2014 at 01:07:20PM -0400, Jerome Glisse wrote:
Let me make this crystal clear this must be a valid kernel page that have a valid kernel mapping for the lifetime of the device. Hence there is no access to mmio space or anything, just a regular kernel page. If can not rely on that this is a sad world.
That being said, yes i am aware that some device incapacity to write to such a page. For those dumb hardware what you need to do is have the irq handler write to this page on behalf of the hardware. But i would like to know any hardware than can not write a single dword from its ring buffer.
The only tricky part in this, is when device is unloaded and driver is removing itself, it obviously need to synchronize itself with anyone possibly waiting on it and possibly reading. But truly this is not that hard to solve.
So tell me once the above is clear what kind of scary thing can happen when cpu or a device _read_ a kernel page ?
It's not reading it, it's making sense of what you read. In i915 we had exactly the (timeline, seqno) value pair design for fences for a long time, and we're switching away from it since it stops working when you have preemption and scheduler. Or at least it gets really interesting to interpret the data written into the page.
So I don't want to expose that to other drivers if we decided that exposing this internally is a stupid idea.
I am well aware of that, but context scheduling really is what the timeline I am talking about is. The user timeline should be considered like a single cpu thread onto which operations involving different hw are scheduled. You can have the hw switch from one timeline to another, and a seqno is only meaningful per timeline.

Preemption and scheduling are bound to happen on the gpu, and we will want to integrate with the core scheduler to manage the time slices allocated to processes, but then we need the concept of a thread in which operations on the same hw block are processed in a linear fashion while still allowing concurrency with other hw blocks.
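A minimal sketch of the per-user-timeline bookkeeping being described; none of these names are an existing API, they are just an illustration of "a seqno is only meaningful per timeline":

    #include <linux/types.h>
    #include <linux/wait.h>

    struct user_timeline {
        u32 next_seqno;          /* next point handed out on this timeline */
        u32 signaled_seqno;      /* last point known to have completed */
        wait_queue_head_t wq;    /* cpu waiters blocked on this timeline */
    };

    struct timeline_fence {
        struct user_timeline *tl;
        u32 seqno;               /* only meaningful relative to tl */
    };

    /* A fence is done once its own timeline has advanced past it; comparing
     * seqnos from two different timelines is meaningless. (Wrap handling is
     * omitted here for brevity.) */
    static bool timeline_fence_signaled(const struct timeline_fence *f)
    {
        return f->tl->signaled_seqno >= f->seqno;
    }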
So from that pov (presuming I didn't miss anything) your proposal is identical to what we have, minor some different color choices (like where to place the callback queue).
No callback is the mantra here, and instead of bolting free living fence to buffer object, they are associated with timeline which means you do not need to go over all buffer object to know what you need to wait for.
Ok, then I guess I didn't understand that part of your the proposal. Can you please elaborate a bit more how you want to synchronize mulitple drivers accessing a dma-buf object and what piece of state we need to associate to the dma-buf to make this happen?
Beauty of it you associate ziltch to the buffer. So for existing cs ioctl where the implicit synchronization is the rule it enforce mandatory synchronization accross all hw timeline on which a buffer shows up : for_each_buffer_in_cmdbuffer(buffer, cmdbuf) { if (!cmdbuf_write_to_buffer(buffer, cmdbuf)) continue; for_each_process_sharing_buffer(buffer, process) { schedule_fence(process->implicit_timeline, cmdbuf->fence) } }
Next time another process use current ioctl with implicit sync it will synch with the last fence for any shared resource. This might sounds bad but truely as it is right now this is already how it happens (at least for radeon).
Well i915 is a lot better than that. And I'm not going to implement some special-case for dma-buf shared buffers just because radeon sucks and wants to enforce that suckage on everyone else.
I guess I am having a hard time expressing myself. What I am saying here is that implicit synchronization sucks because it has to assume the worst case, and that is what the current code does, and I am sure intel is doing something similar with today's code.

Explicit synchronization allows more flexibility, but the fence code as it is designed does not allow going fully down that line, because it associates fences with buffer objects, which is the biggest shortcoming of implicit sync.
So let's cut this short: If you absolutely insist I guess we could ditch the callback stuff from fences, but I really don't see the problem with radeon just not using that and then being happy. We can easily implement a bit of insulation code _just_ for radeon so that the only thing radeon does is wake up a process (which then calls the callback if it's something special).
Like I said, feel free to ignore me. I just genuinely want to have the best solution inside the linux kernel, and I do think that the fence/callback/buffer association is not that solution. I tried to explain why, but I might be failing or missing something.
Otoh I don't care about what ttm and radeon do, for i915 the important stuff is integration with android syncpts and being able to do explicit fencing for e.g. svm stuff. We can do that with what's merged in 3.17 and I expect that those patches will land in 3.18, at least the internal integration.
It would be cool if we could get tear-free optimus working on desktop linux, but that flat out doesn't pay my bills here. So I think I'll let you guys figure this out yourself.
Sad to learn that Intel no longer has any interest in the linux desktop.
Cheers, Jérôme
On Thu, Aug 14, 2014 at 10:23:30AM -0400, Jerome Glisse wrote:
On Thu, Aug 14, 2014 at 11:08:34AM +0200, Daniel Vetter wrote:
On Wed, Aug 13, 2014 at 01:07:20PM -0400, Jerome Glisse wrote:
Let me make this crystal clear this must be a valid kernel page that have a valid kernel mapping for the lifetime of the device. Hence there is no access to mmio space or anything, just a regular kernel page. If can not rely on that this is a sad world.
That being said, yes i am aware that some device incapacity to write to such a page. For those dumb hardware what you need to do is have the irq handler write to this page on behalf of the hardware. But i would like to know any hardware than can not write a single dword from its ring buffer.
The only tricky part in this, is when device is unloaded and driver is removing itself, it obviously need to synchronize itself with anyone possibly waiting on it and possibly reading. But truly this is not that hard to solve.
So tell me once the above is clear what kind of scary thing can happen when cpu or a device _read_ a kernel page ?
It's not reading it, it's making sense of what you read. In i915 we had exactly the (timeline, seqno) value pair design for fences for a long time, and we're switching away from it since it stops working when you have preemption and scheduler. Or at least it gets really interesting to interpret the data written into the page.
So I don't want to expose that to other drivers if we decided that exposing this internally is a stupid idea.
I am well aware of that, but context scheduling really is what the timeline i talk about is. The user timeline should be consider like a single cpu thread on to which operation involving different hw are scheduled. You can have the hw switch from one timeline to another and a seqno is only meaningful per timeline.
The whole preemption and scheduling is something bound to happen on gpu and we will want to integrate with core scheduler to manage time slice allocated to process, but the we need the concept of thread in which operation on same hw block are processed in a linear fashion but still allowing concurrency with other hw block.
Well yeah, it's just that a gpu thread doesn't have a lot to do with a cpu process, at least in the current drm world. It would be nice to make that a bit more cross-driver for management, but we didn't even manage to pull that off for gpu memory. So I don't think it'll happen anytime soon.
So from that pov (presuming I didn't miss anything) your proposal is identical to what we have, minor some different color choices (like where to place the callback queue).
No callback is the mantra here, and instead of bolting free living fence to buffer object, they are associated with timeline which means you do not need to go over all buffer object to know what you need to wait for.
Ok, then I guess I didn't understand that part of your the proposal. Can you please elaborate a bit more how you want to synchronize mulitple drivers accessing a dma-buf object and what piece of state we need to associate to the dma-buf to make this happen?
Beauty of it you associate ziltch to the buffer. So for existing cs ioctl where the implicit synchronization is the rule it enforce mandatory synchronization accross all hw timeline on which a buffer shows up : for_each_buffer_in_cmdbuffer(buffer, cmdbuf) { if (!cmdbuf_write_to_buffer(buffer, cmdbuf)) continue; for_each_process_sharing_buffer(buffer, process) { schedule_fence(process->implicit_timeline, cmdbuf->fence) } }
Next time another process use current ioctl with implicit sync it will synch with the last fence for any shared resource. This might sounds bad but truely as it is right now this is already how it happens (at least for radeon).
Well i915 is a lot better than that. And I'm not going to implement some special-case for dma-buf shared buffers just because radeon sucks and wants to enforce that suckage on everyone else.
I guess i am having hard time to express myself, what i am saying here is that implicit synchronization sucks because it has to assume the worst case and this is what current code does, and i am sure intel is doing something similar with today code.
Explicit synchronization allow more flexibility but fence code as it is designed does not allow to fully do down that line. By associating fence to buffer object which is the biggest shortcoming of implicit sync.
I don't see how your statement that implicit sync sucks maps to the reality of i915, where it doesn't. The only thing you can't do is share buffer objects that you suballocate. And userspace is always allowed to lie about the access mode (if it lies too badly it will simply read the zero page), so it can forgo implicit sync.
So let's cut this short: If you absolutely insist I guess we could ditch the callback stuff from fences, but I really don't see the problem with radeon just not using that and then being happy. We can easily implement a bit of insulation code _just_ for radeon so that the only thing radeon does is wake up a process (which then calls the callback if it's something special).
Like i said feel free to ignore me. I am just genuinely want to have the best solution inside the linux kernel and i do think that fence and callback and buffer association is not that solution. I tried to explain why but i might be failing or missing something.
Otoh I don't care about what ttm and radeon do, for i915 the important stuff is integration with android syncpts and being able to do explicit fencing for e.g. svm stuff. We can do that with what's merged in 3.17 and I expect that those patches will land in 3.18, at least the internal integration.
It would be cool if we could get tear-free optimus working on desktop linux, but that flat out doesn't pay my bills here. So I think I'll let you guys figure this out yourself.
Sad to learn that Intel no longer have any interest in the linux desktop.
Well, the current Linux DRI stack works fairly well on Linux with Intel I think, so I don't think anyone can claim we don't care. It only starts to suck if you mix in prime.
And as a matter of fact the current Linux DRI model uses implicit sync, so that's what dma-buf currently implements. Of course we can implement an explicit sync gpu model (and it will come for at least opencl and similar stuff), and probably also for interactions with v4l, but I don't see any patches to rewrite the current X stack we have.
And yeah we kinda optimized that one a bit on Intel since we actually care about X ;-)
What I'll stop caring about is how exactly radeon integrates into this, not the linux desktop in general. -Daniel
On Thu, Aug 14, 2014 at 05:55:51PM +0200, Daniel Vetter wrote:
On Thu, Aug 14, 2014 at 10:23:30AM -0400, Jerome Glisse wrote:
On Thu, Aug 14, 2014 at 11:08:34AM +0200, Daniel Vetter wrote:
On Wed, Aug 13, 2014 at 01:07:20PM -0400, Jerome Glisse wrote:
Let me make this crystal clear this must be a valid kernel page that have a valid kernel mapping for the lifetime of the device. Hence there is no access to mmio space or anything, just a regular kernel page. If can not rely on that this is a sad world.
That being said, yes i am aware that some device incapacity to write to such a page. For those dumb hardware what you need to do is have the irq handler write to this page on behalf of the hardware. But i would like to know any hardware than can not write a single dword from its ring buffer.
The only tricky part in this, is when device is unloaded and driver is removing itself, it obviously need to synchronize itself with anyone possibly waiting on it and possibly reading. But truly this is not that hard to solve.
So tell me once the above is clear what kind of scary thing can happen when cpu or a device _read_ a kernel page ?
It's not reading it, it's making sense of what you read. In i915 we had exactly the (timeline, seqno) value pair design for fences for a long time, and we're switching away from it since it stops working when you have preemption and scheduler. Or at least it gets really interesting to interpret the data written into the page.
So I don't want to expose that to other drivers if we decided that exposing this internally is a stupid idea.
I am well aware of that, but context scheduling really is what the timeline i talk about is. The user timeline should be consider like a single cpu thread on to which operation involving different hw are scheduled. You can have the hw switch from one timeline to another and a seqno is only meaningful per timeline.
The whole preemption and scheduling is something bound to happen on gpu and we will want to integrate with core scheduler to manage time slice allocated to process, but the we need the concept of thread in which operation on same hw block are processed in a linear fashion but still allowing concurrency with other hw block.
Well yeah, it's just that a gpu thread doesn't have a lot to do with a cpu process, at least in the current drm world. It would be nice make that a bit more cross-driver for management, but we didn't even manage to pull that of for gpu memory. So I don't think it'll happen anytime soonish.
So from that pov (presuming I didn't miss anything) your proposal is identical to what we have, minus some different color choices (like where to place the callback queue).
No callback is the mantra here, and instead of bolting free living fence to buffer object, they are associated with timeline which means you do not need to go over all buffer object to know what you need to wait for.
Ok, then I guess I didn't understand that part of your the proposal. Can you please elaborate a bit more how you want to synchronize mulitple drivers accessing a dma-buf object and what piece of state we need to associate to the dma-buf to make this happen?
Beauty of it you associate ziltch to the buffer. So for existing cs ioctl where the implicit synchronization is the rule it enforce mandatory synchronization accross all hw timeline on which a buffer shows up : for_each_buffer_in_cmdbuffer(buffer, cmdbuf) { if (!cmdbuf_write_to_buffer(buffer, cmdbuf)) continue; for_each_process_sharing_buffer(buffer, process) { schedule_fence(process->implicit_timeline, cmdbuf->fence) } }
Next time another process use current ioctl with implicit sync it will synch with the last fence for any shared resource. This might sounds bad but truely as it is right now this is already how it happens (at least for radeon).
Well i915 is a lot better than that. And I'm not going to implement some special-case for dma-buf shared buffers just because radeon sucks and wants to enforce that suckage on everyone else.
I guess i am having hard time to express myself, what i am saying here is that implicit synchronization sucks because it has to assume the worst case and this is what current code does, and i am sure intel is doing something similar with today code.
Explicit synchronization allow more flexibility but fence code as it is designed does not allow to fully do down that line. By associating fence to buffer object which is the biggest shortcoming of implicit sync.
I don't see how your statement that implicit sync sucks maps to the reality of i915 where it doesn't. The only thing you can't do is shared buffer objects that you suballocate. And userspace is always allowed to lie about the access mode (if it lies too badly it will simply read zero page) so can forgoe implicit sync.
Sucks because you cannot do weird synchronization like the one I depicted in another mail in this thread, and as long as the cmdbuf ioctl does not give you a fence|syncpt you cannot do such a thing cleanly in a non-hackish way.

Sucks because you have a fence object per buffer object and thus overhead grows with the number of objects. Not even mentioning the fence lifetime issues.

Sucks because sub-buffer allocation is just one of many tricks that cannot be achieved properly and cleanly with implicit sync.

...

Having code that works around or alleviates some of this is in no way a testimony that it's the best solution. I do believe explicit sync to be superior; in terms of use cases it allows way more freedom, while the only drawback I see is that you have to put some trust into userspace.

So yes, implicit sync sucks, and that does map to the i915 reality as well.
So let's cut this short: If you absolutely insist I guess we could ditch the callback stuff from fences, but I really don't see the problem with radeon just not using that and then being happy. We can easily implement a bit of insulation code _just_ for radeon so that the only thing radeon does is wake up a process (which then calls the callback if it's something special).
Like i said feel free to ignore me. I am just genuinely want to have the best solution inside the linux kernel and i do think that fence and callback and buffer association is not that solution. I tried to explain why but i might be failing or missing something.
Otoh I don't care about what ttm and radeon do, for i915 the important stuff is integration with android syncpts and being able to do explicit fencing for e.g. svm stuff. We can do that with what's merged in 3.17 and I expect that those patches will land in 3.18, at least the internal integration.
It would be cool if we could get tear-free optimus working on desktop linux, but that flat out doesn't pay my bills here. So I think I'll let you guys figure this out yourself.
Sad to learn that Intel no longer have any interest in the linux desktop.
Well the current Linux DRI stack works fairly well on Linux with Intel I think, so I don't think anyone can claim we don't care. It's only starts to suck if you mix in prime.
And as a matter of fact the current Linux DRI model uses implicit sync, so that's what dma-buf currently implements. Of course we can implement an explicit sync gpu model (and it will come for at least opencl and similar stuff), and probably also for interactions with v4l, but I don't see any patches to rewrite the current X stack we have.
And yeah we kinda optimized that one a bit on Intel since we actually care about X ;-)
What I'll stop caring about is how exactly radeon integrates into this, not the linux desktop in general.
-Daniel
Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse j.glisse@gmail.com wrote:
Sucks because you can not do weird synchronization like one i depicted in another mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt you can not such thing cleanly in non hackish way.
Actually i915 soon will do just that.
Sucks because you have a fence object per buffer object and thus overhead grow with the number of objects. Not even mentioning fence lifetime issue.
Sucks because sub-buffer allocation is just one of many tricks that can not be achieved properly and cleanly with implicit sync.
...
Well, I heard all those reasons and I'm well aware of them. The problem is that with current hardware the kernel needs to know, for each buffer, how long it needs to be kept around, since the hw just can't do page faulting. Yeah, you can pin them, but for a uma design that doesn't go down well with folks.
The other problem is that the Linux Desktop I don't seem to care about any more kinda relies on implicit syncing right now, so we better keep that working fairly well. Of course we could dream up a shiny new world where all of the Linux desktop does explicit syncing, but that world doesn't exist right now. I mean really if you want to right away throw implicit syncing overboard that doesn't bode well for the current linux desktop.
So I don't understand all the implicit syncing bashing. It's here, and it'll stay for a while longer whether you like it or not ...
Of course that doesn't mean we (intel here) won't support explicit syncing too, and I really don't see a conflict here when mixing and matching these two approaches. -Daniel
Having code that work around or alieviate some of this, is in no way a testimony that it's the best solution. I do believe explicit sync to be superior in use case it allows way more freedom while the only drawback i see is that you have to put some trust into userspace.
So yes implicit sync sucks and it does map to i915 reality as well.
On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse j.glisse@gmail.com wrote:
Sucks because you can not do weird synchronization like one i depicted in another mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt you can not such thing cleanly in non hackish way.
Actually i915 can soon will do that that.
So you will return a fence|syncpoint with each cmdbuf ioctl?
Sucks because you have a fence object per buffer object and thus overhead grow with the number of objects. Not even mentioning fence lifetime issue.
Sucks because sub-buffer allocation is just one of many tricks that can not be achieved properly and cleanly with implicit sync.
...
Well I heard all those reasons and I'm well of aware of them. The problem is that with current hardware the kernel needs to know for each buffer how long it needs to be kept around since hw just can't do page faulting. Yeah you can pin them but for an uma design that doesn't go down well with folks.
I am not thinking of fancy hw; on the contrary, I thought about all this with the crappiest hw I could think of in mind.

Yes, you can get rid of fences and not have to pin memory with current hw. What matters for unpinning is to know that all hw blocks are done using the memory. This is easily achievable with your beloved seqno. Have one seqno per driver (one driver can have different blocks: 3d, video decoding, crtc, ...); each time a buffer is used as part of a command on one block, increment the common seqno and tag the buffer with that number. Have each hw block write the latest seqno that is done to a per-block location. Now, to determine whether a buffer is done, compare the buffer seqno with the max of all the signaled seqnos of all blocks.

Cost: 1 uint32 per buffer and a simple if, without locking, to check the status of a buffer.

Yes, preemption and gpu scheduling would break such a scheme, but my point is that when you have such a gpu you want to implement a proper solution, which of course requires quite some work across the stack. So the past can live on, but the future needs to get its act together.
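A minimal sketch of the tagging half of this scheme, assuming one global counter per driver and a per-block signaled location; every name here is illustrative (loosely matching the gseqno pseudo code later in the thread), not a real API:

    /* Called when a command that uses buf is scheduled on a given hw block. */
    static void tag_buffer_on_submit(struct gpu_device *dev,
                                     struct hwblock *hwblock,
                                     struct buffer *buf)
    {
        /* One driver-wide counter, bumped for every command that touches a
         * buffer on any block (would need to be atomic in practice). */
        u32 gseqno = ++dev->gseqno;

        buf->gseqno = gseqno;                    /* last use of this buffer */
        hwblock->gseqno_last_scheduled = gseqno; /* last work queued on block */

        /* Ask the command stream to write gseqno to this block's dedicated
         * location once the command completes. */
        hwblock_emit_seqno_write(hwblock, gseqno);
    }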
The other problem is that the Linux Desktop I don't seem to care about any more kinda relies on implicit syncing right now, so we better keep that working fairly well. Of course we could dream up a shiny new world where all of the Linux desktop does explicit syncing, but that world doesn't exist right now. I mean really if you want to right away throw implicit syncing overboard that doesn't bode well for the current linux desktop.
Again I fail at expressing myself. I am not saying to throw things overboard; I am well aware of the current reliance on implicit fencing. I am saying that if fence wants to be this new thing that should allow explicit fencing in the future, then it had better be done correctly in the first place.
So I don't understand all the implicit syncing bashing. It's here, and it'll stay for a while longer whether you like it or not ...
I am saying this is where we are and it sucks for a number of reasons. Then, looking at fence, I am saying it tries to go in the right direction but does crazy things that I am convinced we will regret. In other words, if we ever get to explicit fencing, we had better start on the right path with the right tools. Moreover, I am saying that this can be done without breaking the implicit sync we have today.
Of course that doesn't mean we (intel here) won't support explicit syncing too, and I really don't see a conflict here when mixing and matching these two approaches.
Again I fail to express myself. I am not saying there is a conflict. I am saying we had better take a path which allows going all the way with explicit fencing while still allowing a less optimal use for an implicit sync model.

My point is that the fence code proposed here keeps the worst thing about the implicit fencing we have today. This can be done differently, in what I believe to be a better way. And this different approach still allows having implicit sync for existing userspace.
Cheers, Jérôme
-Daniel
Having code that work around or alieviate some of this, is in no way a testimony that it's the best solution. I do believe explicit sync to be superior in use case it allows way more freedom while the only drawback i see is that you have to put some trust into userspace.
So yes implicit sync sucks and it does map to i915 reality as well.
-- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On 14-08-14 21:15, Jerome Glisse wrote:
On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse j.glisse@gmail.com wrote:
Sucks because you can not do weird synchronization like one i depicted in another mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt you can not such thing cleanly in non hackish way.
Actually i915 can soon will do that that.
So you will return fence|syncpoint with each cmdbuf_ioctl ?
It might, soon. There have been patches on the ml about it. It can create a userspace android fence backed by a kernel dma fence. And it will be created like any other userspace android fence. ;-)
Yet even with that, it will continue to support the implicit sync model since they're not mutually exclusive.
I also fail to understand why you think a fence should be associated with a buffer object. It's the other way around. TTM currently requires buffer objects to be fenced off to protect against eviction.
reservation_object is used for this, and by sharing the reservation_object pointer with a dma-buf you get cross-device synchronization.
It has a bunch of helpers to make common operations easy; see include/linux/reservation.h and drivers/dma-buf/reservation.c. It also allows multiple readers simultaneously across any number of devices. I intend to use this in nouveau.

But there's no requirement to use reservation_objects apart from the fact that that's how ttm currently works, and for implicit syncing in dma-buf. If you don't need either, go crazy with fence and write your own mechanism on top of it. Although with android sync and TTM I think I've handled all the common use cases.
~Maarten
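A rough sketch of how a driver might record its command-submission fence in the reservation object Maarten describes, assuming the 3.17-era reservation_object helpers (reservation_object_add_excl_fence and friends); treat the exact names and locking as an approximation rather than gospel:

    #include <linux/fence.h>
    #include <linux/reservation.h>

    static int track_cs_fence(struct reservation_object *resv,
                              struct fence *fence, bool is_write)
    {
        int ret = 0;

        /* resv->lock is a ww_mutex; callers normally take it as part of the
         * buffer reservation step, shown standalone here for brevity. */
        ww_mutex_lock(&resv->lock, NULL);
        if (is_write) {
            /* A writer replaces the exclusive fence. */
            reservation_object_add_excl_fence(resv, fence);
        } else {
            /* Readers go into the shared fence list, so multiple devices
             * can read the buffer concurrently. */
            ret = reservation_object_reserve_shared(resv);
            if (!ret)
                reservation_object_add_shared_fence(resv, fence);
        }
        ww_mutex_unlock(&resv->lock);
        return ret;
    }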
On Thu, Aug 14, 2014 at 09:40:08PM +0200, Maarten Lankhorst wrote:
On 14-08-14 21:15, Jerome Glisse wrote:
On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse j.glisse@gmail.com wrote:
Sucks because you can not do weird synchronization like one i depicted in another mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt you can not such thing cleanly in non hackish way.
Actually i915 can soon will do that that.
So you will return fence|syncpoint with each cmdbuf_ioctl ?
It might, soon. There have been patches on the ml about it. It can create a userspace android fence backed by a kernel dma fence. And it will be created like any other userspace android fence. ;-)
Android fences are not, in my mind, a nice thing :)
Yet even with that, it will continue to support the implicit sync model since they're not mutually exclusive.
Again I have failed to express myself. I have tried repeatedly to say that what I proposed was not mutually exclusive.
I also fail to understand why you think a fence should be associated with a buffer object. It's the other way around. TTM currently requires buffer objects to be fenced off to protect against eviction.
Again I fail to properly explain myself. I said no fence should be associated with a buffer nor a dma-buf, whether shared or not. The only thing that you really need for a buffer is a seqno, and that is only of use for binding/unbinding an object from GART/GPU address space/... So no, I am not saying a fence should be associated with a buffer. I am saying a fence should be associated with each cmdbuf, and there should be no link whatsoever between fences and buffer objects, i.e. you do not need to store a pointer to a fence inside the dma-buf or any buffer object structure.

I am well aware of how ttm works, and I am saying it can be replaced with a simpler seqno.
reservation_object is used for this, and by sharing the reservation_object pointer with a dma-buf you get cross-device synchronization.
My point is that dma-buf should not have a reservation_object pointer; it is useless and counter-productive and only adds complexity. You do not care about buffers, you care about synchronizing cmdbufs/hw blocks with each other. A buffer is just what those blocks consume. In other words, reservation_object and fence are useless and only complexify things for no added value; on the contrary, they are not as flexible as fences associated with cmdbufs.
It has a bunch of helpers to make common operations easy, see
include/linux/reservation.h and drivers/dma-buf/reservation.c
It also allows multiple readers simultaneously across any number of devices. I intend to use this in nouveau.
As does what I am talking about. The reservation stuff is more a limiting factor than a helper in my eyes.
But there's no requirement to use reservation_object's apart from that's how ttm currently works, and for implicit syncing in dma-buf. If you don't need either go crazy with fence and write your own mechanism on top of fence. Although with android sync and TTM I think I handled all common usecases.
My point is that implicit syncing can be implemented in a saner way that would also allow implementing explicit syncing with no restrictions and only room for extra optimisation. By enforcing cpu involvement in hw block sync, you are limiting what can be done, and clever hardware is forced to use a sub-optimal solution.
~Maarten
On Thu, Aug 14, 2014 at 9:56 PM, Jerome Glisse j.glisse@gmail.com wrote:
Android fence are not in my mind a nice thing :)
Well, I'll have a very close look at the proposed ioctl interface on top of these fence fds to make sure it's sane. But it will come, that's pretty much for sure. And similar fence integration will very likely happen in the modeset world, with acquire/release fences/syncpts around atomic updates and for vblank events. The modeset integration will btw not require any driver changes, at least not the release fences.
And like Maarten says we'll keep that working together with implicit syncing for the Linux Desktop ... -Daniel
On Thu, Aug 14, 2014 at 9:15 PM, Jerome Glisse j.glisse@gmail.com wrote:
Cost 1 uint32 per buffer and simple if without locking to check status of a buffer.
Yeah, well, except it doesn't, and that's why we're switching to full-blown fence objects internally instead of smacking seqno values all over the place. At least in i915, because I have seen what this "simple" solution looks like if you throw all the complexities into the bin and mix it well. -Daniel
On Thu, Aug 14, 2014 at 11:23:01PM +0200, Daniel Vetter wrote:
On Thu, Aug 14, 2014 at 9:15 PM, Jerome Glisse j.glisse@gmail.com wrote:
Cost 1 uint32 per buffer and simple if without locking to check status of a buffer.
Yeah well except it doesn't and that's why we switch to full blown fence objects internally instead of smacking seqno values all over the place. At least in i915 because I have seen what this "simple" solution looks like if you throw all the complexities into the bin and mix it well.
I am kind of curious what can go wrong here? Unless you have thousands of different hw blocks inside your gpu, all of them with different cmd queues, I do not see any issue.

Note that the global seqno I am talking about is only useful for bind/unbind of buffers in an ideal world of explicit sync, not for synchronization between command buffers. So the pseudo code would be:
    // If buffer_can_unbind(buf, dev) returns true then the buffer is no longer
    // in use and can be unbound. If false you can wait on the device wait queue.
    bool buffer_can_unbind(buf, dev)
    {
        // Is the last gseqno associated with the buffer smaller than the
        // current smallest signaled seqno?
        if (buf->gseqno <= dev->gseqno_cmin)
            return true;
        hw_update_gseqno_min(dev);
        if (buf->gseqno <= dev->gseqno_cmin)
            return true;
        for_each_hw_block(hwblock, dev) {
            // If that hwblock has not signaled a gseqno bigger than the
            // buffer's, and also has work scheduled that might be using the
            // buffer (ie the last scheduled gseqno is greater than the
            // buffer gseqno), then the buffer might still be in use, so
            // assume the worst.
            if (hwblock->gseqno < buf->gseqno &&
                hwblock->gseqno_last_scheduled >= buf->gseqno)
                return false;
            // Otherwise this block is either already past buf->gseqno, or it
            // has nothing in its pipeline that will ever use buf.
            // As an extra optimization one might add a hwblock mask to buf
            // and unmask this hwblock here, so a further wait will not wait
            // on this block, as we know for sure it's not involved.
        }
        // Well, none of the hwblocks is still using that buffer.
        return true;
    }

    hw_update_gseqno_min(dev)
    {
        unsigned nmin = -1;

        for_each_hw_block(hwblock, dev) {
            nmin = min(nmin, hwblock->gseqno);
        }
        // Atomic compare-and-exchange: set the new min only if it's bigger
        // than the current min.
        atomic_cmp_xchg(dev->gseqno_cmin, nmin)
    }
So this is how it looks in my mind: each hwblock writes to its dedicated gseqno location, and for each hwblock you store the last gseqno that was scheduled. There is no need for any lock, and this is outside any other sync mechanism.

For hw with preemption, I assume you then have a hwcontext; well, you use the same code except that now you have a gseqno per context, which means you need to store a seqno per context per buffer. With some trickery this might be avoided, especially if the hw can do an atomic cmp_xchg.
Cheers, Jérôme
-Daniel
Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, Aug 14, 2014 at 07:03:44PM -0400, Jerome Glisse wrote:
On Thu, Aug 14, 2014 at 11:23:01PM +0200, Daniel Vetter wrote:
On Thu, Aug 14, 2014 at 9:15 PM, Jerome Glisse j.glisse@gmail.com wrote:
Cost 1 uint32 per buffer and simple if without locking to check status of a buffer.
Yeah well except it doesn't and that's why we switch to full blown fence objects internally instead of smacking seqno values all over the place. At least in i915 because I have seen what this "simple" solution looks like if you throw all the complexities into the bin and mix it well.
I am kind of curious on what can go wrong here ? Unless you have thousands of different hw block inside your gpu all of them with different cmd queue then i do not see any issue.
See my other reply, but it's because we'll schedule software contexts here, and we need to do it with implicit fencing because we can't just break the existing world. The below is pretty much the design we currently have. -Daniel
Note that this global seqno i am talking is only useful for bind/unbind of buffer in ideal world of explicit sync, not for synchronization btw command buffer. So pseudo code would be :
// if buffer_can_unbind(buf, dev) return true then buffer is no longer in // use and can be unbind. if false you can wait on the device wait queue. bool buffer_can_unbind(buf, dev) { // Is the last gseqno associated with buffer is smaller than current // smallest signaled seqno ? if (buf->gseqno <= dev->gseqno_cmin) return true; hw_update_gseqno_min(dev); if (buf->gseqno <= dev->gseqno_cmin) return true; for_each_hw_block(hwblock, dev) { // If that hwblock has not signaled a gseqno bigger than the // buffer one's and also has work scheduled that might be using // the buffer (ie the last scheduled gseqno is greater than // buffer gseqno). If that's true than buffer might still be // in use so assume the worst. if (hwblock->gseqno < buf->gseqno && hwblock->gseqno_last_scheduled >= buf->gseqno) return false; // This means either this block is already past the buf->gseqno // or that this block have nothing in its pipeline that will ever // use buf. // As an extra optimization one might add a hwblock mask to buf // and unmask wait for that hwblock so further wait will not wait // on this block as we know for sure it's not involve. } // Well none of the hwblock is still using that buffer. return true; }
hw_update_gseqno_min(dev) { unsigned nmin = -1;
for_each_hw_block(hwblock, dev) { nmin = min(nmin, hwblock->gseqno); } // atomic exchange and compage set new min only if it's bigger then // current min atomic_cmp_xchg(dev->gseqno_cmin, nmin) }
So this is how it looks in my mind, each hwblock write to there dedicated
gseqno and for each hwblock you store the last gseqno that was scheduled.
There is no need for any lock and this is outside any other sync mecanism.
For hw with preemption, i assume you have then have a hwcontext, well you use same code except that now you have a gseqno per context which means you need to store a seqno per context per buffer. With some trickery this might be avoided especialy if hw can do atomic cmp_xchg.
Cheers, Jérôme
-Daniel
Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Fri, Aug 15, 2014 at 10:07:39AM +0200, Daniel Vetter wrote:
On Thu, Aug 14, 2014 at 07:03:44PM -0400, Jerome Glisse wrote:
On Thu, Aug 14, 2014 at 11:23:01PM +0200, Daniel Vetter wrote:
On Thu, Aug 14, 2014 at 9:15 PM, Jerome Glisse j.glisse@gmail.com wrote:
Cost 1 uint32 per buffer and simple if without locking to check status of a buffer.
Yeah well except it doesn't and that's why we switch to full blown fence objects internally instead of smacking seqno values all over the place. At least in i915 because I have seen what this "simple" solution looks like if you throw all the complexities into the bin and mix it well.
I am kind of curious on what can go wrong here ? Unless you have thousands of different hw block inside your gpu all of them with different cmd queue then i do not see any issue.
See my other reply, but because we'll schedule software contexts here and we need to do it with implicit fencing because we can't just break the existing world. The below is pretty much the design we currently have.
Software needs the object to be bound to the gpu gart on intel? That's bad. But even with such a thing, you can do it as a secondary lock only for user space.
-Daniel
Note that this global seqno i am talking is only useful for bind/unbind of buffer in ideal world of explicit sync, not for synchronization btw command buffer. So pseudo code would be :
// if buffer_can_unbind(buf, dev) return true then buffer is no longer in // use and can be unbind. if false you can wait on the device wait queue. bool buffer_can_unbind(buf, dev) { // Is the last gseqno associated with buffer is smaller than current // smallest signaled seqno ? if (buf->gseqno <= dev->gseqno_cmin) return true; hw_update_gseqno_min(dev); if (buf->gseqno <= dev->gseqno_cmin) return true; for_each_hw_block(hwblock, dev) { // If that hwblock has not signaled a gseqno bigger than the // buffer one's and also has work scheduled that might be using // the buffer (ie the last scheduled gseqno is greater than // buffer gseqno). If that's true than buffer might still be // in use so assume the worst. if (hwblock->gseqno < buf->gseqno && hwblock->gseqno_last_scheduled >= buf->gseqno) return false; // This means either this block is already past the buf->gseqno // or that this block have nothing in its pipeline that will ever // use buf. // As an extra optimization one might add a hwblock mask to buf // and unmask wait for that hwblock so further wait will not wait // on this block as we know for sure it's not involve. } // Well none of the hwblock is still using that buffer. return true; }
hw_update_gseqno_min(dev) { unsigned nmin = -1;
for_each_hw_block(hwblock, dev) { nmin = min(nmin, hwblock->gseqno); } // atomic exchange and compage set new min only if it's bigger then // current min atomic_cmp_xchg(dev->gseqno_cmin, nmin) }
So this is how it looks in my mind, each hwblock write to there dedicated
gseqno and for each hwblock you store the last gseqno that was scheduled.
There is no need for any lock and this is outside any other sync mecanism.
For hw with preemption, i assume you have then have a hwcontext, well you use same code except that now you have a gseqno per context which means you need to store a seqno per context per buffer. With some trickery this might be avoided especialy if hw can do atomic cmp_xchg.
Cheers, Jérôme
-Daniel
Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch
On Thu, Aug 14, 2014 at 9:15 PM, Jerome Glisse j.glisse@gmail.com wrote:
Yes preemption and gpu scheduling would break such scheme, but my point is that when you have such gpu you want to implement a proper solution. Which of course require quite some work accross the stack. So the past can live on but the future needs to get its acts together.
The other problem is that the Linux Desktop I don't seem to care about any more kinda relies on implicit syncing right now, so we better keep that working fairly well. Of course we could dream up a shiny new world where all of the Linux desktop does explicit syncing, but that world doesn't exist right now. I mean really if you want to right away throw implicit syncing overboard that doesn't bode well for the current linux desktop.
Again i fail at expressing myself. I am saying throw things over board, i am well aware of the current reliance on implicit fencing. I am saying if fence wants to be this new thing that should allow to do explicit fencing in the future than it better be done correctly in the first place.
So I don't understand all the implicit syncing bashing. It's here, and it'll stay for a while longer whether you like it or not ...
I am saying this is where we are and it sucks for a number of reasons, then looking at fence and by looking at fence i am saying this try to go in the right direction but do crazy things that i am convince we will regret. In other word if we ever get to the explicit fence better starts on the right path with the right tool. Moreover i am saying that this can be done without breaking implicit sync we have today.
Of course that doesn't mean we (intel here) won't support explicit syncing too, and I really don't see a conflict here when mixing and matching these two approaches.
Again i fail to express myself. I am not saying there is conflict. I am saying better take a path which allow to go full way with explicit fencing while still allowing a less optimal use for an implicit sync model.
My point is the fence code proposed here, keeps the worst thing about implicit fencing we have today. This can be done differently, in what i believe to be better way. And this different approach stills allow to have have implicit sync for existing userspace.
Well the problem is I can't wait for that shiny new world to finally arrive, I kinda have to work with what's there. Which is ugly, messy, but that's how it is.
So I'm not going to throw away piles of code just because it's ugly and combining it with the new world is a major pain. Instead I'll gobble down the ugly, throw testcases at the problem to make sure it works and ship this frankenstein monster.
If you actually look at development stats we've completely rewritten i915.ko in the past 2 years (again) and will likely continue for a while. But we do that with small (tiny) steps and not with big jumps, because with the current pace and ongoing projects that's plainly too disruptive. So I envy you if you can do this with radeon, but there's no way I can do that with i915, I just have to live with it. -Daniel
On 08/14/2014 09:15 PM, Jerome Glisse wrote:
On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse j.glisse@gmail.com wrote:
Sucks because you can not do weird synchronization like one i depicted in another mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt you can not such thing cleanly in non hackish way.
Actually i915 can soon will do that that.
So you will return fence|syncpoint with each cmdbuf_ioctl ?
Sucks because you have a fence object per buffer object and thus overhead grow with the number of objects. Not even mentioning fence lifetime issue.
Sucks because sub-buffer allocation is just one of many tricks that can not be achieved properly and cleanly with implicit sync.
...
Well I heard all those reasons and I'm well of aware of them. The problem is that with current hardware the kernel needs to know for each buffer how long it needs to be kept around since hw just can't do page faulting. Yeah you can pin them but for an uma design that doesn't go down well with folks.
I am not thinking with fancy hw in mind, on contrary i thought about all this with the crappiest hw i could think of, in mind.
Yes you can get rid of fence and not have to pin memory with current hw. What matter for unpinning is to know that all hw block are done using the memory. This is easily achievable with your beloved seqno. Have one seqno per driver (one driver can have different block 3d, video decoding, crtc, ...) each time a buffer is use as part of a command on one block inc the common seqno and tag the buffer with that number. Have each hw block write the lastest seqno that is done to a per block location. Now to determine is buffer is done compare the buffer seqno with the max of all the signaled seqno of all blocks.
Cost 1 uint32 per buffer and simple if without locking to check status of a buffer.
Hmm? The trivial and first use of fence objects in the linux DRM was triggered by the fact that a 32-bit seqno wraps pretty quickly and a 32-bit solution just can't be made robust. Now a 64-bit seqno will probably be robust for the foreseeable future, but when it comes to implementing that on 32-bit hardware and comparing it to a simple fence object approach...
/Thomas
On Fri, Aug 15, 2014 at 08:54:38AM +0200, Thomas Hellstrom wrote:
On 08/14/2014 09:15 PM, Jerome Glisse wrote:
On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse j.glisse@gmail.com wrote:
Sucks because you can not do weird synchronization like one i depicted in another mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt you can not such thing cleanly in non hackish way.
Actually i915 can soon will do that that.
So you will return fence|syncpoint with each cmdbuf_ioctl ?
Sucks because you have a fence object per buffer object and thus overhead grow with the number of objects. Not even mentioning fence lifetime issue.
Sucks because sub-buffer allocation is just one of many tricks that can not be achieved properly and cleanly with implicit sync.
...
Well I heard all those reasons and I'm well of aware of them. The problem is that with current hardware the kernel needs to know for each buffer how long it needs to be kept around since hw just can't do page faulting. Yeah you can pin them but for an uma design that doesn't go down well with folks.
I am not thinking with fancy hw in mind, on contrary i thought about all this with the crappiest hw i could think of, in mind.
Yes you can get rid of fence and not have to pin memory with current hw. What matter for unpinning is to know that all hw block are done using the memory. This is easily achievable with your beloved seqno. Have one seqno per driver (one driver can have different block 3d, video decoding, crtc, ...) each time a buffer is use as part of a command on one block inc the common seqno and tag the buffer with that number. Have each hw block write the lastest seqno that is done to a per block location. Now to determine is buffer is done compare the buffer seqno with the max of all the signaled seqno of all blocks.
Cost 1 uint32 per buffer and simple if without locking to check status of a buffer.
Hmm? The trivial and first use of fence objects in the linux DRM was triggered by the fact that a 32-bit seqno wraps pretty quickly and a 32-bit solution just can't be made robust. Now a 64-bit seqno will probably be robust for forseeable future, but when it comes to implement that on 32-bit hardware and compare it to a simple fence object approach,
Using the same kind of arithmetic as used for jiffies would do it, provided there is a check that we never let some object pass above a certain age.
/Thomas
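For illustration, the jiffies-style trick being referred to is a wrap-safe comparison of 32-bit sequence numbers; a minimal sketch (seqno_after is a made-up name, mirroring the kernel's time_after()):

    #include <linux/types.h>

    /* Valid as long as the two values are less than 2^31 apart, which is why
     * objects must not be allowed to age past that bound before being
     * checked. */
    static inline bool seqno_after(u32 a, u32 b)
    {
        return (s32)(b - a) < 0;
    }

    /* Example: this holds across a wrap of the counter,
     * seqno_after(0x00000002, 0xfffffffe) is true. */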
On 08/15/2014 04:52 PM, Jerome Glisse wrote:
On Fri, Aug 15, 2014 at 08:54:38AM +0200, Thomas Hellstrom wrote:
On 08/14/2014 09:15 PM, Jerome Glisse wrote:
On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse j.glisse@gmail.com wrote:
Sucks because you can not do weird synchronization like one i depicted in another mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt you can not such thing cleanly in non hackish way.
Actually i915 can soon will do that that.
So you will return fence|syncpoint with each cmdbuf_ioctl ?
Sucks because you have a fence object per buffer object and thus overhead grow with the number of objects. Not even mentioning fence lifetime issue.
Sucks because sub-buffer allocation is just one of many tricks that can not be achieved properly and cleanly with implicit sync.
...
Well I heard all those reasons and I'm well of aware of them. The problem is that with current hardware the kernel needs to know for each buffer how long it needs to be kept around since hw just can't do page faulting. Yeah you can pin them but for an uma design that doesn't go down well with folks.
I am not thinking with fancy hw in mind, on contrary i thought about all this with the crappiest hw i could think of, in mind.
Yes you can get rid of fence and not have to pin memory with current hw. What matter for unpinning is to know that all hw block are done using the memory. This is easily achievable with your beloved seqno. Have one seqno per driver (one driver can have different block 3d, video decoding, crtc, ...) each time a buffer is use as part of a command on one block inc the common seqno and tag the buffer with that number. Have each hw block write the lastest seqno that is done to a per block location. Now to determine is buffer is done compare the buffer seqno with the max of all the signaled seqno of all blocks.
Cost 1 uint32 per buffer and simple if without locking to check status of a buffer.
Hmm? The trivial and first use of fence objects in the linux DRM was triggered by the fact that a 32-bit seqno wraps pretty quickly and a 32-bit solution just can't be made robust. Now a 64-bit seqno will probably be robust for forseeable future, but when it comes to implement that on 32-bit hardware and compare it to a simple fence object approach,
Using same kind of arithemic as use for jiffies would do it provided that there is a checking that we never let someobject pass above a certain age.
But wouldn't the search-for-max scheme break if blocks complete out of order?
/Thomas
On Sat, Aug 16, 2014 at 09:01:27AM +0200, Thomas Hellstrom wrote:
On 08/15/2014 04:52 PM, Jerome Glisse wrote:
On Fri, Aug 15, 2014 at 08:54:38AM +0200, Thomas Hellstrom wrote:
On 08/14/2014 09:15 PM, Jerome Glisse wrote:
On Thu, Aug 14, 2014 at 08:47:16PM +0200, Daniel Vetter wrote:
On Thu, Aug 14, 2014 at 8:18 PM, Jerome Glisse j.glisse@gmail.com wrote:
Sucks because you can not do weird synchronization like one i depicted in another mail in this thread and for as long as cmdbuf_ioctl do not give you fence|syncpt you can not such thing cleanly in non hackish way.
Actually i915 can soon will do that that.
So you will return fence|syncpoint with each cmdbuf_ioctl ?
Sucks because you have a fence object per buffer object and thus overhead grow with the number of objects. Not even mentioning fence lifetime issue.
Sucks because sub-buffer allocation is just one of many tricks that can not be achieved properly and cleanly with implicit sync.
...
Well I heard all those reasons and I'm well of aware of them. The problem is that with current hardware the kernel needs to know for each buffer how long it needs to be kept around since hw just can't do page faulting. Yeah you can pin them but for an uma design that doesn't go down well with folks.
I am not thinking with fancy hw in mind, on contrary i thought about all this with the crappiest hw i could think of, in mind.
Yes you can get rid of fence and not have to pin memory with current hw. What matter for unpinning is to know that all hw block are done using the memory. This is easily achievable with your beloved seqno. Have one seqno per driver (one driver can have different block 3d, video decoding, crtc, ...) each time a buffer is use as part of a command on one block inc the common seqno and tag the buffer with that number. Have each hw block write the lastest seqno that is done to a per block location. Now to determine is buffer is done compare the buffer seqno with the max of all the signaled seqno of all blocks.
Cost 1 uint32 per buffer and simple if without locking to check status of a buffer.
Hmm? The trivial and first use of fence objects in the linux DRM was triggered by the fact that a 32-bit seqno wraps pretty quickly and a 32-bit solution just can't be made robust. Now a 64-bit seqno will probably be robust for forseeable future, but when it comes to implement that on 32-bit hardware and compare it to a simple fence object approach,
Using same kind of arithemic as use for jiffies would do it provided that there is a checking that we never let someobject pass above a certain age.
But wouldn't the search-for-max scheme break if blocks complete out of order?
It seems I used the wrong word in my email; the pseudo code is correct, however: it's not max, it's min. So by knowing the minimum global seqno across all hw blocks, you know that at the very least all hw blocks are past that point, which means the order in which blocks complete is irrelevant. But it also means that you might delay the unbinding more than needed, though the pseudo code shows a way to alleviate and minimize this delay.

It's doable; I had it kind of working a couple of years ago when we were reworking fences in the radeon driver, but I never got around to finishing something clean, because implicit sync needs the reservation scheme, which means you kind of need a fence per buffer (well, getting rid of that is where things get complicated and painful).

Explicit sync is just so much nicer, but we made the decision long ago to go with implicit, another thing I am deeply regretting now.
Cheers, Jérôme
/Thomas
Op 13-08-14 om 19:07 schreef Jerome Glisse:
On Wed, Aug 13, 2014 at 05:54:20PM +0200, Daniel Vetter wrote:
On Wed, Aug 13, 2014 at 09:36:04AM -0400, Jerome Glisse wrote:
On Wed, Aug 13, 2014 at 10:28:22AM +0200, Daniel Vetter wrote:
On Tue, Aug 12, 2014 at 06:13:41PM -0400, Jerome Glisse wrote:
fences can be used free-standing and no one forces you to integrate them with buffers. We actually plan to go this way with the intel svm stuff. Ofc for dma-buf the plan is to synchronize using such fences, but that's somewhat orthogonal I think. At least you only talk about fences and timeline and not dma-buf here.
When scheduling something that requires synchronization on a hardware timeline, a fence is created and associated with both the pipeline timeline and the hardware timeline. Other hardware blocks that need to wait on a fence can use their command ring conditional execution to directly check the fence sequence from the other hw block, so you get optimistic scheduling. If optimistic scheduling fails (which would be reported by a hw block specific mechanism and hidden), then things can fall back to a software cpu wait inside what could be considered the kernel thread of the pipeline timeline.
From an api point of view there is no inter-driver call. All a driver needs to do is wake up the pipeline timeline wait_queue when things are signaled or when things go sideways (gpu lockup).
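To make the flow concrete, here is a purely hypothetical sketch; none of these types or helpers (hw_ring, emit_cond_skip, pipe_timeline_cpu_wait, ...) exist, they only illustrate the optimistic path with a cpu fallback described above:

#include <linux/types.h>

struct hw_ring;                 /* one per hardware block, driver-owned */
struct cmdbuf;                  /* userspace command buffer */

struct hw_timeline {
        u64 *signaled_seqno;    /* memory location the hw block writes */
};

struct pipe_fence {
        struct hw_timeline *hw_timeline;
        u64 seq_num;
};

/* Emit "if (*addr >= seq) execute else skip" around the next commands. */
void emit_cond_skip(struct hw_ring *ring, u64 *addr, u64 seq);
void hw_ring_emit(struct hw_ring *ring, struct cmdbuf *cs);
bool hw_ring_cond_failed(struct hw_ring *ring);
/* Sleep on the pipeline timeline wait queue, from process/kthread context. */
int pipe_timeline_cpu_wait(struct pipe_fence *fence);

static int submit_with_dependency(struct hw_ring *ring, struct cmdbuf *cs,
                                  struct pipe_fence *dep)
{
        /* Optimistic: let the ring itself check the other block's seqno. */
        emit_cond_skip(ring, dep->hw_timeline->signaled_seqno, dep->seq_num);
        hw_ring_emit(ring, cs);

        /* Fallback: the hw reported the condition failed (or a lockup). */
        if (hw_ring_cond_failed(ring))
                return pipe_timeline_cpu_wait(dep);

        return 0;
}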
So how do we implement that with current drivers? Well, easy. Currently we assume implicit synchronization, so all we need is an implicit pipeline timeline per userspace process (note this does not prevent inter-process synchronization). Every time a command buffer is submitted it is added to the implicit timeline with a simple fence object:
struct fence {
        struct list_head list_hwtimeline;
        struct list_head list_pipetimeline;
        struct hw_timeline *hw_timeline;
        uint64_t seq_num;
        work_t timedout_work;
        void *csdata;
};
So with a set of helper functions called by each driver's command execution ioctl, you get an implicit timeline that is properly populated, and each driver command execution gets its dependencies from the implicit timeline.
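As an illustration only, such helpers might look roughly like this, building on the struct fence sketched above; struct pipe_timeline and the function names are hypothetical, not existing API:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

struct pipe_timeline {
        spinlock_t lock;
        struct list_head fences;        /* pending fences, submission order */
        wait_queue_head_t wait_queue;
};

/* Called from each driver's command submission ioctl. */
static void pipe_timeline_add_fence(struct pipe_timeline *tl, struct fence *f)
{
        spin_lock(&tl->lock);
        list_add_tail(&f->list_pipetimeline, &tl->fences);
        spin_unlock(&tl->lock);
}

/* Called by a driver when its hw timeline advances, or on lockup. */
static void pipe_timeline_signal(struct pipe_timeline *tl)
{
        wake_up_all(&tl->wait_queue);
}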
Of course, to take full advantage of all the flexibility this could offer, we would need to allow userspace to create pipeline timelines and to schedule against the pipeline timeline of its choice. We could create a file for each pipeline timeline and have file operations to wait on / query progress.
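A hypothetical sketch of that file-based interface, reusing the pipe_timeline sketch above; the fops names and pipe_timeline_has_new_signal() are invented for illustration:

#include <linux/anon_inodes.h>
#include <linux/fcntl.h>
#include <linux/fs.h>
#include <linux/module.h>
#include <linux/poll.h>

bool pipe_timeline_has_new_signal(struct pipe_timeline *tl);    /* hypothetical */

static unsigned int pipe_timeline_poll(struct file *file, poll_table *wait)
{
        struct pipe_timeline *tl = file->private_data;

        poll_wait(file, &tl->wait_queue, wait);
        return pipe_timeline_has_new_signal(tl) ? POLLIN : 0;
}

static const struct file_operations pipe_timeline_fops = {
        .owner  = THIS_MODULE,
        .poll   = pipe_timeline_poll,
        /* plus an ioctl or read() to query/wait on a specific point */
};

static int pipe_timeline_create_fd(struct pipe_timeline *tl)
{
        return anon_inode_getfd("[pipe_timeline]", &pipe_timeline_fops, tl,
                                O_RDWR | O_CLOEXEC);
}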
Note that gpu lockups are considered exceptional events; the implicit timeline will probably want to continue with other jobs on other hardware blocks, but the explicit one will probably want to decide whether to continue, abort, or retry without the faulty hw block.
I realize i am late to the party and that i should have taken a serious look at all this a long time ago. I apologize for that, and if you consider it too late then just ignore me, modulo the big warning about the craziness that callbacks will introduce and how bad things are bound to happen. I am not saying that bad things cannot happen with what i propose, just that, because everything happens inside the process context that is the one asking/requiring synchronization, there will be no inter-process kernel callbacks (a callback that was registered by one process and that is called inside another process's time slice because fence signaling happens inside that other process's time slice).
So I read through it all and, presuming I understand it correctly, your proposal and what we currently have are about the same. The big difference is that you make a timeline a first-class object and move the callback queue from the fence to the timeline, which requires callers to check the fence/seqno/whatever themselves instead of pushing that responsibility down to the fence code.
No, the big difference is that there is no callback: when waiting for a fence you are either inside the process context that needs to wait for it, or inside a kernel thread's process context. Which means in both cases you can do whatever you want. What i hate about the fence code as it is, is the callback stuff, because you never know in which context fences are signaled, and therefore you never know in which context callbacks are executed.
Look at waitqueues a bit closer. They're implemented with callbacks ;-) The only difference is that you're allowed to have spurious wakeups and need to handle that somehow, so you need a separate check function.
No, this is not how wait queues are implemented, i.e. a wait queue does not call back a random function from a random driver; it calls back a limited set of functions from the core linux kernel scheduler, so that the process thread that was waiting, and was off the scheduler's list, is added back and marked as something that should be scheduled. Unless this part of the kernel drastically changed for the worse recently.
So this is fundamentally different: fences as they are now allow random driver callbacks, and this is bound to get ugly; it is bound to lead to one driver doing something that seems innocuous but turns out to wreak havoc when called from some other driver's function.
No, really, look closer.
fence_default_wait adds fence_default_wait_cb as a callback, which wakes up the waiting thread if the fence gets signaled. The callback calls wake_up_state, which calls try_to_wake_up.
default_wake_function, which is used by wait queues, does something similar: it calls try_to_wake_up.
Fence now has some additional checks, but originally it was implemented as a wait queue.
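To make the comparison concrete, here is a simplified (not verbatim) version of that default wait path, using the include/linux/fence.h interfaces; it registers a callback that does nothing but wake the sleeping task, exactly like default_wake_function does for a wait queue. The interruptible case and error handling are omitted:

#include <linux/fence.h>
#include <linux/sched.h>

struct default_wait_cb {
        struct fence_cb base;
        struct task_struct *task;
};

static void wait_cb_func(struct fence *fence, struct fence_cb *cb)
{
        struct default_wait_cb *wait =
                container_of(cb, struct default_wait_cb, base);

        wake_up_state(wait->task, TASK_NORMAL); /* ends up in try_to_wake_up() */
}

/* Simplified sketch of what fence_default_wait() does. */
static long wait_sketch(struct fence *fence, long timeout)
{
        struct default_wait_cb cb = { .task = current };

        if (fence_add_callback(fence, &cb.base, wait_cb_func))
                return timeout; /* already signaled */

        while (timeout > 0) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                if (test_bit(FENCE_FLAG_SIGNALED_BIT, &fence->flags))
                        break;
                timeout = schedule_timeout(timeout);
        }
        __set_current_state(TASK_RUNNING);

        fence_remove_callback(fence, &cb.base);
        return timeout;
}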
But because of driver differences I can't implement it as a straight wait queue. Some drivers may not have a reliable interrupt, so they need a custom wait function. (qxl) Some may need to do extra flushing to get fences signaled (vmwgfx), others need some locking to protect against gpu lockup races (radeon, i915??). And nouveau doesn't use wait queues, but rolls its own (nouveau).
Fences also don't imply implicit sync, you can use explicit sync if you want.
I posted a patch for this, but if you want to create an android userspace fence, call
struct sync_fence *sync_fence_create_dma(const char *name, struct fence *pt)
I'll try to get the patch for this in 3.18 through the dma-buf tree, i915 wants to use it.
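For reference, a hedged usage sketch assuming that helper plus the fd helpers from the staging Android sync code (sync_fence_install() and friends); error handling is minimal, the "driver-fence" name is arbitrary, and NULL on failure from sync_fence_create_dma() is an assumption:

#include <linux/errno.h>
#include <linux/fcntl.h>
#include <linux/file.h>
#include "sync.h"       /* drivers/staging/android/sync.h, assumed reachable */

static int export_fence_as_sync_fd(struct fence *f)
{
        struct sync_fence *sf;
        int fd = get_unused_fd_flags(O_CLOEXEC);

        if (fd < 0)
                return fd;

        sf = sync_fence_create_dma("driver-fence", f);
        if (!sf) {
                put_unused_fd(fd);
                return -ENOMEM;
        }

        sync_fence_install(sf, fd);     /* hands the sync fd to userspace */
        return fd;
}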
~Maarten
But because of driver differences I can't implement it as a straight wait queue. Some drivers may not have a reliable interrupt, so they need a custom wait function. (qxl) Some may need to do extra flushing to get fences signaled (vmwgfx), others need some locking to protect against gpu lockup races (radeon, i915??). And nouveau doesn't use wait queues, but rolls its own (nouveau).
But when all those drivers need a special wait function, how can you still justify the common callback when a fence is signaled?
If I understood it right, the use case for this was waiting for any fence of a list of fences from multiple drivers, but if each driver needs special handling for its wait, how can that work reliably?
Christian.
Op 14-08-14 om 13:53 schreef Christian König:
But when all those drivers need a special wait function, how can you still justify the common callback when a fence is signaled?
If I understood it right, the use case for this was waiting for any fence of a list of fences from multiple drivers, but if each driver needs special handling for its wait, how can that work reliably?
TTM doesn't rely on the callbacks. It will call .enable_signaling when .is_signaled is NULL, to make sure that fence_is_signaled returns true sooner.
QXL is completely awful, I've seen some patches to add dma-buf support but I'll make sure it never supports importing from/exporting to other devices. This should reduce insanity factor there. If I understand QXL correctly, sometimes fences may never signal at all due to virt-hw bugs.
nouveau (pre nv84) has no interrupt for completed work, but it has a reliable is_signaled. So .enable_signaling only checks if fence is signaled here. A custom waiting function makes sure things work correctly, and also signals all unsignaled fences for that context. I preserved the original wait from before the fence conversion. Nouveau keeps a global list of unsignaled fences, so they will all signal eventually. I may have to limit importing/exporting dma-buf's to other devices, or add delayed work that periodically checks all contexts for completed fences for this to work cross-device.
nv84+ has a sane interrupt, so I use it. :-)
Radeon with fence conversion has the delayed work for handling lockups that also checks.
~Maarten
Am 14.08.2014 um 14:37 schrieb Maarten Lankhorst:
TTM doesn't rely on the callbacks. It will call .enable_signaling when .is_signaled is NULL, to make sure that fence_is_signaled returns true sooner.
QXL is completely awful, I've seen some patches to add dma-buf support but I'll make sure it never supports importing from/exporting to other devices. This should reduce insanity factor there.
So if I understand you right QXL doesn't really implement the whole fence interface? It just implements enough to make TTM happy and you forbid DMA-buf support because the rest isn't really working?
Sorry, but that just sounds like your fence design isn't doing the right thing here.
If I understand QXL correctly, sometimes fences may never signal at all due to virt-hw bugs.
Radeon has the same problem: with the hardware scheduler, each client essentially has its own fence sequence number range. If you kill a client, the remaining fences don't necessarily get signaled by the hardware.
nouveau (pre nv84) has no interrupt for completed work, but it has a reliable is_signaled. So .enable_signaling only checks if fence is signaled here. A custom waiting function makes sure things work correctly, and also signals all unsignaled fences for that context. I preserved the original wait from before the fence conversion. Nouveau keeps a global list of unsignaled fences, so they will all signal eventually. I may have to limit importing/exporting dma-buf's to other devices, or add delayed work that periodically checks all contexts for completed fences for this to work cross-device.
nv84+ has a sane interrupt, so I use it. :-)
Radeon with fence conversion has the delayed work for handling lockups that also checks.
Which I still haven't had time to check completely, but it sounds more and more like those fallbacks for interrupts that never fired should be in the common fence code and not in each individual driver.
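Hypothetically, such a common fallback could be little more than delayed work that polls unsignaled fences; none of the structures below exist, they just sketch the idea:

#include <linux/fence.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct polled_fence {
        struct fence *fence;
        struct list_head node;
};

struct fence_poll_work {
        struct delayed_work work;
        spinlock_t lock;
        struct list_head pending;       /* of struct polled_fence */
};

static void fence_poll_work_func(struct work_struct *work)
{
        struct delayed_work *dwork = to_delayed_work(work);
        struct fence_poll_work *fpw =
                container_of(dwork, struct fence_poll_work, work);
        struct polled_fence *pf, *tmp;

        spin_lock(&fpw->lock);
        list_for_each_entry_safe(pf, tmp, &fpw->pending, node) {
                /* fence_is_signaled() signals the fence if ->signaled says so */
                if (fence_is_signaled(pf->fence)) {
                        list_del(&pf->node);
                        fence_put(pf->fence);
                        kfree(pf);
                }
        }
        if (!list_empty(&fpw->pending))
                schedule_delayed_work(&fpw->work, HZ / 10);
        spin_unlock(&fpw->lock);
}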
Christian.
On Thu, Aug 14, 2014 at 11:15:11AM +0200, Maarten Lankhorst wrote:
But because of driver differences I can't implement it as a straight wait queue. Some drivers may not have a reliable interrupt, so they need a custom wait function. (qxl) Some may need to do extra flushing to get fences signaled (vmwgfx), others need some locking to protect against gpu lockup races (radeon, i915??). And nouveau doesn't use wait queues, but rolls its own (nouveau).
This is not how i read the code. Yes, the default_wait helper uses a wait_queue-like implementation, but this is not what you encourage people to implement. If i understand correctly, you want drivers to call fence_signal, which goes over the list of callbacks and calls each of them. That is the issue. The way you encourage people to do things is with callbacks. Yes, it can be worked around by a driver, but then why have a common api if each driver ends up working around it the way it likes? Makes no sense.
Fences also don't imply implicit sync, you can use explicit sync if you want.
I do understand that. Now understand this: associating a fence with a buffer object is wrong for explicit sync, it is broken for explicit sync, it is problematic for explicit sync. It really is, think about it. You do not care about the last fence on a given buffer; you care about some operation you sent to some hw block with respect to some other operation you want to execute on some other hw block.
That's the whole problem android sync-point is trying to solve, at least as i understand the problem.
So associating fences with buffer objects is wrong; it makes for complex code, adds the overhead of going over all the buffers that need to be fenced, and brings complex locking and fence lifetime management.
I posted a patch for this, but if you want to create an android userspace fence, call
struct sync_fence *sync_fence_create_dma(const char *name, struct fence *pt)
What i said is that exposing fences to userspace is wrong; exposing a timeline with sync points on the timeline is, i believe, the right way to do it.
I'll try to get the patch for this in 3.18 through the dma-buf tree, i915 wants to use it.
Like i said, feel free to ignore what i am saying here. I am just saying that this is ugly, that it will lead to issues, and that it does not allow the full flexibility you can get with explicit sync (i am not saying you cannot do explicit sync, i am saying you can only do a small number of explicit sync patterns).
On Wed, Aug 13, 2014 at 1:07 PM, Jerome Glisse j.glisse@gmail.com wrote:
So this is fundamentally different: fences as they are now allow random driver callbacks, and this is bound to get ugly; it is bound to lead to one driver doing something that seems innocuous but turns out to wreak havoc when called from some other driver's function.
tbh, that seems solvable by some strict rules about what you can do in the callback.. ie. don't do anything you couldn't do in atomic, and don't signal another fence.. off the top of my head that seems sufficient.
If the driver getting the callback needs to do more, then it can always schedule a worker..
But I could certainly see the case where the driver waiting on fence sets everything up before installing the cb and then just needs to write one or a couple regs from the cb.
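A sketch of that pattern, assuming the fence_add_callback() interface from include/linux/fence.h; the doorbell register and the worker are hypothetical driver details:

#include <linux/errno.h>
#include <linux/fence.h>
#include <linux/io.h>
#include <linux/workqueue.h>

struct kick_cb {
        struct fence_cb base;
        void __iomem *doorbell;         /* hypothetical "go" register */
        struct work_struct slow_work;   /* anything that can't run atomically */
};

/* Runs from whatever context signals the fence, so atomic-safe work only. */
static void kick_ring_cb(struct fence *fence, struct fence_cb *cb)
{
        struct kick_cb *kick = container_of(cb, struct kick_cb, base);

        writel(1, kick->doorbell);              /* poke a register ... */
        schedule_work(&kick->slow_work);        /* ... and defer the rest */
}

static int queue_after_fence(struct kick_cb *kick, struct fence *dep)
{
        int ret = fence_add_callback(dep, &kick->base, kick_ring_cb);

        if (ret == -ENOENT) {                   /* already signaled */
                kick_ring_cb(dep, &kick->base);
                return 0;
        }
        return ret;
}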
BR, -R
On Thu, Aug 14, 2014 at 09:16:02AM -0400, Rob Clark wrote:
tbh, that seems solvable by some strict rules about what you can do in the callback.. ie. don't do anything you couldn't do in atomic, and don't signal another fence.. off the top of my head that seems sufficient.
If the driver getting the callback needs to do more, then it can always schedule a worker..
But I could certainly see the case where the driver waiting on fence sets everything up before installing the cb and then just needs to write one or a couple regs from the cb.
Yes, sane code will do sane things. Sadly i fear we cannot enforce sane code everywhere, especially with out of tree drivers, and i would rather force their hand to only allow sane implementations. Providing a callback api obviously allows them to do crazy stuff.
On Thu, Aug 14, 2014 at 10:12:06AM -0400, Jerome Glisse wrote:
Yes, sane code will do sane things. Sadly i fear we cannot enforce sane code everywhere, especially with out of tree drivers, and i would rather force their hand to only allow sane implementations. Providing a callback api obviously allows them to do crazy stuff.
Well then don't support out of tree drivers. Fairly easy problem really, and last time I checked "out of tree drivers suck" isn't a valid objection for upstream code ... It's kinda assumed that they all do, it's why we have staging after all. -Daniel
On Thu, Aug 14, 2014 at 05:58:48PM +0200, Daniel Vetter wrote:
Well then don't support out of tree drivers. Fairly easy problem really, and last time I checked "out of tree drivers suck" isn't a valid objection for upstream code ... It's kinda assumed that they all do, it's why we have staging after all.
As usual i fail at expressing my point. I am not saying do not merge this because of out of tree drivers; i am saying that while doing an api, let's make it sane and try to make it so that it enforces sanity on anything that lives outside our realm.
And not even thinking outside the kernel tree: someone might come along and start using fences, see the callback stuff and start doing crazy things with it, all inside a different obscure kernel subsystem. Then someone in graphics sees that and uses it as justification to do crazy things too, because hey, if he is doing it why can't i?
On 14-08-14 20:26, Jerome Glisse wrote:
As usual i fail at expressing my point. I am not saying do not merge this because of out of tree drivers; i am saying that while doing an api, let's make it sane and try to make it so that it enforces sanity on anything that lives outside our realm.
And not even thinking outside the kernel tree: someone might come along and start using fences, see the callback stuff and start doing crazy things with it, all inside a different obscure kernel subsystem. Then someone in graphics sees that and uses it as justification to do crazy things too, because hey, if he is doing it why can't i?
Since when has crazy been contagious?
And here's what stops you:
1. LOCKDEP
2. PROVE_RCU
3. rcu sparse annotations (kbuild test bot)
4. DEBUG_ATOMIC_SLEEP
~Maarten
On Thu, Aug 14, 2014 at 10:12 AM, Jerome Glisse j.glisse@gmail.com wrote:
Yes, sane code will do sane things. Sadly i fear we cannot enforce sane code everywhere, especially with out of tree drivers, and i would rather force their hand to only allow sane implementations. Providing a callback api obviously allows them to do crazy stuff.
callback's hard, let's go shopping..
But seriously, we've solved problems like this w/ various kernel debug features before. I am pretty sure we could make lock debugging, and maybe some new fence debugging option, catch a lot of things. There is probably a better way, but a dummy spinlock around the callback, or maybe preempt_enable()/disable() around the callback? And we should be able to come up with a way to catch signalling a fence from cb..
A bit of extra rope with a warning sign not to hang yourself and debug features to tell people when they are hanging themselves seems like the better option to me.
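For example, a hypothetical debug wrapper along those lines (the in_fence_cb marker and the wrapper itself are made up, not existing fence code) could look like:

#include <linux/bug.h>
#include <linux/fence.h>
#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(int, in_fence_cb);

static void debug_call_fence_cb(struct fence *fence, struct fence_cb *cb)
{
        /* Force atomic context so DEBUG_ATOMIC_SLEEP catches sleeping cbs. */
        preempt_disable();
        this_cpu_inc(in_fence_cb);

        cb->func(fence, cb);

        this_cpu_dec(in_fence_cb);
        preempt_enable();
}

/* Hooked into fence_signal(), this would catch signaling a fence from a cb. */
static inline void debug_check_nested_signal(void)
{
        WARN_ON_ONCE(this_cpu_read(in_fence_cb));
}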
BR, -R