The busywait in __i915_spin_request() does not respect pending signals and so may consume the entire timeslice for the task instead of returning to userspace to handle the signal.
Fixes regression from
commit 2def4ad99befa25775dd2f714fdd4d92faec6e34 [v4.2]
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 7 16:20:41 2015 +0100
drm/i915: Optimistically spin for the request completion
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/i915_gem.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 5cf4a1998273..740530c571d1 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1146,7 +1146,7 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
 	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
 }

-static int __i915_spin_request(struct drm_i915_gem_request *req)
+static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 {
 	unsigned long timeout;

@@ -1158,6 +1158,9 @@ static int __i915_spin_request(struct drm_i915_gem_request *req)
 		if (i915_gem_request_completed(req, true))
 			return 0;

+		if (signal_pending_state(state, current))
+			break;
+
 		if (time_after_eq(jiffies, timeout))
 			break;

@@ -1197,6 +1200,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	const bool irq_test_in_progress =
 		ACCESS_ONCE(dev_priv->gpu_error.test_irq_rings) & intel_ring_flag(ring);
+	int state = interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE;
 	DEFINE_WAIT(wait);
 	unsigned long timeout_expire;
 	s64 before, now;
@@ -1221,7 +1225,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 	before = ktime_get_raw_ns();

 	/* Optimistic spin for the next jiffie before touching IRQs */
-	ret = __i915_spin_request(req);
+	ret = __i915_spin_request(req, state);
 	if (ret == 0)
 		goto out;

@@ -1233,8 +1237,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 	for (;;) {
 		struct timer_list timer;

-		prepare_to_wait(&ring->irq_queue, &wait,
-				interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(&ring->irq_queue, &wait, state);

 		/* We need to check whether any gpu reset happened in between
 		 * the caller grabbing the seqno and now ... */
@@ -1252,7 +1255,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 			break;
 		}

-		if (interruptible && signal_pending(current)) {
+		if (signal_pending_state(state, current)) {
 			ret = -ERESTARTSYS;
 			break;
 		}
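The shape of the fix can be modelled in plain userspace C. This is an illustrative sketch only: `TASK_INTERRUPTIBLE`, `signal_pending_state_model()` and `spin_request_model()` below are hypothetical stand-ins for the kernel APIs, compressing the patch's logic of "poll for completion, but abandon the spin as soon as a signal should be serviced".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's task states. */
#define TASK_INTERRUPTIBLE   0x0001
#define TASK_UNINTERRUPTIBLE 0x0002

/* Model of signal_pending_state(): only an interruptible waiter
 * reacts to an ordinary pending signal. */
static bool signal_pending_state_model(int state, bool signal_pending)
{
    if (!(state & TASK_INTERRUPTIBLE))
        return false;
    return signal_pending;
}

/* Model of the fixed __i915_spin_request(): poll for completion, but
 * bail out of the spin as soon as a signal should be serviced.
 * Returns 0 on completion, -1 when giving up (sleep/handle signal). */
static int spin_request_model(int state, bool signal_pending,
                              int completes_at, int budget)
{
    for (int i = 0; i < budget; i++) {
        if (i >= completes_at)
            return 0;               /* request completed while spinning */
        if (signal_pending_state_model(state, signal_pending))
            break;                  /* return to handle the signal */
    }
    return -1;                      /* fall back to irq-driven wait */
}
```

An interruptible waiter with a pending signal now leaves the loop immediately instead of burning its whole budget, while an uninterruptible waiter keeps the old behaviour.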
When waiting for high frequency requests, the finite amount of time required to set up the irq and wait upon it limits the response rate. By busywaiting on the request completion for a short while we can service the high frequency waits as quick as possible. However, if it is a slow request, we want to sleep as quickly as possible. The tradeoff between waiting and sleeping is roughly the time it takes to sleep on a request, on the order of a microsecond. Based on measurements from big core, I have set the limit for busywaiting as 2 microseconds.
The code currently uses the jiffie clock, but that is far too coarse (on the order of 10 milliseconds) and results in poor interactivity as the CPU ends up being hogged by slow requests. To get microsecond resolution we need to use a high resolution timer. The cheapest of which is polling local_clock(), but that is only valid on the same CPU. If we switch CPUs because the task was preempted, we can also use that as an indicator that the system is too busy to waste cycles on spinning and we should sleep instead.
__i915_spin_request was introduced in
commit 2def4ad99befa25775dd2f714fdd4d92faec6e34 [v4.2]
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 7 16:20:41 2015 +0100
drm/i915: Optimistically spin for the request completion
Reported-by: Jens Axboe <axboe@kernel.dk>
Link: https://lkml.org/lkml/2015/11/12/621
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
Cc: stable@vger.kernel.org
---
 drivers/gpu/drm/i915/i915_gem.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 740530c571d1..2a88158bd1f7 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1146,14 +1146,36 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
 	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
 }

+static u64 local_clock_us(unsigned *cpu)
+{
+	u64 t;
+
+	*cpu = get_cpu();
+	t = local_clock() >> 10;
+	put_cpu();
+
+	return t;
+}
+
+static bool busywait_stop(u64 timeout, unsigned cpu)
+{
+	unsigned this_cpu;
+
+	if (time_after64(local_clock_us(&this_cpu), timeout))
+		return true;
+
+	return this_cpu != cpu;
+}
+
 static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 {
-	unsigned long timeout;
+	u64 timeout;
+	unsigned cpu;

 	if (i915_gem_request_get_ring(req)->irq_refcount)
 		return -EBUSY;

-	timeout = jiffies + 1;
+	timeout = local_clock_us(&cpu) + 2;
 	while (!need_resched()) {
 		if (i915_gem_request_completed(req, true))
 			return 0;
@@ -1161,7 +1183,7 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 		if (signal_pending_state(state, current))
 			break;

-		if (time_after_eq(jiffies, timeout))
+		if (busywait_stop(timeout, cpu))
 			break;

 		cpu_relax_lowlatency();
On Sun, Nov 15, 2015 at 01:32:44PM +0000, Chris Wilson wrote:
+static bool busywait_stop(u64 timeout, unsigned cpu)
+{
+	unsigned this_cpu;
+
+	if (time_after64(local_clock_us(&this_cpu), timeout))
+		return true;
Just a note to say that we can use the unsigned long version rather than pass around u64 as this test will wraparound correctly (if we discard the high bits on x86-32). -Chris
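The wraparound argument above can be demonstrated with a userspace replica of the kernel's `time_after()` idea: comparing via the signed difference of unsigned values stays correct across wraparound, so discarding the high bits (32-bit `unsigned long` on x86-32) is safe. The macro name is a local stand-in, not the kernel's.

```c
#include <assert.h>

/* Userspace replica of the time_after() trick: a is "after" b iff the
 * signed interpretation of (b - a) is negative.  Unsigned subtraction
 * makes this hold even when a has wrapped past b. */
#define time_after_ul(a, b) ((long)((b) - (a)) < 0)
```

For samples taken just before and just after the counter wraps, the comparison still orders them correctly, which is why a 32-bit microsecond timestamp suffices here.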
Hi,
On 15/11/15 13:32, Chris Wilson wrote:
When waiting for high frequency requests, the finite amount of time required to set up the irq and wait upon it limits the response rate. By busywaiting on the request completion for a short while we can service the high frequency waits as quick as possible. However, if it is a slow request, we want to sleep as quickly as possible. The tradeoff between waiting and sleeping is roughly the time it takes to sleep on a request, on the order of a microsecond. Based on measurements from big core, I have set the limit for busywaiting as 2 microseconds.
Sounds like solid reasoning. Would it also be worth finding the trade off limit for small core?
The code currently uses the jiffie clock, but that is far too coarse (on the order of 10 milliseconds) and results in poor interactivity as the CPU ends up being hogged by slow requests. To get microsecond resolution we need to use a high resolution timer. The cheapest of which is polling local_clock(), but that is only valid on the same CPU. If we switch CPUs because the task was preempted, we can also use that as an indicator that the system is too busy to waste cycles on spinning and we should sleep instead.
Hm, need_resched would not cover the CPU switch anyway? Or maybe need_resched means something other than I thought which is "there are other runnable tasks"?
This would also have an impact on the patch subject line. I thought we would burn a jiffie of CPU cycles only if there are no other runnable tasks - so how come an impact on interactivity?
Also again I think the commit message needs some data on how this was found and what is the impact.
Btw as it happens, just last week as I was playing with perf, I did notice busy spinning is the top cycle waster in some benchmarks. I was in the process of trying to quantize the difference with it on or off but did not complete it.
__i915_spin_request was introduced in
commit 2def4ad99befa25775dd2f714fdd4d92faec6e34 [v4.2]
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 7 16:20:41 2015 +0100
drm/i915: Optimistically spin for the request completion
Reported-by: Jens Axboe <axboe@kernel.dk>
Link: https://lkml.org/lkml/2015/11/12/621
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
Cc: stable@vger.kernel.org
 drivers/gpu/drm/i915/i915_gem.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 740530c571d1..2a88158bd1f7 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1146,14 +1146,36 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
 	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
 }
+static u64 local_clock_us(unsigned *cpu)
+{
+	u64 t;
+
+	*cpu = get_cpu();
+	t = local_clock() >> 10;
Needs comment I think to explicitly mention the approximation, or maybe drop the _us suffix?
+	put_cpu();
+
+	return t;
+}
+static bool busywait_stop(u64 timeout, unsigned cpu)
+{
+	unsigned this_cpu;
+
+	if (time_after64(local_clock_us(&this_cpu), timeout))
+		return true;
+
+	return this_cpu != cpu;
+}
 static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 {
-	unsigned long timeout;
+	u64 timeout;
+	unsigned cpu;

 	if (i915_gem_request_get_ring(req)->irq_refcount)
 		return -EBUSY;

-	timeout = jiffies + 1;
+	timeout = local_clock_us(&cpu) + 2;
 	while (!need_resched()) {
 		if (i915_gem_request_completed(req, true))
 			return 0;
@@ -1161,7 +1183,7 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 		if (signal_pending_state(state, current))
 			break;

-		if (time_after_eq(jiffies, timeout))
+		if (busywait_stop(timeout, cpu))
 			break;

 		cpu_relax_lowlatency();
Otherwise looks good. Not sure what would you convert to 32-bit from your follow up reply since you need us resolution?
Regards,
Tvrtko
On Mon, Nov 16, 2015 at 10:24:45AM +0000, Tvrtko Ursulin wrote:
Hi,
On 15/11/15 13:32, Chris Wilson wrote:
When waiting for high frequency requests, the finite amount of time required to set up the irq and wait upon it limits the response rate. By busywaiting on the request completion for a short while we can service the high frequency waits as quick as possible. However, if it is a slow request, we want to sleep as quickly as possible. The tradeoff between waiting and sleeping is roughly the time it takes to sleep on a request, on the order of a microsecond. Based on measurements from big core, I have set the limit for busywaiting as 2 microseconds.
Sounds like solid reasoning. Would it also be worth finding the trade off limit for small core?
Takes a bit longer, but 2us seems "ok" on PineView (as in it doesn't lose the boost from spinning rather than sleeping). Have some more testing to do on vlv/byt.
The code currently uses the jiffie clock, but that is far too coarse (on the order of 10 milliseconds) and results in poor interactivity as the CPU ends up being hogged by slow requests. To get microsecond resolution we need to use a high resolution timer. The cheapest of which is polling local_clock(), but that is only valid on the same CPU. If we switch CPUs because the task was preempted, we can also use that as an indicator that the system is too busy to waste cycles on spinning and we should sleep instead.
Hm, need_resched would not cover the CPU switch anyway? Or maybe need_resched means something other than I thought which is "there are other runnable tasks"?
As I understand it, it means that the scheduler tick fired (or something else yielded). I haven't spotted if it gets set as the runqueue changes. As it pertains to us, it means that we need to get to schedule() as quick as possible which along this path means going to sleep.
I'm not sure if need_resched() would catch the cpu switch, if we were preempted as the flag would be set on the idle process not us.
This would also have an impact on the patch subject line. I thought we would burn a jiffie of CPU cycles only if there are no other runnable tasks - so how come an impact on interactivity?
I have a couple of ideas for the effect on interactivty:
1. Burning through the time slice is acting as a penalty against running that process (typically the compositor) in the near future, perhaps enough to miss some deadlines.
2. Processor power balancing.
Also again I think the commit message needs some data on how this was found and what is the impact.
The system felt unresponsive. It would be interesting for me to know a few more details about the tick on that system (HZ, tickless?, preemption?) to see if changing the config on my xps13 also produces the lag/jitter/poor interactivity.
Btw as it happens, just last week as I was playing with perf, I did notice busy spinning is the top cycle waster in some benchmarks. I was in the process of trying to quantize the difference with it on or off but did not complete it.
I found that spin-request appearing in profiles makes tracking down the culprit higher in the stack much easier. Otherwise you have to remember to enable a pass with the tracepoint to find the stalls (or use intel-gpu-overlay which does it for you).
+static u64 local_clock_us(unsigned *cpu)
+{
+	u64 t;
+
+	*cpu = get_cpu();
+	t = local_clock() >> 10;
Needs comment I think to explicitly mention the approximation, or maybe drop the _us suffix?
I did consider _approx_us but thought that was overkill. A comment along the lines of

/* Approximately convert ns to us - the error is less than the
 * truncation!
 */
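The error the proposed comment talks about is easy to quantify with a userspace copy of the conversion: shifting right by 10 divides by 1024 rather than 1000, so the returned "microseconds" run about 2.4% small, less than the truncation of the fractional part and irrelevant against an arbitrary 2us budget. The function name here is illustrative, not from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Approximate ns -> us conversion as in the patch: >> 10 divides by
 * 1024 instead of 1000, so results are ~2.4% smaller than true us. */
static uint64_t ns_to_us_approx(uint64_t ns)
{
    return ns >> 10;
}
```

For example, 2,000,000 ns (exactly 2000 us) converts to 1953, an undercount of about 2.4%.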
@@ -1161,7 +1183,7 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 		if (signal_pending_state(state, current))
 			break;

-		if (time_after_eq(jiffies, timeout))
+		if (busywait_stop(timeout, cpu))
 			break;

 		cpu_relax_lowlatency();
Otherwise looks good. Not sure what would you convert to 32-bit from your follow up reply since you need us resolution?
s/u64/unsigned long/ s/time_after64/time_after/
32 bits of us resolution gives us 1000s before wraparound between the two samples. And I hope that 1000s doesn't pass between loops. Or if it does, the GPU managed to complete its task. -Chris
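The headroom can be checked with a little arithmetic: a time_after()-style comparison stays valid while the two samples are within half the 32-bit range, and 2^31 microseconds is about 2147 seconds, consistent with (indeed more generous than) the "1000s" ballpark above.

```c
#include <assert.h>
#include <stdint.h>

/* Half the range of a 32-bit microsecond counter, in whole seconds:
 * the window within which wraparound-safe comparisons remain valid. */
static uint32_t half_range_seconds(void)
{
    return (uint32_t)((1ULL << 31) / 1000000ULL);
}
```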
On 16/11/15 11:12, Chris Wilson wrote:
On Mon, Nov 16, 2015 at 10:24:45AM +0000, Tvrtko Ursulin wrote:
Hi,
On 15/11/15 13:32, Chris Wilson wrote:
When waiting for high frequency requests, the finite amount of time required to set up the irq and wait upon it limits the response rate. By busywaiting on the request completion for a short while we can service the high frequency waits as quick as possible. However, if it is a slow request, we want to sleep as quickly as possible. The tradeoff between waiting and sleeping is roughly the time it takes to sleep on a request, on the order of a microsecond. Based on measurements from big core, I have set the limit for busywaiting as 2 microseconds.
Sounds like solid reasoning. Would it also be worth finding the trade off limit for small core?
Takes a bit longer, but 2us seems "ok" on PineView (as in it doesn't lose the boost from spinning rather than sleeping). Have some more testing to do on vlv/byt.
Cool.
The code currently uses the jiffie clock, but that is far too coarse (on the order of 10 milliseconds) and results in poor interactivity as the CPU ends up being hogged by slow requests. To get microsecond resolution we need to use a high resolution timer. The cheapest of which is polling local_clock(), but that is only valid on the same CPU. If we switch CPUs because the task was preempted, we can also use that as an indicator that the system is too busy to waste cycles on spinning and we should sleep instead.
Hm, need_resched would not cover the CPU switch anyway? Or maybe need_resched means something other than I thought which is "there are other runnable tasks"?
As I understand it, it means that the scheduler tick fired (or something else yielded). I haven't spotted if it gets set as the runqueue changes. As it pertains to us, it means that we need to get to schedule() as quick as possible which along this path means going to sleep.
I'm not sure if need_resched() would catch the cpu switch, if we were preempted as the flag would be set on the idle process not us.
Could be, I wasn't sure at all, just curious and trying to understand it fully. Cpu check is so cheap as implemented that it is not under any criticism.
This would also have an impact on the patch subject line. I thought we would burn a jiffie of CPU cycles only if there are no other runnable tasks - so how come an impact on interactivity?
I have a couple of ideas for the effect on interactivty:
1. Burning through the time slice is acting as a penalty against running that process (typically the compositor) in the near future, perhaps enough to miss some deadlines.

2. Processor power balancing.
Also again I think the commit message needs some data on how this was found and what is the impact.
The system felt unresponsive. It would be interesting for me to know a few more details about the tick on that system (HZ, tickless?, preemption?) to see if changing the config on my xps13 also produces the lag/jitter/poor interactivity.
Yes interesting but not critical I think. Since the new scheme looks as efficient as the old one so there should be no downside anyway.
Btw as it happens, just last week as I was playing with perf, I did notice busy spinning is the top cycle waster in some benchmarks. I was in the process of trying to quantize the difference with it on or off but did not complete it.
I found that spin-request appearing in profiles makes tracking down the culprit higher in the stack much easier. Otherwise you have to remember to enable a pass with the tracepoint to find the stalls (or use intel-gpu-overlay which does it for you).
I'll put it on my TODO list of things to play with.
+static u64 local_clock_us(unsigned *cpu)
+{
+	u64 t;
+
+	*cpu = get_cpu();
+	t = local_clock() >> 10;
Needs comment I think to explicitly mention the approximation, or maybe drop the _us suffix?
I did consider _approx_us but thought that was overkill. A comment along the lines of

/* Approximately convert ns to us - the error is less than the
 * truncation!
 */
And the result is not used in subsequent calculations apart from comparing against an approximate timeout?
@@ -1161,7 +1183,7 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 		if (signal_pending_state(state, current))
 			break;

-		if (time_after_eq(jiffies, timeout))
+		if (busywait_stop(timeout, cpu))
 			break;

 		cpu_relax_lowlatency();
Otherwise looks good. Not sure what would you convert to 32-bit from your follow up reply since you need us resolution?
s/u64/unsigned long/ s/time_after64/time_after/
32 bits of us resolution gives us 1000s before wraparound between the two samples. And I hope that 1000s doesn't pass between loops. Or if it does, the GPU managed to complete its task.
Now I see that you did say low bits.. yes that sounds fine.
Btw while you are optimizing things maybe pick up this micro optimization: http://patchwork.freedesktop.org/patch/64339/
Not in scope of this thread but under the normal development patch flow.
Btw2, any benchmark result changes with this?
Regards,
Tvrtko
On Mon, Nov 16, 2015 at 12:08:08PM +0000, Tvrtko Ursulin wrote:
On 16/11/15 11:12, Chris Wilson wrote:
On Mon, Nov 16, 2015 at 10:24:45AM +0000, Tvrtko Ursulin wrote:
On 15/11/15 13:32, Chris Wilson wrote:
+static u64 local_clock_us(unsigned *cpu)
+{
+	u64 t;
+
+	*cpu = get_cpu();
+	t = local_clock() >> 10;
Needs comment I think to explicitly mention the approximation, or maybe drop the _us suffix?
I did consider _approx_us but thought that was overkill. A comment along the lines of

/* Approximately convert ns to us - the error is less than the
 * truncation!
 */
And the result is not used in subsequent calculations apart from comparing against an approximate timeout?
Exactly, the timeout is fairly arbitrary and defined in the same units. That we truncate is a much bigger cause for concern in terms of spinning accurately for a definite length of time.
@@ -1161,7 +1183,7 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 		if (signal_pending_state(state, current))
 			break;

-		if (time_after_eq(jiffies, timeout))
+		if (busywait_stop(timeout, cpu))
 			break;

 		cpu_relax_lowlatency();
Otherwise looks good. Not sure what would you convert to 32-bit from your follow up reply since you need us resolution?
s/u64/unsigned long/ s/time_after64/time_after/
32 bits of us resolution gives us 1000s before wraparound between the two samples. And I hope that 1000s doesn't pass between loops. Or if it does, the GPU managed to complete its task.
Now I see that you did say low bits.. yes that sounds fine.
Btw while you are optimizing things maybe pick up this micro optimization: http://patchwork.freedesktop.org/patch/64339/
Not in scope of this thread but under the normal development patch flow.
There's a different series which looks at tackling the scalability issue with dozens of concurrent waiters. I have an equivalent patch there and one to tidy up the seqno query.
Btw2, any benchmark result changes with this?
Spinning still gives the dramatic (2x) improvement in the microbenchmarks (over pure interrupt driven waits), so that improvement is preserved. There are a couple of interesting swings in the macro tests (compared to the previous jiffie patch) just above the noise level which could well be a change in the throttling/scheduling. (And those tests are also the ones that correspond to the greatest gains (10-40%) using spinning.) -Chris
On 16/11/15 12:55, Chris Wilson wrote:
On Mon, Nov 16, 2015 at 12:08:08PM +0000, Tvrtko Ursulin wrote:
On 16/11/15 11:12, Chris Wilson wrote:
On Mon, Nov 16, 2015 at 10:24:45AM +0000, Tvrtko Ursulin wrote:
On 15/11/15 13:32, Chris Wilson wrote:
+static u64 local_clock_us(unsigned *cpu)
+{
+	u64 t;
+
+	*cpu = get_cpu();
+	t = local_clock() >> 10;
Needs comment I think to explicitly mention the approximation, or maybe drop the _us suffix?
I did consider _approx_us but thought that was overkill. A comment along the lines of

/* Approximately convert ns to us - the error is less than the
 * truncation!
 */
And the result is not used in subsequent calculations apart from comparing against an approximate timeout?
Exactly, the timeout is fairly arbitrary and defined in the same units. That we truncate is a much bigger cause for concern in terms of spinning accurately for a definite length of time.
Bah sorry that was not supposed to be a question but a suggestion to add to the comment. Must have mistyped the question mark. :)
Regards,
Tvrtko
On Mon, Nov 16, 2015 at 12:55:37PM +0000, Chris Wilson wrote:
On Mon, Nov 16, 2015 at 12:08:08PM +0000, Tvrtko Ursulin wrote:
On 16/11/15 11:12, Chris Wilson wrote:
On Mon, Nov 16, 2015 at 10:24:45AM +0000, Tvrtko Ursulin wrote:
On 15/11/15 13:32, Chris Wilson wrote:
+static u64 local_clock_us(unsigned *cpu)
+{
+	u64 t;
+
+	*cpu = get_cpu();
+	t = local_clock() >> 10;
Needs comment I think to explicitly mention the approximation, or maybe drop the _us suffix?
I did consider _approx_us but thought that was overkill. A comment along the lines of

/* Approximately convert ns to us - the error is less than the
 * truncation!
 */
And the result is not used in subsequent calculations apart from comparing against an approximate timeout?
Exactly, the timeout is fairly arbitrary and defined in the same units. That we truncate is a much bigger cause for concern in terms of spinning accurately for a definite length of time.
@@ -1161,7 +1183,7 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 		if (signal_pending_state(state, current))
 			break;

-		if (time_after_eq(jiffies, timeout))
+		if (busywait_stop(timeout, cpu))
 			break;

 		cpu_relax_lowlatency();
Otherwise looks good. Not sure what would you convert to 32-bit from your follow up reply since you need us resolution?
s/u64/unsigned long/ s/time_after64/time_after/
32 bits of us resolution gives us 1000s before wraparound between the two samples. And I hope that 1000s doesn't pass between loops. Or if it does, the GPU managed to complete its task.
Now I see that you did say low bits.. yes that sounds fine.
Btw while you are optimizing things maybe pick up this micro optimization: http://patchwork.freedesktop.org/patch/64339/
Not in scope of this thread but under the normal development patch flow.
There's a different series which looks at tackling the scalability issue with dozens of concurrent waiters. I have an equivalent patch there and one to tidy up the seqno query.
Btw2, any benchmark result changes with this?
Spinning still gives the dramatic (2x) improvement in the microbenchmarks (over pure interrupt driven waits), so that improvement is preserved.
Previously the spinning also increased power consumption without offering any significant performance difference for some workloads. IIRC on my BYT the average CPU power consumption was ~100mW higher (as reported by RAPL) with xonotic the-big-keybench.dem (1920x1200 w/ "High" settings, IIRC) but average fps wasn't improved. Might be interesting to know how the improved spin code stacks up on that front.
There are a couple of interesting swings in the macro tests (compared to the previous jiffie patch) just above the noise level which could well be a change in the throttling/scheduling. (And those tests are also the ones that correspond to the greatest gains (10-40%) using spinning.) -Chris
-- 
Chris Wilson, Intel Open Source Technology Centre
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
On 11/15/2015 06:32 AM, Chris Wilson wrote:
When waiting for high frequency requests, the finite amount of time required to set up the irq and wait upon it limits the response rate. By busywaiting on the request completion for a short while we can service the high frequency waits as quick as possible. However, if it is a slow request, we want to sleep as quickly as possible. The tradeoff between waiting and sleeping is roughly the time it takes to sleep on a request, on the order of a microsecond. Based on measurements from big core, I have set the limit for busywaiting as 2 microseconds.
The code currently uses the jiffie clock, but that is far too coarse (on the order of 10 milliseconds) and results in poor interactivity as the CPU ends up being hogged by slow requests. To get microsecond resolution we need to use a high resolution timer. The cheapest of which is polling local_clock(), but that is only valid on the same CPU. If we switch CPUs because the task was preempted, we can also use that as an indicator that the system is too busy to waste cycles on spinning and we should sleep instead.
I tried this (1+2), and it feels better. However, I added some counters just to track how well it's faring:
[ 491.077612] i915: invoked=7168, success=50
so out of 6144 invocations, we only avoided going to sleep 49 of those times. As a percentage, that's 99.3% of the time we spun 2usec for no good reason other than to burn up more of my battery. So the reason there's an improvement for me is that we're just not spinning the 10ms anymore, however we're still just wasting time for my use case.
I'd recommend putting this behind some option so that people can enable it and play with it if they want, but not making it default to on until some more clever tracking has been added to dynamically adapt to on when to poll and when not to. It should not be a default-on type of thing until it's closer to doing the right thing for a normal workload, not just some synthetic benchmark.
Hi!
Reported-by: Jens Axboe <axboe@kernel.dk>
Link: https://lkml.org/lkml/2015/11/12/621
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
Cc: stable@vger.kernel.org
 drivers/gpu/drm/i915/i915_gem.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 740530c571d1..2a88158bd1f7 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1146,14 +1146,36 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
 	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
 }
+static u64 local_clock_us(unsigned *cpu)
+{
+	u64 t;
+
+	*cpu = get_cpu();
+	t = local_clock() >> 10;
+	put_cpu();
+
+	return t;
+}
+
+static bool busywait_stop(u64 timeout, unsigned cpu)
+{
+	unsigned this_cpu;
+
+	if (time_after64(local_clock_us(&this_cpu), timeout))
+		return true;
+
+	return this_cpu != cpu;
+}
Perhaps you want to ask the timekeeping people for the right primitive? I guess you are not the only one needing this.. Pavel
Hi,
On 15/11/15 13:32, Chris Wilson wrote:
The busywait in __i915_spin_request() does not respect pending signals and so may consume the entire timeslice for the task instead of returning to userspace to handle the signal.
Obviously correct to break the spin, but if spending a jiffie to react to signals was the only problem then it is not too severe.
Add something to the commit message about how it was found/reported and about the severity of impact, etc?
Otherwise,
Reviewed-by: Tvrtko Ursulin tvrtko.ursulin@intel.com
Fixes regression from
commit 2def4ad99befa25775dd2f714fdd4d92faec6e34 [v4.2]
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 7 16:20:41 2015 +0100
drm/i915: Optimistically spin for the request completion
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Rogozhkin, Dmitry V" <dmitry.v.rogozhkin@intel.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Eero Tamminen <eero.t.tamminen@intel.com>
Cc: "Rantala, Valtteri" <valtteri.rantala@intel.com>
Cc: stable@vger.kernel.org
 drivers/gpu/drm/i915/i915_gem.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 5cf4a1998273..740530c571d1 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1146,7 +1146,7 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
 	return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
 }
-static int __i915_spin_request(struct drm_i915_gem_request *req)
+static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
 {
 	unsigned long timeout;

@@ -1158,6 +1158,9 @@ static int __i915_spin_request(struct drm_i915_gem_request *req)
 		if (i915_gem_request_completed(req, true))
 			return 0;

+		if (signal_pending_state(state, current))
+			break;
+
 		if (time_after_eq(jiffies, timeout))
 			break;

@@ -1197,6 +1200,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 	struct drm_i915_private *dev_priv = dev->dev_private;
 	const bool irq_test_in_progress =
 		ACCESS_ONCE(dev_priv->gpu_error.test_irq_rings) & intel_ring_flag(ring);
+	int state = interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE;
 	DEFINE_WAIT(wait);
 	unsigned long timeout_expire;
 	s64 before, now;

@@ -1221,7 +1225,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 	before = ktime_get_raw_ns();

 	/* Optimistic spin for the next jiffie before touching IRQs */
-	ret = __i915_spin_request(req);
+	ret = __i915_spin_request(req, state);
 	if (ret == 0)
 		goto out;

@@ -1233,8 +1237,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 	for (;;) {
 		struct timer_list timer;

-		prepare_to_wait(&ring->irq_queue, &wait,
-				interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(&ring->irq_queue, &wait, state);

 		/* We need to check whether any gpu reset happened in between
 		 * the caller grabbing the seqno and now ... */
@@ -1252,7 +1255,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
 			break;
 		}

-		if (interruptible && signal_pending(current)) {
+		if (signal_pending_state(state, current)) {
 			ret = -ERESTARTSYS;
 			break;
 		}
On Mon, Nov 16, 2015 at 09:54:10AM +0000, Tvrtko Ursulin wrote:
Hi,
On 15/11/15 13:32, Chris Wilson wrote:
The busywait in __i915_spin_request() does not respect pending signals and so may consume the entire timeslice for the task instead of returning to userspace to handle the signal.
Obviously correct to break the spin, but if spending a jiffie to react to signals was the only problem then it is not too severe.
Add something to the commit message about how it was found/reported and about the severity of impact, etc?
Perhaps:
In the worst case this could cause a delay in signal processing of 20ms, which would be a noticeable jitter in cursor tracking. If a higher resolution signal was being used, for example to provide fairness of server timeslices between clients, we could expect to detect some unfairness between clients. This issue was noticed when inspecting a report of poor interactivity resulting from excessively high __i915_spin_request usage. -Chris
On 16/11/15 11:22, Chris Wilson wrote:
On Mon, Nov 16, 2015 at 09:54:10AM +0000, Tvrtko Ursulin wrote:
Hi,
On 15/11/15 13:32, Chris Wilson wrote:
The busywait in __i915_spin_request() does not respect pending signals and so may consume the entire timeslice for the task instead of returning to userspace to handle the signal.
Obviously correct to break the spin, but if spending a jiffie to react to signals was the only problem then it is not too severe.
Add something to the commit message about how it was found/reported and about the severity of impact, etc?
Perhaps:
In the worst case this could cause a delay in signal processing of 20ms, which would be a noticeable jitter in cursor tracking. If a higher resolution signal was being used, for example to provide fairness of server timeslices between clients, we could expect to detect some unfairness between clients. This issue was noticed when inspecting a report of poor interactivity resulting from excessively high __i915_spin_request usage.
Oh it's the Xorg scheduler tick... I always forget about that. I was thinking that it is only about fatal, or at least infrequent, signals.
Regards,
Tvrtko