From: Keith Packard <keithp@keithp.com>
Date: Mon, 20 Dec 2010 19:23:40 -0800
Subject: Re: [Intel-gfx] [PATCH] drm: Aggressively disable vblanks
To: Andy Lutomirski <luto@MIT.EDU>, Jesse Barnes <jbarnes@virtuousgeek.org>, Chris Wilson <chris@chris-wilson.co.uk>, David Airlie <airlied@linux.ie>
Cc: intel-gfx@lists.freedesktop.org, dri-devel@lists.freedesktop.org
On Mon, 20 Dec 2010 14:00:54 -0500, Andy Lutomirski luto@MIT.EDU wrote:
Enabling and disabling the vblank interrupt (on devices that support it) is cheap. So disable it quickly after each interrupt.
So, the concern (and reason for the original design) was that you might lose count of vblank interrupts as vblanks are counted differently while off than while on. If vblank interrupts get enabled near the interrupt time, is it possible that you'll end up off by one somehow?
Leaving them enabled for 'a while' was supposed to reduce the impact of this potential error.
-- keith.packard@intel.com
On Dec 26, 2010, at 3:53 PM, Andrew Lutomirski wrote:
On Wed, Dec 22, 2010 at 4:06 PM, Mario Kleiner mario.kleiner@tuebingen.mpg.de wrote:
There's a new drm module parameter for selecting the timeout: echo 50 > /sys/module/drm/parameters/vblankoffdelay would set the timeout to 50 msecs. A setting of zero will disable the timer, so vblank IRQs would stay on all the time.

The default setting is still 5000 msecs as before, but reducing this to 100 msecs wouldn't be a real problem imho. At least I didn't observe any miscounting during extensive testing with 100 msecs.

The patches in drm-next fix a couple of races that I observed on Intel and Radeon during testing, and a few that I didn't see but could imagine happening. They try to make sure that the final software vblank_count saved at vblank IRQ disable time and the GPU counter are consistent - no off-by-one errors. They also try to detect and filter out spurious vblank interrupts at vblank enable time, e.g., on the Radeon.

There's still one possible race in the disable path which I will try to fix: We don't know when exactly the hardware counter increments with respect to the processing of the vblank interrupt - it could increment a few (dozen/hundred) microseconds before or after the IRQ handler runs, so if you happen to query the hardware counter while the GPU is inside the vblank, you can't be sure whether you picked up the old count or the new count for that vblank.
That's disgusting. Does this affect many GPUs? (I can't imagine why any sensible implementation wouldn't guarantee that the counter increments just before the IRQ.)
This only matters during vblank disable. For that reason it's not such a good idea to disable vblank IRQs from within the vblank IRQ handler. I tried that and it didn't work well: doing it from within the IRQ basically maximizes the chance of hitting the time window in which the race can happen. Delaying within the IRQ handler by a millisecond would fix that, but that's not what we want.

Having the disable in a function triggered by a timer, like now, is the simplest solution I could come up with. There we can burn a few dozen microseconds if necessary to remove this race.
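[Editor's note: for reference, a rough sketch of this timer-driven disable path, paraphrased from the drm_irq.c of that era; locking via dev->vbl_lock is omitted and field names may differ from the actual tree.]

    #include <drm/drmP.h>        /* struct drm_device, drm_vblank_offdelay */
    #include <linux/jiffies.h>
    #include <linux/timer.h>

    /* Each drm_vblank_put() re-arms the disable timer instead of turning
     * the IRQ off directly; the IRQ is disabled only once the refcount
     * has stayed at zero for drm_vblank_offdelay milliseconds. */
    void drm_vblank_put(struct drm_device *dev, int crtc)
    {
        if (atomic_dec_and_test(&dev->vblank_refcount[crtc]) &&
            drm_vblank_offdelay > 0)
            mod_timer(&dev->vblank_disable_timer,
                      jiffies + msecs_to_jiffies(drm_vblank_offdelay));
    }

    /* Timer callback: runs outside the vblank IRQ handler, so it can
     * afford to sample the counter at a safe point within the frame. */
    static void vblank_disable_fn(unsigned long arg)
    {
        struct drm_device *dev = (struct drm_device *)arg;
        int i;

        for (i = 0; i < dev->num_crtcs; i++)
            if (atomic_read(&dev->vblank_refcount[i]) == 0 &&
                dev->vblank_enabled[i])
                vblank_disable_and_save(dev, i);
    }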
Maybe I'm missing something, but other than the race above (which seems like it shouldn't exist on sane hardware), the new code seems more complicated than necessary.
Why not do nothing at all on vblank disable and, on enable, do this:
Call get_vblank_counter and declare the return value to be correct. On each vblank IRQ (or maybe just the first one after enable), read the counter again and if it matches the cached value, then assume we raced on enable (i.e. we enabled right at the beginning of the vblank interval, and this interrupt is redundant). In that case, do nothing. Otherwise increment the cached value by one.
On hardware with the silly race where get_vblank_counter in the IRQ can return a number one too low, use the scanout pos to fix it (i.e. if the scanout position is at the end of the frame, assume we hit that race and increment by 1).
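[Editor's note: a minimal sketch of the scheme Andy describes; enable_vblank_irq(), hw_vblank_count() and scanout_near_frame_end() are hypothetical helpers standing in for driver hooks.]

    #include <linux/types.h>

    void enable_vblank_irq(void);          /* hypothetical driver hooks */
    u32  hw_vblank_count(void);
    bool scanout_near_frame_end(void);

    static u32 cached_count;               /* software vblank count */

    void vblank_enable(void)
    {
        enable_vblank_irq();
        cached_count = hw_vblank_count();  /* declared correct by fiat */
    }

    void vblank_irq(void)
    {
        u32 hw = hw_vblank_count();

        if (scanout_near_frame_end())
            hw++;            /* raw read can be one too low inside vblank */

        if (hw == cached_count)
            return;          /* raced with enable: redundant interrupt */

        cached_count++;      /* normal case: exactly one vblank passed */
    }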
This means that it would be safe to disable vblanks from any context at all, because it has no effect other than turning off the interrupt.
--Andy
There could be other races that I couldn't think of and that I didn't see during my testing with my two Radeons and one Intel GPU. Therefore I think we should keep vblank IRQs enabled for at least two or maybe three refresh cycles if they aren't used by anyone. Apps that want to schedule swaps very precisely, and the DDX/drm/KMS driver itself, do multiple queries in quick succession for a typical swapbuffers call, i.e., drm_vblank_get() -> query -> drm_vblank_put(), so on an otherwise idle graphics system the refcount will toggle between zero and one multiple times during a swap, usually within a few milliseconds.

If the timeout is big enough that IRQs don't get disabled within such a sequence of toggles, even if there's a bit of scheduling delay for the X server or client, then a client will see an at least consistent vblank count during a swap, even if there are still some races hiding somewhere that we don't handle properly. That should be good enough, and paranoid clients can always increase the timeout value or set it to infinite.
E.g., my toolkit schedules a swap for a specific system time like this:
1. glXGetSyncValuesOML(... &base_msc, &base_ust);
2. Calculate target_msc based on the user-provided swap deadline t, with (base_msc, base_ust) as a baseline.
3. glXSwapBuffersMscOML(..., target_msc, ...);
4. glXWaitForSbcOML(), or use INTEL_swap_event, to retrieve the true msc and ust of swap completion.
=> It doesn't matter if there is an off-by-one error in vblank counting due to an unknown race, as long as it doesn't happen between steps 1 and 4. As long as there aren't any client/X-server scheduling delays of more than /sys/module/drm/parameters/vblankoffdelay msecs between steps 1 and 3, nothing can go wrong even if there are race conditions left in that area.

=> 50-100 msecs as the new default would probably be good enough, and would at the same time prevent the "blinking cursor keeps vblank IRQs on all the time" problem.
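[Editor's note: for illustration, here is that four-step sequence as a minimal C sketch against GLX_OML_sync_control. This is an editor's example, not Mario's toolkit code; a real client would resolve the entry points via glXGetProcAddress and check for errors.]

    #define GLX_GLXEXT_PROTOTYPES
    #include <GL/glx.h>
    #include <GL/glxext.h>   /* GLX_OML_sync_control prototypes */
    #include <stdint.h>

    /* Schedule a swap for the absolute system time 'deadline_us'
     * (microseconds, same timebase as UST), then wait for completion.
     * 'refresh_us' is the nominal duration of one refresh cycle. */
    static void swap_at_time(Display *dpy, GLXDrawable draw,
                             int64_t deadline_us, int64_t refresh_us)
    {
        int64_t base_ust, base_msc, base_sbc;
        int64_t ust, msc, sbc;

        /* 1. Get a (msc, ust) baseline. */
        glXGetSyncValuesOML(dpy, draw, &base_ust, &base_msc, &base_sbc);

        /* 2. Compute the vblank count at which the deadline falls. */
        int64_t target_msc = base_msc +
            (deadline_us - base_ust + refresh_us - 1) / refresh_us;

        /* 3. Queue the swap for that vblank (divisor/remainder unused). */
        int64_t target_sbc = glXSwapBuffersMscOML(dpy, draw, target_msc, 0, 0);

        /* 4. Block until the swap completes; msc/ust then hold the true
         *    completion time. */
        glXWaitForSbcOML(dpy, draw, target_sbc, &ust, &msc, &sbc);
    }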
I didn't reduce the timeout in the current patches because the filtering for race conditions and other GPU funkiness relies on somewhat precise vblank timestamps, and the timestamping hooks aren't yet implemented in the nouveau KMS. Maybe I'll manage to get this working over Christmas. The patches to nouveau would be simple, I just don't know the MMIO register addresses for CRTC scanout position on NVIDIA GPUs.
-mario
Mario Kleiner
Max Planck Institute for Biological Cybernetics
Spemannstr. 38, 72076 Tuebingen, Germany

e-mail: mario.kleiner@tuebingen.mpg.de
office: +49 (0)7071/601-1623
fax: +49 (0)7071/601-616
www: http://www.kyb.tuebingen.mpg.de/~kleinerm

"For a successful technology, reality must take precedence over public relations, for Nature cannot be fooled." (Richard Feynman)
On Dec 26, 2010, at 3:53 PM, Andrew Lutomirski wrote:
That's disgusting. Does this affect many GPUs? (I can't imagine why any sensible implementation wouldn't guarantee that the counter increments just before the IRQ.)
;-). I don't know, but at least on the tested R500 and R600 class Radeons this was the case, so I assume it's at least this way on many Radeon GPUs (probably all Avivo parts?) out there. We don't have any Evergreen GPUs yet in our lab, so I don't know how the more recent parts behave. Also it doesn't matter

I guess it's also a matter of definition when a new video frame starts? Leading edge / trailing edge of vblank? Start of vsync? Something else?
Maybe I'm missing something, but other than the race above (which seems like it shouldn't exist on sane hardware), the new code seems more complicated than necessary.
I don't think it's more complicated than necessary for what it tries to achieve, but of course I'm a bit biased. It also started off more simply and grew a bit as I found new issues with the tested GPUs.

The aim is to fix a couple of real races and to make vblank counts and timestamps as trustworthy and oml_sync_control spec-conformant (see http://www.opengl.org/registry/specs/OML/glx_sync_control.txt) as possible. It only consumes a few additional bytes of memory (approx. 40 bytes) per CRTC, doesn't use excessive time inside the IRQ handler, and tries to avoid taking locks that are shared between IRQ and non-IRQ context, so as not to delay IRQ execution, also on a kernel with the preempt_rt patch applied (important for my use case and other hard realtime apps). It's pretty self-contained, and because most of it is driver-independent it can handle similar issues on different GPUs and KMS drivers without the need for us to check each single GPU + driver combo for such issues.
1. There's the hardware vblank counter race, and it's there on lots of existing hardware, regardless of whether that is sane or not, so it needs to be handled.

2. There are GPUs that fire spurious vblank IRQs as soon as you enable IRQs; you need to filter those out, otherwise counts and timestamps will be wrong for at least one refresh cycle after vblank IRQ enable. You don't know when these redundant ones get delivered, or whether they get delivered at all, because a real vblank IRQ enable is triggered by some userspace call, not locked to the video refresh cycle, and because the enable code itself holds spin_lock_irqsave locks and may or may not (depending on number of cores and IRQ routing) prevent delivery of vblank IRQs concurrent to its own execution -> a possible race between drm_handle_vblank() and drm_update_vblank_count() each time you call drm_vblank_get(). (See the sketch after this list.)

3. There are GPUs that sometimes, but not always, fire the IRQ too early (1-3 scanlines observed on Radeons), so you get your IRQ before the associated vblank interval and need to do at least all timestamping as if you were already in that vblank. This may depend both on the video mode and on the dispatch delay of the vblank IRQ handler (e.g., due to some unrelated code holding off IRQs, for example by holding spin_lock_irqsave() locks).

4. In the old code it could happen that the vblank counter got incremented after it was "disabled": vblank IRQs were turned off in the GPU, but there was still a vblank IRQ pending (e.g., due to some spin_lock_irqsave on the relevant core), incrementing the counter after it was supposedly frozen. With the new oml_sync_control timestamping in place, such off-by-ones would be worse, as they would also corrupt timestamps.

There were some more issues which I can't remember off the top of my head that get handled by the current code (note to myself: take more notes!).
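[Editor's note: a simplified sketch of the timestamp-based filtering in item 2. The real code compares timestamps against the frame duration rather than for exact equality, and compute_vblank_timestamp() is a hypothetical stand-in for the scanout-position-based timestamping hook.]

    #include <linux/time.h>
    #include <linux/types.h>

    void compute_vblank_timestamp(struct timeval *tv);  /* hypothetical */

    struct vblank_state {
        u32 count;               /* software vblank count */
        struct timeval last_ts;  /* timestamp of the last counted vblank */
    };

    /* Returns true if this IRQ advanced the count, false if it was
     * filtered as a spurious duplicate fired around vblank enable. */
    bool handle_vblank_irq(struct vblank_state *vbl)
    {
        struct timeval now;

        compute_vblank_timestamp(&now);

        /* Same timestamp as the vblank we already counted: this IRQ
         * belongs to the same vblank interval, so ignore it. */
        if (now.tv_sec == vbl->last_ts.tv_sec &&
            now.tv_usec == vbl->last_ts.tv_usec)
            return false;

        vbl->count++;
        vbl->last_ts = now;
        return true;
    }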
Why not do nothing at all on vblank disable and, on enable, do this:
Call get_vblank_counter and declare the return value to be correct.
Because declaring it to be correct isn't the same as it being correct. Also, the code needs to handle wraparound of the hardware counter, and for that it needs correct start values from the vblank disable routine, which is why the disable routine needs to work as race-free as possible. The vblank count for a frame also needs to be consistent with the vblank timestamp for that frame at all times, otherwise the oml_sync_control extension becomes too unreliable to be trustworthy and useful for serious applications.
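[Editor's note: the wraparound handling looks roughly like this, paraphrased from the drm_update_vblank_count() logic; it assumes the hardware counter wraps at a power of two, so max_vblank_count is a 2^n - 1 mask.]

    #include <linux/atomic.h>
    #include <linux/types.h>

    /* Fold the hardware ticks that elapsed while the IRQ was off into
     * the software count. 'last_hw_at_disable' was saved by the disable
     * routine, which is why that value must be recorded race-free. */
    static void catch_up_vblank_count(atomic_t *sw_count, u32 cur_hw,
                                      u32 last_hw_at_disable,
                                      u32 max_vblank_count /* 2^n - 1 */)
    {
        /* Unsigned subtract-and-mask yields the elapsed tick count even
         * if the hardware counter wrapped in between. */
        u32 diff = (cur_hw - last_hw_at_disable) & max_vblank_count;

        atomic_add(diff, sw_count);
    }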
On each vblank IRQ (or maybe just the first one after enable), read the counter again and if it matches the cached value, then assume we raced on enable (i.e. we enabled right at the beginning of the vblank interval, and this interrupt is redundant). In that case, do nothing. Otherwise increment the cached value by one.
On hardware with the silly race where get_vblank_counter in the IRQ can return a number one too low, use the scanout pos to fix it (i.e. if the scanout position is at the end of the frame, assume we hit that race and increment by 1).
See the other races above. IIRC I tried something similar already, and they made it fail/unreliable. The current patch uses the vblank timestamp instead of the hardware vblank counter to detect and filter redundant IRQs, because with the timestamping patches, at least on Intel and ATI GPUs (and hopefully nouveau/NVIDIA and others asap), these are well defined and accurate (to define the end of a vblank interval), so they can serve as a reference point. The to-be-done fix for race condition #1 will also use the scanout position as a fixed reference.

The sample client code (the swap-scheduling sequence quoted earlier) needs precise and trustworthy return values from glXGetSyncValuesOML() for it to work. If vblanks are disabled at the time glXGetSyncValuesOML() is invoked, then that function will trigger the real vblank IRQ enable path and use its return values for swap scheduling - i.e., unless it is called within or close to a vblank, it uses the vblank count and timestamp computed in drm_update_vblank_count(), usually before the vblank IRQ had a chance to run. For that reason it is important for my kind of applications that it really delivers the right counts and timestamps, especially in the enable path.

This is the context, in case you're interested why I'm so protective of the current implementation: The toolkit I'm developing is probably one of the currently most demanding clients of the new dri2 swap & sync bits, and it is used for neuroscience research. Many of the experiments there require very precise visual timing and often sub-millisecond accurate timestamping. Too many unhandled races in the wrong places could really spoil the work of the scientists that are my users. As long as we have a disable timeout of 5 seconds, vblank IRQs probably won't get disabled at all during most experiment sessions, and even if they do, the frequency of possible screwups is probably small enough to be annoying but manageable with statistics (detecting/discarding outliers in experimental results etc.). At a disable timeout of around 50 msecs, the error rate would be unbearable. That's why I would like to make sure that the implementation can handle at least the already known quirks of a large number of the GPUs out there if we go down to 50 msecs. But I assume races in that code would affect the quality of "normal" media players as well, once we choose such low timeouts.
Thanks and belated happy x-mas, -mario
On Mon, Dec 27, 2010 at 12:58:10AM +0100, Mario Kleiner wrote:
- There are GPUs that fire spurious vblank IRQs as soon as you enable IRQs
You're sure this isn't simply a matter of the driver forgetting to ack the irq just before enabling it?
On Dec 27, 2010, at 12:16 PM, Ville Syrjälä wrote:
You're sure this isn't simply a matter of the driver forgetting to ack the irq just before enabling it?
Good point. This was on Radeon. I can't remember for certain whether it happened always or only frequently. I can check that later this week when I'm back at the test machine.

Anyway, it's good to be robust against such problems, regardless of whether they are GPU quirks or driver bugs. The current implementation would filter the redundant vblank IRQ and DRM_DEBUG a message if the drm.debug parameter is set.
thanks, -mario
--
Ville Syrjälä
syrjala@sci.fi
http://www.sci.fi/~syrjala/
On Mon, Dec 27, 2010 at 7:58 AM, Mario Kleiner mario.kleiner@tuebingen.mpg.de wrote:
OK, having read the glx_sync_control spec this makes a lot more sense. I reread the code and I have two questions right off the bat:
1. What's the point of vblank_disable_and_save? AFAICT all that it does (other than disabling vblanks) is to possibly increment _vblank_count, ensuring that (barring races) it's correct as soon as the call returns. But there are only three ways that vblank_disable_and_save gets called: the disable timer, and two paths in i915 that turn off vblanks when disabling crtcs. The latter two shouldn't really care because the crtc is about to turn off, and the former, being a timer, doesn't return into anything that cares about the counter.
2. Why are the timestamps coming from do_gettimeofday instead of ktime_get? Currently, if someone changes the system time, then weird things could happen, I think.
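[Editor's note: for context, the distinction Andy is pointing at, as an illustration rather than code from the patches.]

    #include <linux/ktime.h>
    #include <linux/time.h>

    static void timestamp_example(void)
    {
        struct timeval tv;
        ktime_t mono;

        /* Wall-clock time: jumps forwards or backwards when someone
         * calls settimeofday() or NTP steps the clock. */
        do_gettimeofday(&tv);

        /* CLOCK_MONOTONIC: only ever moves forward, so intervals
         * measured with it stay sane across system time changes. */
        mono = ktime_get();
    }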
--Andy
On Sun, 2010-12-26 at 09:53 -0500, Andrew Lutomirski wrote:
That's disgusting. Does this affect many GPUs? (I can't imagine why any sensible implementation wouldn't guarantee that the counter increments just before the IRQ.)
Actually, while we want the interrupt to trigger at the beginning of vblank (so we get a chance to do work within the vblank period), at least on Radeons, the hardware frame counter increments at the beginning of scanout, i.e. at the end of vblank.
At some point we tried to compensate for that by adding 1 to the hardware frame counter while inside vblank, but IIRC that wasn't 100% reliable either but would sometimes result in the counter being considered too high by 1. So instead we've tried to keep the infrastructure robust against the interrupt and hardware frame counter increment not necessarily occurring at the same time.
(Even if both events always occurred at the same time on the GPU, we might not be able to take advantage of that due to interrupt latencies)
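[Editor's note: a sketch of the compensation Michel mentions, for concreteness; the helpers are hypothetical, and as he notes the approach was not 100% reliable in practice.]

    #include <linux/types.h>

    struct radeon_crtc;                                     /* opaque here */
    u32  radeon_get_frame_count(struct radeon_crtc *crtc);  /* hypothetical */
    bool radeon_crtc_in_vblank(struct radeon_crtc *crtc);   /* hypothetical */

    /* On Radeons the hardware frame counter increments at the start of
     * scanout (end of vblank), while the IRQ fires at the start of
     * vblank. Pretend the increment already happened once inside vblank. */
    u32 radeon_cooked_frame_count(struct radeon_crtc *crtc)
    {
        u32 count = radeon_get_frame_count(crtc);

        if (radeon_crtc_in_vblank(crtc))
            count++;   /* may over-count by one in rare cases, per Michel */

        return count;
    }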