[PATCH] xf86drm.c: add counter for ioctl restarting

List overview All Threads
Download

newer

older

[Bug 17902] [830M] need support...

[PATCH resend 1/2] drm: Fix EDID...

Anton V. Boyarshinov

13 Apr 2012 13 Apr '12

1:26 p.m.

In some cases ioclt->alarm->ioctl loop can be infinite: ioctl(7, 0x40086482, 0xbfb62738) = ? ERESTARTSYS (To be restarted) --- SIGALRM (Alarm clock) @ 0 (0) --- sigreturn() = ? (mask now []) ioctl(7, 0x40086482, 0xbfb62738) = ? ERESTARTSYS (To be restarted) and forever.

It seems, that limiting ioctl restarting by some resonable number of trys is a dirty but working way to prevent Xorg lockups.

Signed-off-by: Anton V. Boyarshinov boyarsh@altlinux.org --- xf86drm.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/xf86drm.c b/xf86drm.c index 6ea068f..9663f21 100644 --- a/xf86drm.c +++ b/xf86drm.c @@ -162,10 +162,11 @@ int drmIoctl(int fd, unsigned long request, void *arg) { int ret; + int count=0;

do { ret = ioctl(fd, request, arg); - } while (ret == -1 && (errno == EINTR || errno == EAGAIN)); + } while (ret == -1 && (errno == EINTR || errno == EAGAIN) && ++count < 100 ); return ret; }

-- 1.7.5.4

Show replies by date

Chris Wilson

13 Apr 13 Apr

1:40 p.m.

On Fri, 13 Apr 2012 17:26:42 +0400, "Anton V. Boyarshinov" boyarsh@altlinux.org wrote:

...

In some cases ioclt->alarm->ioctl loop can be infinite: ioctl(7, 0x40086482, 0xbfb62738) = ? ERESTARTSYS (To be restarted) --- SIGALRM (Alarm clock) @ 0 (0) --- sigreturn() = ? (mask now []) ioctl(7, 0x40086482, 0xbfb62738) = ? ERESTARTSYS (To be restarted) and forever.

It seems, that limiting ioctl restarting by some resonable number of trys is a dirty but working way to prevent Xorg lockups.

And you have audited all callpaths to make sure that they can handle EINTR? Up until now it was part of the libdrm api that it did not return EINTR...

...

From my naive pov, we could just fix the root cause of the bug rather

than escalating the bug into a random failure. -Chris

-- Chris Wilson, Intel Open Source Technology Centre

Anton V. Boyarshinov

16 Apr 16 Apr

8:45 a.m.

Hello

...

...
It seems, that limiting ioctl restarting by some resonable number of trys is a dirty but working way to prevent Xorg lockups.

And you have audited all callpaths to make sure that they can handle EINTR? Up until now it was part of the libdrm api that it did not return EINTR...

Speaking strictly no but this code works fine on my Nvidia ION box with nouveau about a 4 monthes without user-visible problems. It also have some logging code that shows that all ioctls which have been restarted more then 5 times are restarting forewer (actually 100 times), but this logging code really unpstreamable.

...

From my naive pov, we could just fix the root cause of the bug rather than escalating the bug into a random failure.

Yes. This is a dirty trick. But from the user scope, Xorg lockup is a nasty thing. I have some lockups in a day a year ago, then some bugs was fixed and a have some lockups in a week. Better, but not so good. With this patch i have no lockups.

And, from the other side, this patch can be a tool for finding real bugs if someone add proper logging to if (i don't know the right way to log problems from libdrm) with ioctl, parameters and restart count. Users will be able to add this informaition to the bugreports without accessing by network to locked box.

Regards Anton

Daniel Vetter

13 Apr 13 Apr

1:42 p.m.

On Fri, Apr 13, 2012 at 05:26:42PM +0400, Anton V. Boyarshinov wrote:

...

In some cases ioclt->alarm->ioctl loop can be infinite: ioctl(7, 0x40086482, 0xbfb62738) = ? ERESTARTSYS (To be restarted) --- SIGALRM (Alarm clock) @ 0 (0) --- sigreturn() = ? (mask now []) ioctl(7, 0x40086482, 0xbfb62738) = ? ERESTARTSYS (To be restarted) and forever.

It seems, that limiting ioctl restarting by some resonable number of trys is a dirty but working way to prevent Xorg lockups.

Signed-off-by: Anton V. Boyarshinov boyarsh@altlinux.org

xf86drm.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/xf86drm.c b/xf86drm.c index 6ea068f..9663f21 100644 --- a/xf86drm.c +++ b/xf86drm.c @@ -162,10 +162,11 @@ int drmIoctl(int fd, unsigned long request, void *arg) { int ret;

int count=0;

do { ret = ioctl(fd, request, arg);

} while (ret == -1 && (errno == EINTR || errno == EAGAIN));

} while (ret == -1 && (errno == EINTR || errno == EAGAIN) && ++count < 100 );

We rely on restarting after signals when blocking for the gpu, busy gpu plus mouse wiggling can easily reach that.

NACKed-by: Daniel Vetter daniel.vetter@ffwll.ch

Obviously if we have a dead gpu, we need to break out of this loop. But detecting a dead gpu (and returning an appropriate error like EIO) is the kernel's job.

Cheers, Daniel

...

 return ret;
}

-- 1.7.5.4

dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

-- Daniel Vetter Mail: daniel@ffwll.ch Mobile: +41 (0)79 365 57 48

Ville Syrjälä

2:45 p.m.

On Fri, Apr 13, 2012 at 03:42:16PM +0200, Daniel Vetter wrote:

...

On Fri, Apr 13, 2012 at 05:26:42PM +0400, Anton V. Boyarshinov wrote:

...
In some cases ioclt->alarm->ioctl loop can be infinite: ioctl(7, 0x40086482, 0xbfb62738) = ? ERESTARTSYS (To be restarted) --- SIGALRM (Alarm clock) @ 0 (0) --- sigreturn() = ? (mask now []) ioctl(7, 0x40086482, 0xbfb62738) = ? ERESTARTSYS (To be restarted) and forever.

It seems, that limiting ioctl restarting by some resonable number of trys is a dirty but working way to prevent Xorg lockups.

Signed-off-by: Anton V. Boyarshinov boyarsh@altlinux.org

xf86drm.c | 3 ++- 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/xf86drm.c b/xf86drm.c index 6ea068f..9663f21 100644 --- a/xf86drm.c +++ b/xf86drm.c @@ -162,10 +162,11 @@ int drmIoctl(int fd, unsigned long request, void *arg) { int ret;

int count=0;

do { ret = ioctl(fd, request, arg);

} while (ret == -1 && (errno == EINTR || errno == EAGAIN));

} while (ret == -1 && (errno == EINTR || errno == EAGAIN) && ++count < 100 );

We rely on restarting after signals when blocking for the gpu, busy gpu plus mouse wiggling can easily reach that.

NACKed-by: Daniel Vetter daniel.vetter@ffwll.ch

Obviously if we have a dead gpu, we need to break out of this loop. But detecting a dead gpu (and returning an appropriate error like EIO) is the kernel's job.

The problem with EINTR is that someone else could be poking at the device at the same time, causing the restarted ioctls to wait some more, and again be interrupted by a signal. Or the hardware could be itself responsible for this problem.

I ran into this issue once with a "wait for vblank" ioctl, which had a fixed relative timeout value. So every time I would start to wait, a signal would arrive causing the vblank to be missed. The restarted ioctl would then start waiting using the original timeout, allowing the signal to interrupt it again.

Any syscall which has a relative timeout should be desgigned so that the timeout gets updated by the kernel when interrupted, so that the originally specified timeout would really be the total timeout.

-- Ville Syrjälä Intel OTC

Anton V. Boyarshinov

16 Apr 16 Apr

8:54 a.m.

...

Obviously if we have a dead gpu, we need to break out of this loop. But detecting a dead gpu (and returning an appropriate error like EIO) is the kernel's job.

In my case gpu isn't really dead. It works after some ioctl skip. I understend that it is a driver bug in any case, but i thing that working around this bugs with proper logging (absent in this patch) much better than lockup.

Regards, Anton

Adam Jackson

2:12 p.m.

On Mon, 2012-04-16 at 12:54 +0400, Anton V. Boyarshinov wrote:

...

Hi

...
Obviously if we have a dead gpu, we need to break out of this loop. But detecting a dead gpu (and returning an appropriate error like EIO) is the kernel's job.

In my case gpu isn't really dead. It works after some ioctl skip. I understend that it is a driver bug in any case, but i thing that working around this bugs with proper logging (absent in this patch) much better than lockup.

But your workaround isn't harmless. It adds new failure modes for other cases that currently work. That's worse, not better.

- ajax

Anton V. Boyarshinov

3:05 p.m.

...

...
around this bugs with proper logging (absent in this patch) much better than lockup.

But your workaround isn't harmless. It adds new failure modes for other cases that currently work. That's worse, not better.

I think that from user scope nothing (including Xorg crash) is worth then lock with 2 abilities: reset or network login and kill -9 Xorg.

I think that even if ioctl's should be restarted forever as present, some error detecting and logging code should be in this place.

Regards, Anton

4763

Age (days ago)

4766

Last active (days ago)

dri-devel@lists.freedesktop.org

7 comments

5 participants

tags (0)

participants (5)

Adam Jackson
Anton V. Boyarshinov
Chris Wilson
Daniel Vetter
Ville Syrjälä