https://bugzilla.kernel.org/show_bug.cgi?id=196777
Bug ID: 196777
Summary: Virtual guest using video device QXL does not reach GDM
Product: Drivers
Version: 2.5
Kernel Version: 4.12.5
Hardware: x86-64
OS: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: Video(DRI - non Intel)
Assignee: drivers_video-dri@kernel-bugs.osdl.org
Reporter: jfrieben@hotmail.com
Regression: No
Description of problem: When booting current Fedora 26 as a virtual guest in gnome-boxes, the system does not reach GDM. It gets stuck during the graphical boot procedure. After removing boot option "rhgb", the system gets frozen after attempting to launch the graphical login manager GDM.
Version-Release number of selected component (if applicable): kernel-4.12.5-300.fc26
How reproducible: Always
Steps to Reproduce: 1. Boot current Fedora 26 virtual guest in gnome-boxes.
Actual results: System gets stuck during the graphical boot procedure.
Expected results: System launches GDM successfully.
Additional info:
- After adding boot option "nomodeset", the system launches GDM on Xorg successfully.
- After removing "rhgb" from the kernel options and running 'startx' at run level 3, the GNOME on Xorg session starts up as expected.
- After changing the video device from "qxl" to "virtio", the system launches GDM on Wayland successfully.
- This issue was introduced in the 4.12.x kernel series and now also affects 4.13.x kernel series.
Joachim Frieben (jfrieben@hotmail.com) changed:
What      |Removed  |Added
----------------------------------------------------------------------------
Tree      |Mainline |Fedora
Regression|No       |Yes
Krzysztof Nowicki (krissn@op.pl) changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |krissn@op.pl
--- Comment #1 from Krzysztof Nowicki (krissn@op.pl) ---
I have bisected this to a series of commits introducing atomic modesetting to the QXL driver (more specifically commit 3538e80a869be74764ae7db484b371894f04d0f8).
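For readers unfamiliar with bisection, here is a minimal sketch of the search idea behind it. It is not the procedure actually used in comment #1. It assumes a linear history ordered oldest to newest, a known-good first commit, a known-bad last commit, and that every commit after the first bad one is also bad. The commit names and the `is_bad` test are hypothetical placeholders.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits[0] is good, commits[-1] is bad, and the history
    switches from good to bad exactly once.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the first bad commit is at mid or earlier
        else:
            lo = mid  # the first bad commit is after mid
    return commits[hi]


# Hypothetical history: placeholder names, not real kernel commits.
history = [f"c{i}" for i in range(16)]
culprit = first_bad(history, lambda c: int(c[1:]) >= 11)
```

Each test (here, booting the guest and checking whether GDM appears) halves the remaining range, so 16 candidate commits need only four boots.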
Gerd Hoffmann (kraxel@redhat.com) changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |kraxel@redhat.com
--- Comment #2 from Gerd Hoffmann (kraxel@redhat.com) ---
Can you check whether this patch fixes it?
https://www.kraxel.org/cgit/linux/commit/?h=drm-qxl-atomic&id=b16a0bb7a9...
--- Comment #3 from Krzysztof Nowicki (krissn@op.pl) ---
(In reply to Gerd Hoffmann from comment #2)
> Can you check whether this patch fixes it?
> https://www.kraxel.org/cgit/linux/commit/?h=drm-qxl-atomic&id=b16a0bb7a9d54d9dd256059b35adf6f96fddc22e
I have applied this patch against a clean 4.12.0 and unfortunately the problem is still easily reproducible.
--- Comment #4 from Gerd Hoffmann (kraxel@redhat.com) ---
Retested 4.13 + comment #2 patch.
Plymouth (aka graphical boot) does indeed hang the machine.
When rhgb is disabled, gdm comes up just fine though, in both wayland and xorg mode. So apparently we have two issues here, and the patch fixes only one of them.
The plymouth hang appears to be pretty serious; the whole machine appears to be toast. I can't log in over the network to see what is going on, so it's not only the display which is f*cked up. Nothing is written to the logs either. When enabling the serial console to see the logs, plymouth skips the splash screen though, so the issue doesn't trigger any more. Hmm, I'm running out of ideas ...
Takashi Iwai (tiwai@suse.de) changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |tiwai@suse.de
--- Comment #5 from Takashi Iwai (tiwai@suse.de) ---
Isn't the plymouth hang the dup of bug 102338? If so, it's BUG_ON() in ttm_bo_kmap() for non-empty bo->swap list.
BTW, with the patch in comment 2 applied, qemu itself crashes on my machine, not the VM :-<
id 0, group 0, virt start 0, virt end ffffffffffffffff, generation 0, delta 0
id 1, group 1, virt start 7fec96800000, virt end 7fec9a7fe000, generation 0, delta 7fec96800000
id 2, group 1, virt start 7fec92400000, virt end 7fec96400000, generation 0, delta 7fec92400000
((null):6072): Spice-Warning **: red_memslots.c:69:validate_virt: virtual address out of range
    virt=0x7fec96b07018+0xff000000 slot_id=1 group_id=1
    slot=0x7fec96800000-0x7fec9a7fe000 delta=0x7fec96800000
((null):6072): Spice-ERROR **: red_parse_qxl.c:334:red_get_clip_rects: assertion `num_rects * sizeof(QXLRect) == size' failed
Thread 24 (Thread 0x7fec8189c700 (LWP 6097)):
....
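The warning above can be checked by hand. The sketch below is an illustration only, not the spice server's code. It assumes that `virt=A+B` in the log means a start address `A` and a length `B`, and that the slot end is treated as an exclusive bound. The function name `in_slot` is made up for this example.

```python
def in_slot(virt: int, size: int, slot_start: int, slot_end: int) -> bool:
    """True if [virt, virt + size) lies entirely inside [slot_start, slot_end)."""
    return slot_start <= virt and virt + size <= slot_end


# Values copied from the log above (slot_id=1).
VIRT = 0x7FEC96B07018
SIZE = 0xFF000000
SLOT_START = 0x7FEC96800000
SLOT_END = 0x7FEC9A7FE000

slot_len = SLOT_END - SLOT_START           # 0x3ffe000, roughly 64 MiB
start_ok = in_slot(VIRT, 0, SLOT_START, SLOT_END)
range_ok = in_slot(VIRT, SIZE, SLOT_START, SLOT_END)
```

Under these assumptions the start address lies inside slot 1, but the requested length of 0xff000000 bytes (close to 4 GiB) is far larger than the slot itself, so the range check fails. That would be consistent with a bogus size reaching the clip-rect parser, which then trips the `num_rects * sizeof(QXLRect) == size` assertion.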
--- Comment #6 from Takashi Iwai (tiwai@suse.de) ---
(In reply to Takashi Iwai from comment #5)
> Isn't the plymouth hang the dup of bug 102338?
Erm, I meant the fdo bugzilla 102338, https://bugs.freedesktop.org/show_bug.cgi?id=102338
Justin M. Forbes (jmforbes@linuxtx.org) changed:
What |Removed |Added
----------------------------------------------------------------------------
CC   |        |jmforbes@linuxtx.org
--- Comment #7 from Justin M. Forbes (jmforbes@linuxtx.org) ---
(In reply to Gerd Hoffmann from comment #4)
> Retested 4.13 + comment #2 patch.
> Plymouth (aka graphical boot) does indeed hang the machine.
> When rhgb is disabled, gdm comes up just fine though, in both wayland and xorg mode. So apparently we have two issues here, and the patch fixes only one of them.
> The plymouth hang appears to be pretty serious; the whole machine appears to be toast. I can't log in over the network to see what is going on, so it's not only the display which is f*cked up. Nothing is written to the logs either. When enabling the serial console to see the logs, plymouth skips the splash screen though, so the issue doesn't trigger any more. Hmm, I'm running out of ideas ...
I dropped the patch into the 4.12.12 update for Fedora kernels. The complaints I am seeing are about screen strobing with this patch, so it has been backed out for now. The plymouth issue is also a big one.
--- Comment #8 from Gerd Hoffmann (kraxel@redhat.com) ---
https://www.kraxel.org/cgit/linux/log/?h=qxl-4.13
Please test.
--- Comment #9 from Krzysztof Nowicki (krissn@op.pl) ---
(In reply to Gerd Hoffmann from comment #8)
> https://www.kraxel.org/cgit/linux/log/?h=qxl-4.13
> Please test.
Applied over vanilla 4.12 on top of the patch from comment #2.
SDDM started fine, I was able to login to the Plasma session and use it for some time. Test repeated twice with the same results. No errors found in dmesg and system is stable.
As for me - TEST PASSED
Thanks :)
--- Comment #10 from Justin M. Forbes (jmforbes@linuxtx.org) ---
After going through these with a number of users:
qxl: fix primary surface handling - This patch is widely reported to cause serious screen flickering that is not there without it, making the system unusable.
qxl: fix pinning: This patch resolves the GDM login issues with plymouth.
--- Comment #11 from Gerd Hoffmann (kraxel@redhat.com) ---
(In reply to Justin M. Forbes from comment #10)
> After going through these with a number of users:
> qxl: fix primary surface handling - This patch is widely reported to cause serious screen flickering that is not there without it, making the system unusable.
Workaround #1: turn off wayland.
Workaround #2: use virtio-vga instead. wayland doesn't use qxl 2d accel anyway.
The fundamental problem here is that the qxl virtual hardware simply doesn't support pageflip; we have to destroy + re-create the primary surface instead. This is where the flicker comes from.
Commit "058e9f5c82 drm/qxl: simple crtc page flipping emulated using buffer copy" handles the issue with a pretty gross hack, blitting one framebuffer over the other instead of a proper primary surface update. With atomic modesetting that doesn't work any more.
We could possibly decouple the primary surface from the drm framebuffers, so the drm framebuffers effectively become shadow framebuffers, and every display update becomes a drm framebuffer -> primary surface blit. Not sure whether that scheme can work properly with xorg though. It also has a high chance of causing xorg performance regressions.
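The shadow framebuffer idea can be illustrated with a toy model. This is a hedged sketch and not the qxl driver: `Surface`, `blit` and `ShadowedDisplay` are hypothetical names, and pixels are plain integers in a list. The point is only that a display update copies into a primary surface that stays alive, instead of destroying and re-creating it.

```python
class Surface:
    """A width x height grid of pixel values stored row by row."""

    def __init__(self, width, height, fill=0):
        self.width = width
        self.height = height
        self.pixels = [fill] * (width * height)


def blit(src, dst):
    """Copy every pixel of src into dst; both must have the same size."""
    if (src.width, src.height) != (dst.width, dst.height):
        raise ValueError("surface sizes differ")
    dst.pixels[:] = src.pixels


class ShadowedDisplay:
    """Hypothetical model, not the qxl driver: one long-lived primary surface."""

    def __init__(self, width, height):
        self.primary = Surface(width, height)
        self.primary_creations = 1  # the primary is created once, up front

    def show(self, framebuffer):
        # A display update is a copy into the existing primary surface,
        # so no destroy + re-create cycle is needed.
        blit(framebuffer, self.primary)


display = ShadowedDisplay(4, 2)
front = Surface(4, 2, fill=1)
back = Surface(4, 2, fill=2)
display.show(front)  # first "flip"
display.show(back)   # second "flip"
```

The cost is visible in the model too: every update copies a full frame, which matches the performance concern raised above.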
> qxl: fix pinning: This patch resolves the GDM login issues with plymouth.
Good.
--- Comment #12 from Justin M. Forbes (jmforbes@linuxtx.org) ---
(In reply to Gerd Hoffmann from comment #11)
> (In reply to Justin M. Forbes from comment #10)
> > After going through these with a number of users:
> > qxl: fix primary surface handling - This patch is widely reported to cause serious screen flickering that is not there without it, making the system unusable.
> Workaround #1: turn off wayland.
Possible as a short term fix, but with wayland being pretty much "the way forward" it doesn't seem to be a workable long term solution.
> Workaround #2: use virtio-vga instead. wayland doesn't use qxl 2d accel anyway.
> The fundamental problem here is that the qxl virtual hardware simply doesn't support pageflip; we have to destroy + re-create the primary surface instead. This is where the flicker comes from.
> Commit "058e9f5c82 drm/qxl: simple crtc page flipping emulated using buffer copy" handles the issue with a pretty gross hack, blitting one framebuffer over the other instead of a proper primary surface update. With atomic modesetting that doesn't work any more.
> We could possibly decouple the primary surface from the drm framebuffers, so the drm framebuffers effectively become shadow framebuffers, and every display update becomes a drm framebuffer -> primary surface blit. Not sure whether that scheme can work properly with xorg though. It also has a high chance of causing xorg performance regressions.
So this brings up an interesting problem in how things are to move forward. It came up as a blocker in Fedora 27 today. Let's say we find a way to force boxes to revert to virtio-vga. That wouldn't change any existing VMs, and it is something we have no control over when the host is not Fedora as well. It also would be a problem for non wayland guests.
--- Comment #13 from Gerd Hoffmann (kraxel@redhat.com) ---
> > Workaround #1: turn off wayland.
> Possible as a short term fix, but with wayland being pretty much "the way forward" it doesn't seem to be a workable long term solution.
Yes.
> So this brings up an interesting problem in how things are to move forward.
Kicked off a discussion on the spice-devel list. https://lists.freedesktop.org/archives/spice-devel/2017-October/040310.html
> It came up as a blocker in Fedora 27 today. Let's say we find a way to force boxes to revert to virtio-vga. That wouldn't change any existing VMs, and it is something we have no control over when the host is not Fedora as well.
That would probably best be done via libosinfo (because for guests without virtio-vga guest drivers we had better not make the switch). It should then be picked up by other distros and projects too.
> It also would be a problem for non wayland guests.
Why? The xorg modesetting driver works just fine with virtio-vga.
--- Comment #14 from Gerd Hoffmann (kraxel@redhat.com) ---
> We could possibly decouple the primary surface from the drm framebuffers, so the drm framebuffers effectively become shadow framebuffers, and every display update becomes a drm framebuffer -> primary surface blit. Not sure whether that scheme can work properly with xorg though. It also has a high chance of causing xorg performance regressions.
Turns out there is an easy way out: shadow dumb framebuffers only.
https://www.kraxel.org/cgit/linux/log/?h=drm-qxl-atomic