On 10/03/16 16:15, Laszlo Ersek wrote:
On 10/03/16 13:34, Emil Velikov wrote:
On 30 September 2016 at 18:26, Laszlo Ersek lersek@redhat.com wrote:
On 09/30/16 18:38, Hans de Goede wrote:
Hi,
On 30-09-16 17:33, Laszlo Ersek wrote:
On 09/30/16 16:59, Hans de Goede wrote:
Hi,
On 30-09-16 16:51, Laszlo Ersek wrote: > On 09/30/16 12:35, Hans de Goede wrote: > >> Attached are 2 patches against the xserver which should fix this, >> please give them a try. > > Sorry about the delay. > > The patches don't seem to fix the issue for me. Please see the Xorg log > attached. > > I tested the patches as follows. Given that my bisection had been done > in a Fedora 24 guest, using > > xorg-x11-server-1.18.4-4.fc24 > http://koji.fedoraproject.org/koji/buildinfo?buildID=794494 > > I now rebuilt the guest kernel exactly at the failing commit (a325725 > "drm: Lobotomize set_busid nonsense for !pci drivers"), and first > reproduced the issue with the above X server. > > Then, I ported your patches to "xorg-server-1.18.4" (using the upstream > xserver tree), and rebuilt the Fedora package with the backport. For > the > backport, I had to cherry-pick the following two patches from master > first: > > 1 ca8d88e50310 xfree86: recognize primary BUS_PCI device in > xf86IsPrimaryPlatform() > 2 ea91db4b8331 config: fix GPUDevice fail when AutoAddGPU off + BusID > > This way your patches applied cleanly. (Cherry pick #1 above is > actually > necessary for semantics, while cherry pick #2 is needed for a clean > context only, and has no impact for this test.) > > That is, in total, I added the following four patches to the Fedora 24 > package: > > 1 xfree86: recognize primary BUS_PCI device in xf86IsPrimaryPlatform() > 2 config: fix GPUDevice fail when AutoAddGPU off + BusID > 3 xfree86: Make adding unclaimed devices as GPU devices a separate step > 4 xfree86: Try harder to find atleast 1 non GPU Screen > > You can find the scratch build that I used for testing here: > > xorg-x11-server-1.18.4-4.hans_bz1366842_2.fc24 > http://koji.fedoraproject.org/koji/taskinfo?taskID=15875087 > > Another reason I used F24's X server as basis, rather than upstream > HEAD, is that Fedora 24 is pretty young, and it's already on kernel > 4.7.4, and I believe it will soon move to kernel 4.8, without > (necessarily) rebasing its X server package to upstream. IOW the kernel > upgrade to 4.8 will break X in Fedora 24 too, and then I expect the > Fedora X maintainers would have to cherry pick those two patches as > dependencies just the same. > > To summarize, the patches don't seem to help. I shall nonetheless thank > you for spending your Friday on this!
Hmm, do you have a xorg.conf file lying around somewhere, the message about the xserver not being able to find an entry for screen 0 does not make sense ...
Good catch, I actually had two files under "/etc/X11/xorg.conf.d/":
- "00-keyboard.conf", from package "systemd-229-13.fc24.x86_64", with
contents
# Read and parsed by systemd-localed. It's probably wise not to edit this file # manually too freely. Section "InputClass" Identifier "system-keyboard" MatchIsKeyboard "on" Option "XkbLayout" "us" EndSection
- "01-resolution.conf", which I had created, in order to set the
preferred display resolution:
Section "Screen" Identifier "Default Screen" Device "Default Device" Monitor "Default Monitor" EndSection
Section "Device" Identifier "Default Device" Driver "modesetting" EndSection
Section "Monitor" Identifier "Default Monitor" Option "PreferredMode" "640x480" # Option "PreferredMode" "1440x900" EndSection
I removed these files now, and repeated the test. Again, the X server wouldn't start, but I think the log file looks a bit different now. Attached.
Ah, ok so it seems that my initial analysis is wrong, the problem is not a re-occuring of the device getting identified as a GPU screen, libdrm sorta depends on bus-ids and the lack of one is causing the server to misbehave. I guess that even with a xorg.conf things will fail with the troublesome kernel version (might be worth trying).
Emil's analysis seems to be spot on. This does not seem easily fixable in userspace / does seem like a real regression as it even breaks things when specifying the device through xorg.conf (I or so I believe) which is something which uses to work ...
In order to check this hypothesis, I did the following:
- I downgraded my xorg-x11-server installation to the most recent
official F24 packages, that is, "1.18.4-4.fc24",
- I kept the kernel that I built exactly at the regressive commit
(a325725633c2)
- I modified "01-resolution.conf" (see it above in the context) like this:
Section "Device" Identifier "Default Device" Driver "modesetting" BusID "PCI:00:02:0" <------------ new option added EndSection
where BusID matches the B/D/F of the virtio-vga device from "lspci".
This setup (modulo the kernel of course) was known to work, but now the X server actually segfaults (apparently in the xf86PlatformDeviceCheckBusID() function). Please find the logfile attached.
(NB: this is unrelated to upstream commit de9ce6757c2e -- which the pristine FC24 build lacks -- because I don't set AutoAddGPU to "off" -- it is left at its default "on" value.)
Where is this upstream commit again - it shows as unknown for the kernel, xserver and libdrm ?
So my theory was a bit off - SetVersion is the one responsible to set the "BusID", as retrieved by drmGetBusID, regardless if drmOpen or open is used.
Here's a bit of a brain dump from the other day:
- The commit mentioned 'affects' the drmSetBusid/DRM_IOCTL_SET_UNIQUE
userspace codepaths.
- The latter itself is dri1/legacy (xserver hw/xfree86/dri/) which is
not functional for platform devices. The latter of which seems to be the case for virt-gpu based on the kernel module.
- The modesetting driver should/cannot reach the above xserver codepath
That said, it seems that (at least some) userspace expects a PCI device despite the kernel module 'advertising' itself as platform one :-\
With kernel commit a325725633c2 applied, the drmGetBusid() call in get_drm_info() [hw/xfree86/os-support/linux/lnx_platform.c] returns "virtio0".
Without kernel commit a325725633c2 in place, the same function call produces "pci:0000:00:02.0".
I think that the idea of kernel commit a325725633c2:
We already have a fallback in place to fill out the unique from dev->unique, which is set to something reasonable in drm_dev_alloc.
is inappropriate for virtio devices.
While it is true that "virtioN" is unique across the guest system, those identifiers are not stable.
# ls -1d /sys/devices/pci*/*/virtio* /sys/devices/pci0000:00/0000:00:02.0/virtio0 /sys/devices/pci0000:00/0000:00:03.0/virtio1 /sys/devices/pci0000:00/0000:00:05.0/virtio2 /sys/devices/pci0000:00/0000:00:06.0/virtio3 /sys/devices/pci0000:00/0000:00:09.0/virtio4
If you tweak the PCI addresses of other virtio devices on the QEMU command line, while keeping the PCI address of the virtio-vga device intact, the "virtioN" identifier of the virtio-vga device may change in an unspecified / unexpected manner.
From the above listing, it seems like higher PCI Device numbers happen to end up with higher "virtioN" identifiers. While this is unspecified / non-guaranteed behavior, in practice it means the following:
Moving the "virtio1" device from PCI address 00:03.0 to 00:01.0, while preserving the PCI address of virtio-vga as 00:02.0, will bump virtio-vga to "virtio1" from "virtio0" (and sink that other device from "virtio1" to "virtio0").
I think that unstable identifiers don't qualify for BusID use, regardless of the actual format of the IDs in question.
Searching the patch quickly, drm_virtio_set_busid() is the only implementation of the "drm_driver.set_busid" callback that delegates the task to drm_pci_set_busid(). All other implementations of this callback were provided by drm_platform_set_busid().
drm_platform_set_busid() would ultimately format "dev->platformdev->name" and "dev->platformdev->id" into the bus identifier. Replacing that common logic with the drm_dev_alloc() fallback that is already in place is probably justified: judging from "virtio0", the fallback likely captures the exact same information, just with a different format.
However, drm_virtio_set_busid() used to implement a different logic (encoding different information). It intentionally used stable PCI addresses rather than (platformdev->name, platformdev->id).
So, I think that removing drm_virtio_set_busid() was an actual regression. I propose that the related hunks of a325725633c2 be reverted. For virtio-vga and virtio-gpu-pci, "virtioN" is an unstable and inappropraite BusID pattern.
In support of my argument, please see the following remark in commit message:
v4: Drop accidental amdgpu hunk (Emil).
Now, if you look up the v3 posting for that amdgpu hunk:
https://lists.freedesktop.org/archives/dri-devel/2016-June/111200.html https://lists.freedesktop.org/archives/dri-devel/2016-June/111486.html
the v3 patch accidentally removed the similarly customized set_busid callback for amdgpu -- drm_pci_set_busid(). Emil caught that error in review, hence the v4 patch wouldn't contain the same error.
My argument is exactly the same -- just as the amdgpu hunk shouldn't be in the patch, the drm_virtio_set_busid() hunk shouldn't be there either. Both amdgpu and virtio-gpu implement a set_busid callback -- by delegating to drm_pci_set_busid() -- that *cannot* be replaced with the fallback in drm_dev_alloc().
Please revert the drm_virtio_set_busid() part of kernel commit a325725633c2; at this point I'm 95% sure it was an oversight that didn't get caught in review.
The rest of kernel commit a325725633c2 looks sane to me; while it changes the formatting of the busid for the affected platform devices, relying on the fallback in those cases actually preserves the same information content. (This is not true for amdgpu and virtio-gpu.)
Thanks Laszlo