Reuse framebuffer after a kexec (amdgpu / efifb)


Guilherme G. Piccoli

9 Dec 2021

4 p.m.

Hi all, I have a question about the possibility of reusing a framebuffer after a regular (or panic) kexec. My case is with amdgpu (an APU, i.e., not a separate GPU), but I guess the question is fairly generic, hence I've looped in most of the lists / people I think make sense (apologies for duplicates).

The context is: we have hardware with an amdgpu-controlled device (Vangogh model), and as soon as the machine boots, efifb provides graphics - I understand the UEFI/GRUB output relies on the EFI framebuffer as well. As soon as the amdgpu module is available, the kernel loads it and it takes over the GPU, providing graphics. The kexec_file_load syscall allows passing a valid screen_info structure, so when kexec'ing a new kernel, efifb again takes over at boot time, but this time I see nothing on the screen. I've manually blacklisted amdgpu in this new kexec'ed kernel, as I'd like to rely on the simple framebuffer - the goal is to have a tiny kernel kexec'ed. I'm using kernel version 5.16.0-rc4.

I've done some other experiments, for example: I've forced the screen_info model to match VLFB, so vesafb took over after the kexec, with the same result. I also noticed that the BusMaster bit was off after kexec in the AMD APU PCIe device, so I set it in efifb before probe, and finally I tested the same things in qemu, with qxl, all with the same result (blank screen). The most interesting result I got (both with amdgpu and qemu/qxl) is that if I blacklist these drivers and let the machine continue using efifb from the beginning, efifb is still able to produce graphics after kexec.
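
For readers who want to check the BusMaster state mentioned here from userspace, the following is a minimal sketch and not something taken from the original message. It relies only on the standard PCI configuration-space layout, where the 16-bit Command register sits at offset 0x04 and bit 2 is Bus Master Enable. The helper names are hypothetical, and the sysfs path in the usage note is an assumption about a typical Linux system.

```python
import struct

PCI_COMMAND_OFFSET = 0x04    # 16-bit Command register in PCI config space
PCI_COMMAND_MASTER = 0x0004  # bit 2: Bus Master Enable


def read_command(config: bytes) -> int:
    """Return the little-endian 16-bit Command register from config-space bytes."""
    if len(config) < PCI_COMMAND_OFFSET + 2:
        raise ValueError("config space too short")
    (command,) = struct.unpack_from("<H", config, PCI_COMMAND_OFFSET)
    return command


def bus_master_enabled(config: bytes) -> bool:
    """True if the Bus Master Enable bit is set in the Command register."""
    return bool(read_command(config) & PCI_COMMAND_MASTER)


def with_bus_master(config: bytes) -> bytes:
    """Return a copy of the config bytes with Bus Master Enable set (hypothetical helper)."""
    buf = bytearray(config)
    struct.pack_into("<H", buf, PCI_COMMAND_OFFSET,
                     read_command(config) | PCI_COMMAND_MASTER)
    return bytes(buf)
```

On Linux the input bytes would typically come from reading `/sys/bus/pci/devices/<address>/config` in binary mode. As the replies in this thread explain, this bit on its own does not bring scan-out back once the GPU driver has reprogrammed the display hardware.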

Which then led me to think that there's likely something fundamentally "blocking" the reuse of the simple framebuffer after kexec, like maybe the DRM stack is destroying the old framebuffer somehow? What kind of preparation is required at the firmware level to make the simple EFI VGA framebuffer work, and could we perform this in a kexec (or "save it" before the amdgpu/qxl drivers take over and reuse it later)?

Any advice is greatly appreciated! Thanks in advance,

Guilherme

Christian König

9 Dec

4:09 p.m.

Hi Guilherme,

On 09.12.21 at 17:00, Guilherme G. Piccoli wrote:

...

Unfortunately, what you are trying here will most likely not work easily.

During bootup the ASIC is initialized in a VGA compatibility mode by the VBIOS which also allows efifb to display something. And among the first things amdgpu does is to disable this compatibility mode :)

What you need to do to get this working again is to issue a PCIe reset of the GPU and then re-init the ASIC with the VBIOS tables.

Alex should know more details about how to do this.

Regards, Christian.

Alex Deucher

5:31 p.m.

On Thu, Dec 9, 2021 at 12:04 PM Guilherme G. Piccoli gpiccoli@igalia.com wrote:

...

Once the driver takes over, none of the pre-driver state is retained. You'll need to load the driver in the new kernel to initialize the displays. Note the efifb doesn't actually have the ability to program any hardware, it just takes over the memory region that was used for the pre-OS framebuffer and whatever display timing was set up by the GOP driver prior to the OS loading. Once that OS driver has loaded the area is gone and the display configuration may have changed.

Alex

Guilherme G. Piccoli

5:59 p.m.

On 09/12/2021 14:31, Alex Deucher wrote:

...

Hi Christian and Alex, thanks for the clarifications!

Is there any way to save/retain this state before amdgpu takes over? Would simpledrm be able to program the device again, to a working state?

Finally, do you have any example of such a GOP driver (open source) so I can take a look? I tried to find something like that in the Tianocore project, but didn't find anything that seemed useful for my issue.

Thanks again!

Alex Deucher

6:06 p.m.

On Thu, Dec 9, 2021 at 1:00 PM Guilherme G. Piccoli gpiccoli@igalia.com wrote:

> Is there any way to save/retain this state before amdgpu takes over?

Not really in a generic way. It's asic and platform specific. In addition most modern displays require link training to bring up the display, so you can't just save and restore registers.

> Would simpledrm be able to program the device again, to a working state?

No. You need an asic specific driver that knows how to program the specific hardware. It's also platform specific in that you need to determine platform specific details such as the number and type of display connectors and encoders that are present on the system.

> Finally, do you have any example of such a GOP driver (open source) so I can take a look? I tried to find something like that in the Tianocore project, but didn't find anything that seemed useful for my issue.

The drivers are asic and platform specific. E.g., the driver for vangogh is different from renoir is different from skylake, etc. The display programming interfaces are asic specific.

Alex

Guilherme G. Piccoli

6:17 p.m.

Thanks again Alex! Some comments inlined below:

On 09/12/2021 15:06, Alex Deucher wrote:

> Not really in a generic way. It's asic and platform specific. In addition most modern displays require link training to bring up the display, so you can't just save and restore registers.

Oh sure, I understand that. My question is more like: is there a way, inside the amdgpu driver, to save this state before taking over/overwriting/reprogramming the device? So we could (again, from inside the amdgpu driver) dump this pre-saved state in the shutdown handler, for example, having the device in a "pre-OS" state when the new kexec'ed kernel starts.

> The drivers are asic and platform specific. E.g., the driver for vangogh is different from renoir is different from skylake, etc. The display programming interfaces are asic specific.

Cool, that makes sense! But if you (or anybody here) know some of these GOP drivers, e.g. for the qemu/qxl device, I'm just curious to see/understand how complex the FW driver is to just put the device/screen in a usable state.

Cheers,

Guilherme

Alex Deucher

7:20 p.m.

On Thu, Dec 9, 2021 at 1:18 PM Guilherme G. Piccoli gpiccoli@igalia.com wrote:

> Oh sure, I understand that. My question is more like: is there a way, inside the amdgpu driver, to save this state before taking over/overwriting/reprogramming the device? So we could (again, from inside the amdgpu driver) dump this pre-saved state in the shutdown handler, for example, having the device in a "pre-OS" state when the new kexec'ed kernel starts.

Sure, it could be done; it's just a fair amount of work. Things like legacy VGA text mode are a bit more of a challenge, but that tends to be less relevant as non-legacy UEFI becomes more pervasive.

> Cool, that makes sense! But if you (or anybody here) know some of these GOP drivers, e.g. for the qemu/qxl device, I'm just curious to see/understand how complex the FW driver is to just put the device/screen in a usable state.

Most of the asic init and display setup on AMD GPUs is handled via atombios command tables (basically little scripts stored in the vbios), which are shared by the driver and the GOP driver for most programming sequences. In our case, the GOP driver is pretty simple. Take a look at the pre-DC display code in amdgpu to see what a basic display driver would look like (e.g., dce_v11_0.c). The GOP driver would call the atombios asic_init table to make sure the chip itself is initialized (e.g., memory controller, etc.), then walk the display data tables in the vbios to determine the display configuration specific to this board, then probe the displays and use the atombios display command tables to light them up.

Alex

Gerd Hoffmann

10 Dec

7:19 a.m.

Hi,

> Cool, that makes sense! But if you (or anybody here) know some of these GOP drivers, e.g. for the qemu/qxl device,

OvmfPkg/QemuVideoDxe in tianocore source tree.

> I'm just curious to see/understand how complex the FW driver is to just put the device/screen in a usable state.

Note that qemu has a paravirtual interface for vesa vga mode programming where you basically program a handful of registers with xres, yres, depth etc. (after resetting the device to put it into vga compatibility mode) and you are done.

Initializing physical hardware is an order of magnitude harder than that.

With qxl you could also figure out the current state of the hardware and fill screen_info with that to get a working boot framebuffer in the kexec'ed kernel.

The problem with this approach is that it works only if the framebuffer happens to be in a format usable by vesafb/efifb: no modifiers (tiling etc.) and contiguous in physical address space. That is true for qxl. With virtio-gpu it wouldn't work though (the framebuffer can be scattered), and I expect it wouldn't work with most modern physical hardware either.
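
The two conditions named here (no format modifier, physically contiguous) can be illustrated with a small sketch. This is not code from the thread or from any driver: the function names, the per-page address list, and the 4 KiB page size are assumptions made for illustration. The only outside fact used is that the linear DRM format modifier has the value 0.

```python
from typing import Sequence

PAGE_SIZE = 4096           # assumption: 4 KiB pages
DRM_FORMAT_MOD_LINEAR = 0  # linear layout, i.e. no tiling or compression modifier


def is_contiguous(page_addrs: Sequence[int]) -> bool:
    """True if each backing page directly follows the previous one in physical memory."""
    return all(b - a == PAGE_SIZE for a, b in zip(page_addrs, page_addrs[1:]))


def usable_as_boot_fb(modifier: int, page_addrs: Sequence[int]) -> bool:
    """Hypothetical check: could a plain linear-framebuffer driver scan this buffer?

    `page_addrs` holds the physical address of every page backing the buffer.
    """
    return (len(page_addrs) > 0
            and modifier == DRM_FORMAT_MOD_LINEAR
            and is_contiguous(page_addrs))
```

A qxl-style buffer would pass this check, while a scattered virtio-gpu buffer or a tiled buffer on physical hardware would fail it, which matches the distinction drawn above.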

take care, Gerd

Thomas Zimmermann

8:24 a.m.

On 09.12.21 at 19:17, Guilherme G. Piccoli wrote:

> Oh sure, I understand that. My question is more like: is there a way, inside the amdgpu driver, to save this state before taking over/overwriting/reprogramming the device? So we could (again, from inside the amdgpu driver) dump this pre-saved state in the shutdown handler, for example, having the device in a "pre-OS" state when the new kexec'ed kernel starts.

We have been talking about reading out and storing the state of active devices within DRM. So far nothing usable has emerged. In a distant future, kexec might be able to store information about the active framebuffer, and the new kernel's simpledrm (or some other driver) could use it as output.

But don't hold your breath for it. It won't happen anytime soon.

Best regards Thomas

-- Thomas Zimmermann Graphics Driver Developer SUSE Software Solutions Germany GmbH Maxfeldstr. 5, 90409 Nürnberg, Germany (HRB 36809, AG Nürnberg) Geschäftsführer: Ivo Totev

Guilherme G. Piccoli

2:08 p.m.

Thanks a lot Alex / Gerd and Thomas, very informative stuff! I'm glad there are projects to collect/save the data and reuse after a kdump, this is very useful.

I'll continue my study on the atombios thing of AMD and QXL, maybe at least we can make it work in qemu, that'd be great (like a small initdriver to reprogram the paravirtual device on kexec boot).

Cheers,

Guilherme

Alex Deucher

2:16 p.m.

On Fri, Dec 10, 2021 at 9:09 AM Guilherme G. Piccoli gpiccoli@igalia.com wrote:

> I'll continue my study on the atombios thing of AMD and QXL, maybe at least we can make it work in qemu, that'd be great (like a small initdriver to reprogram the paravirtual device on kexec boot).

Why not just reload the driver after kexec?

Alex

Guilherme G. Piccoli

2:25 p.m.

On 10/12/2021 11:16, Alex Deucher wrote:

> [...]
> Why not just reload the driver after kexec?

Because the original issue is the kdump case, and we want a very very tiny kernel - also, the crash originally could have been caused by amdgpu itself, so if it's a GPU issue, we don't want to mess with that in kdump. And I confess I tried modprobe amdgpu after a kdump, no success - kdump won't call shutdown handlers, so GPU will be in a "rogue" state...

My question was about regular kexec because it's much simpler usually, we can do whatever we want there. My line of thought was: if I make it work in regular kexec with a simple framebuffer, I might be able to get it working on kdump heheh

Christian König

3:13 p.m.

On 10.12.21 at 15:25, Guilherme G. Piccoli wrote:

...

How about issuing a PCIe reset and re-initializing the ASIC with just the VBIOS?

That should be pretty straightforward I think.

Christian.

Guilherme G. Piccoli

3:24 p.m.

On 10/12/2021 12:13, Christian König wrote:

> [...] How about issuing a PCIe reset and re-initializing the ASIC with just the VBIOS?
>
> That should be pretty straightforward I think.

Thanks Christian, that'd be perfect! Is it feasible? Per Alex's comment, we'd need to run atombios commands to reprogram the timings, display info, etc... like a small driver would do, a full init.

Also, what kind of PCIe reset is recommended for this adapter? Like a hot reset, powering-off/re-power, FLR or that MODE2 reset present in amdgpu code? Remembering this is an APU device.

Thanks a lot!

Christian König

3:32 p.m.

On 10.12.21 at 16:24, Guilherme G. Piccoli wrote:

> Also, what kind of PCIe reset is recommended for this adapter? Like a hot reset, powering-off/re-power, FLR or that MODE2 reset present in amdgpu code? Remembering this is an APU device.

Well, Alex is the expert on that.

An APU makes the whole thing pretty tricky, since the VBIOS is part of the system BIOS there and I'm not sure you can re-initialize only the GPU without a complete reset.

On dGPUs just making sure the ROM is mapped and calling the VESA modeset BIOS functions might already do the trick.

Christian.

Alex Deucher

7:11 p.m.

On Fri, Dec 10, 2021 at 10:24 AM Guilherme G. Piccoli gpiccoli@igalia.com wrote:

> Thanks Christian, that'd be perfect! Is it feasible? Per Alex's comment, we'd need to run atombios commands to reprogram the timings, display info, etc... like a small driver would do, a full init.

You need the equivalent of a GOP driver or a full GPU driver. I think it would be less effort to just fix up any problems amdgpu has when trying to load after the crash than to write a new mini driver. By the time you add everything you'd need, you'd be pretty close to a full GPU driver.

> Also, what kind of PCIe reset is recommended for this adapter? Like a hot reset, powering-off/re-power, FLR or that MODE2 reset present in amdgpu code? Remembering this is an APU device.

You'd need to issue the relevant device specific reset sequence. It would be a mode2 reset on vangogh, but varies on other asics. It would probably be easiest to just fix up the logic in amdgpu to detect bad GPU state on driver load and do a GPU reset before driver init. We already have the logic in place for some dGPUs, but APUs only recently got full GPU reset support due to architectural limitations and hardware bugs.

Alex

Felix Kuehling

11 Dec

12:54 a.m.

On 2021-12-10 10:13 a.m., Christian König wrote:

> How about issuing a PCIe reset and re-initializing the ASIC with just the VBIOS?
>
> That should be pretty straightforward I think.

Do you actually need to restore the exact boot-up mode? If you have the same framebuffer memory layout (width, height, bpp, stride) the precise display timing doesn't really matter. So we "just" need to switch to a mode that's compatible with the efifb framebuffer parameters and point the display engine at the efifb as the scan-out buffer.
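
To make the "same framebuffer memory layout" condition concrete, here is a small sketch under stated assumptions. A layout is modeled as a plain `(width, height, bpp, stride)` tuple, loosely corresponding to the linear-framebuffer fields carried in screen_info; the function names are hypothetical and do not come from amdgpu or efifb.

```python
from typing import Tuple

# (width in pixels, height in pixels, bits per pixel, stride in bytes per line)
Layout = Tuple[int, int, int, int]


def min_stride(width: int, bpp: int) -> int:
    """Smallest stride in bytes that can hold one scan line."""
    return (width * bpp + 7) // 8


def layout_is_valid(layout: Layout) -> bool:
    """Sanity-check a layout: positive dimensions and a stride that fits a full line."""
    width, height, bpp, stride = layout
    return width > 0 and height > 0 and bpp > 0 and stride >= min_stride(width, bpp)


def fb_size_bytes(layout: Layout) -> int:
    """Bytes of memory covered by the scan-out buffer."""
    _, height, _, stride = layout
    return stride * height


def scanout_compatible(boot: Layout, mode: Layout) -> bool:
    """True if `mode` scans out memory exactly the way the boot framebuffer describes it.

    Display timings are deliberately not part of the comparison.
    """
    return layout_is_valid(boot) and boot == mode
```

For example, a 1280x800 mode at 32 bpp needs a stride of at least 5120 bytes, and a mode with a padded stride of 5376 bytes would not be compatible with a boot framebuffer recorded with a 5120-byte stride, even at the same resolution.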

Regards, Felix

Gerd Hoffmann

9:20 a.m.

On Fri, Dec 10, 2021 at 07:54:34PM -0500, Felix Kuehling wrote:

> Do you actually need to restore the exact boot-up mode? If you have the same framebuffer memory layout (width, height, bpp, stride) the precise display timing doesn't really matter. So we "just" need to switch to a mode that's compatible with the efifb framebuffer parameters and point the display engine at the efifb as the scan-out buffer.

That'll probably be doable for a normal kexec, but in the case of a crashdump kexec I don't think it is a good idea to touch the gpu using the driver of the kernel which just crashed ...

take care, Gerd

Alex Deucher

10 Dec

7:05 p.m.

On Fri, Dec 10, 2021 at 9:25 AM Guilherme G. Piccoli gpiccoli@igalia.com wrote:

...

Well if the GPU is hung, I'm not sure if you'll be able to get back the display environment without a GPU reset and once you do that, you've lost any state you might have been trying to preserve.

Alex

dri-devel@lists.freedesktop.org

participants (6)

Alex Deucher
Christian König
Felix Kuehling
Gerd Hoffmann
Guilherme G. Piccoli
Thomas Zimmermann