[PATCH 0/3] drm/radeon kexec fixes

List overview All Threads
Download

newer

older

[PATCH/RFC v3 00/19] Common...

[GIT PULL]: dma-buf updates for...

Markus Trippelsdorf

8 Sep 2013 8 Sep '13

12:09 p.m.

Here are a couple of patches that get kexec working with radeon devices. I've tested this on my RS780. Comments or flames are welcome. Thanks.

Markus Trippelsdorf (3): kexec: get rid of late printk drm/radeon: Implement radeon_pci_shutdown drm/radeon: get rid of r100_restore_sanity hack

-- Markus

Show replies by date

Markus Trippelsdorf

8 Sep 8 Sep

12:10 p.m.

New subject: [PATCH 1/3] kexec: get rid of late printk

kexec calls: printk(KERN_EMERG "Starting new kernel\n"); late before calling machine_shutdown(). However at this point the underlying fb device may have already been shutdown. This causes the kernel to hang. Fix by simply getting rid of the printk call.

Signed-off-by: Markus Trippelsdorf markus@trippelsdorf.de --- kernel/kexec.c | 1 - 1 file changed, 1 deletion(-)

diff --git a/kernel/kexec.c b/kernel/kexec.c index 59f7b55..f33fa9f 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -1679,7 +1679,6 @@ int kernel_kexec(void) #endif { kernel_restart_prepare(NULL); - printk(KERN_EMERG "Starting new kernel\n"); machine_shutdown(); }

-- Markus

Daniel Vetter

8:11 p.m.

New subject: [PATCH 1/3] kexec: get rid of late printk

On Sun, Sep 8, 2013 at 2:10 PM, Markus Trippelsdorf markus@trippelsdorf.de wrote:

...

kexec calls: printk(KERN_EMERG "Starting new kernel\n"); late before calling machine_shutdown(). However at this point the underlying fb device may have already been shutdown. This causes the kernel to hang. Fix by simply getting rid of the printk call.

Signed-off-by: Markus Trippelsdorf markus@trippelsdorf.de

Shouldn't this be taken care of with the suspend/resume_console calls? At least that's my understanding how it works in the suspend/hibernate code, maybe kexec needs similar treatment ... -Daniel

-- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch

Bruno Prémont

8:42 p.m.

New subject: [PATCH 1/3] kexec: get rid of late printk

On Sun, 08 September 2013 Daniel Vetter daniel@ffwll.ch wrote:

...

On Sun, Sep 8, 2013 at 2:10 PM, Markus Trippelsdorf markus@trippelsdorf.de wrote:

...
kexec calls: printk(KERN_EMERG "Starting new kernel\n"); late before calling machine_shutdown(). However at this point the underlying fb device may have already been shutdown. This causes the kernel to hang. Fix by simply getting rid of the printk call.

Signed-off-by: Markus Trippelsdorf markus@trippelsdorf.de

Shouldn't this be taken care of with the suspend/resume_console calls? At least that's my understanding how it works in the suspend/hibernate code, maybe kexec needs similar treatment ...

Is it suspend/resume_console? Shouldn't the fbcon be short-circuited or disabled once there is no more underlying fb? Serial console, if present, as well as netconsole if network device is still alive should continue working I would say.

Bruno

...

-Daniel

Markus Trippelsdorf

12:10 p.m.

New subject: [PATCH 2/3] drm/radeon: Implement radeon_pci_shutdown

Currently radeon devices are not properbly shutdown during kexec. This cases a varity of issues, e.g. dpm initialization failures. Fix this by implementing a radeon_pci_shutdown function, that unloads the driver cleanly.

Signed-off-by: Markus Trippelsdorf markus@trippelsdorf.de --- drivers/gpu/drm/radeon/radeon_drv.c | 10 ++++++++++ 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c index cb4445f..d71edee 100644 --- a/drivers/gpu/drm/radeon/radeon_drv.c +++ b/drivers/gpu/drm/radeon/radeon_drv.c @@ -380,6 +380,15 @@ static const struct file_operations radeon_driver_kms_fops = { #endif };

+ +static void +radeon_pci_shutdown(struct pci_dev *pdev) +{ + struct drm_device *dev = pci_get_drvdata(pdev); + + radeon_driver_unload_kms(dev); +} + static struct drm_driver kms_driver = { .driver_features = DRIVER_USE_AGP | @@ -453,6 +462,7 @@ static struct pci_driver radeon_kms_pci_driver = { .remove = radeon_pci_remove, .suspend = radeon_pci_suspend, .resume = radeon_pci_resume, + .shutdown = radeon_pci_shutdown, };

static int __init radeon_init(void)

-- Markus

Konrad Rzeszutek Wilk

9 Sep 9 Sep

1:32 p.m.

New subject: [PATCH 2/3] drm/radeon: Implement radeon_pci_shutdown

On Sun, Sep 08, 2013 at 02:10:58PM +0200, Markus Trippelsdorf wrote:

...

Currently radeon devices are not properbly shutdown during kexec. This

properly

...

cases a varity of issues, e.g. dpm initialization failures. Fix this by implementing a radeon_pci_shutdown function, that unloads the driver cleanly.

Signed-off-by: Markus Trippelsdorf markus@trippelsdorf.de

drivers/gpu/drm/radeon/radeon_drv.c | 10 ++++++++++ 1 file changed, 10 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon_drv.c b/drivers/gpu/drm/radeon/radeon_drv.c index cb4445f..d71edee 100644 --- a/drivers/gpu/drm/radeon/radeon_drv.c +++ b/drivers/gpu/drm/radeon/radeon_drv.c @@ -380,6 +380,15 @@ static const struct file_operations radeon_driver_kms_fops = { #endif };

+static void +radeon_pci_shutdown(struct pci_dev *pdev) +{

struct drm_device *dev = pci_get_drvdata(pdev);

radeon_driver_unload_kms(dev);

+}

static struct drm_driver kms_driver = { .driver_features = DRIVER_USE_AGP | @@ -453,6 +462,7 @@ static struct pci_driver radeon_kms_pci_driver = { .remove = radeon_pci_remove, .suspend = radeon_pci_suspend, .resume = radeon_pci_resume,

.shutdown = radeon_pci_shutdown,

};

static int __init radeon_init(void)

Markus _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Markus Trippelsdorf

8 Sep 8 Sep

12:11 p.m.

New subject: [PATCH 3/3] drm/radeon: get rid of r100_restore_sanity hack

Now that radeon devices are properly shutdown during kexec, we can get rid of r100_restore_sanity.

Signed-off-by: Markus Trippelsdorf markus@trippelsdorf.de --- drivers/gpu/drm/radeon/r100.c | 27 --------------------------- drivers/gpu/drm/radeon/r300.c | 2 -- drivers/gpu/drm/radeon/r420.c | 2 -- drivers/gpu/drm/radeon/r520.c | 2 -- drivers/gpu/drm/radeon/radeon_asic.h | 1 - drivers/gpu/drm/radeon/rs400.c | 2 -- drivers/gpu/drm/radeon/rs600.c | 2 -- drivers/gpu/drm/radeon/rs690.c | 2 -- drivers/gpu/drm/radeon/rv515.c | 2 -- 9 files changed, 42 deletions(-)

diff --git a/drivers/gpu/drm/radeon/r100.c b/drivers/gpu/drm/radeon/r100.c index 9fc61dd..d53dcd8 100644 --- a/drivers/gpu/drm/radeon/r100.c +++ b/drivers/gpu/drm/radeon/r100.c @@ -3938,31 +3938,6 @@ void r100_fini(struct radeon_device *rdev) rdev->bios = NULL; }

-/* - * Due to how kexec works, it can leave the hw fully initialised when it - * boots the new kernel. However doing our init sequence with the CP and - * WB stuff setup causes GPU hangs on the RN50 at least. So at startup - * do some quick sanity checks and restore sane values to avoid this - * problem. - */ -void r100_restore_sanity(struct radeon_device *rdev) -{ - u32 tmp; - - tmp = RREG32(RADEON_CP_CSQ_CNTL); - if (tmp) { - WREG32(RADEON_CP_CSQ_CNTL, 0); - } - tmp = RREG32(RADEON_CP_RB_CNTL); - if (tmp) { - WREG32(RADEON_CP_RB_CNTL, 0); - } - tmp = RREG32(RADEON_SCRATCH_UMSK); - if (tmp) { - WREG32(RADEON_SCRATCH_UMSK, 0); - } -} - int r100_init(struct radeon_device *rdev) { int r; @@ -3975,8 +3950,6 @@ int r100_init(struct radeon_device *rdev) radeon_scratch_init(rdev); /* Initialize surface registers */ radeon_surface_init(rdev); - /* sanity check some register to avoid hangs like after kexec */ - r100_restore_sanity(rdev); /* TODO: disable VGA need to use VGA request */ /* BIOS*/ if (!radeon_get_bios(rdev)) { diff --git a/drivers/gpu/drm/radeon/r300.c b/drivers/gpu/drm/radeon/r300.c index d8dd269..57ba534 100644 --- a/drivers/gpu/drm/radeon/r300.c +++ b/drivers/gpu/drm/radeon/r300.c @@ -1480,8 +1480,6 @@ int r300_init(struct radeon_device *rdev) /* Initialize surface registers */ radeon_surface_init(rdev); /* TODO: disable VGA need to use VGA request */ - /* restore some register to sane defaults */ - r100_restore_sanity(rdev); /* BIOS*/ if (!radeon_get_bios(rdev)) { if (ASIC_IS_AVIVO(rdev)) diff --git a/drivers/gpu/drm/radeon/r420.c b/drivers/gpu/drm/radeon/r420.c index 4e796ec..9ee3360 100644 --- a/drivers/gpu/drm/radeon/r420.c +++ b/drivers/gpu/drm/radeon/r420.c @@ -371,8 +371,6 @@ int r420_init(struct radeon_device *rdev) /* Initialize surface registers */ radeon_surface_init(rdev); /* TODO: disable VGA need to use VGA request */ - /* restore some register to sane defaults */ - r100_restore_sanity(rdev); /* BIOS*/ if (!radeon_get_bios(rdev)) { if (ASIC_IS_AVIVO(rdev)) diff --git a/drivers/gpu/drm/radeon/r520.c b/drivers/gpu/drm/radeon/r520.c index e1aece7..4709c10 100644 --- a/drivers/gpu/drm/radeon/r520.c +++ b/drivers/gpu/drm/radeon/r520.c @@ -256,8 +256,6 @@ int r520_init(struct radeon_device *rdev) radeon_scratch_init(rdev); /* Initialize surface registers */ radeon_surface_init(rdev); - /* restore some register to sane defaults */ - r100_restore_sanity(rdev); /* TODO: disable VGA need to use VGA request */ /* BIOS*/ if (!radeon_get_bios(rdev)) { diff --git a/drivers/gpu/drm/radeon/radeon_asic.h b/drivers/gpu/drm/radeon/radeon_asic.h index 818bbe6..6eee9e2 100644 --- a/drivers/gpu/drm/radeon/radeon_asic.h +++ b/drivers/gpu/drm/radeon/radeon_asic.h @@ -122,7 +122,6 @@ void r100_mc_resume(struct radeon_device *rdev, struct r100_mc_save *save); void r100_vram_init_sizes(struct radeon_device *rdev); int r100_cp_reset(struct radeon_device *rdev); void r100_vga_render_disable(struct radeon_device *rdev); -void r100_restore_sanity(struct radeon_device *rdev); int r100_cs_track_check_pkt3_indx_buffer(struct radeon_cs_parser *p, struct radeon_cs_packet *pkt, struct radeon_bo *robj); diff --git a/drivers/gpu/drm/radeon/rs400.c b/drivers/gpu/drm/radeon/rs400.c index b8074a8..23bbf89 100644 --- a/drivers/gpu/drm/radeon/rs400.c +++ b/drivers/gpu/drm/radeon/rs400.c @@ -510,8 +510,6 @@ int rs400_init(struct radeon_device *rdev) /* Initialize surface registers */ radeon_surface_init(rdev); /* TODO: disable VGA need to use VGA request */ - /* restore some register to sane defaults */ - r100_restore_sanity(rdev); /* BIOS*/ if (!radeon_get_bios(rdev)) { if (ASIC_IS_AVIVO(rdev)) diff --git a/drivers/gpu/drm/radeon/rs600.c b/drivers/gpu/drm/radeon/rs600.c index 670b555..cfe6f80 100644 --- a/drivers/gpu/drm/radeon/rs600.c +++ b/drivers/gpu/drm/radeon/rs600.c @@ -1018,8 +1018,6 @@ int rs600_init(struct radeon_device *rdev) radeon_scratch_init(rdev); /* Initialize surface registers */ radeon_surface_init(rdev); - /* restore some register to sane defaults */ - r100_restore_sanity(rdev); /* BIOS */ if (!radeon_get_bios(rdev)) { if (ASIC_IS_AVIVO(rdev)) diff --git a/drivers/gpu/drm/radeon/rs690.c b/drivers/gpu/drm/radeon/rs690.c index d8ddfb3..e1f4036 100644 --- a/drivers/gpu/drm/radeon/rs690.c +++ b/drivers/gpu/drm/radeon/rs690.c @@ -789,8 +789,6 @@ int rs690_init(struct radeon_device *rdev) radeon_scratch_init(rdev); /* Initialize surface registers */ radeon_surface_init(rdev); - /* restore some register to sane defaults */ - r100_restore_sanity(rdev); /* TODO: disable VGA need to use VGA request */ /* BIOS*/ if (!radeon_get_bios(rdev)) { diff --git a/drivers/gpu/drm/radeon/rv515.c b/drivers/gpu/drm/radeon/rv515.c index 8ea1573..f9692d7 100644 --- a/drivers/gpu/drm/radeon/rv515.c +++ b/drivers/gpu/drm/radeon/rv515.c @@ -627,8 +627,6 @@ int rv515_init(struct radeon_device *rdev) /* Initialize surface registers */ radeon_surface_init(rdev); /* TODO: disable VGA need to use VGA request */ - /* restore some register to sane defaults */ - r100_restore_sanity(rdev); /* BIOS*/ if (!radeon_get_bios(rdev)) { if (ASIC_IS_AVIVO(rdev))

-- Markus

ebiederm＠xmission.com

9 Sep 9 Sep

12:32 a.m.

Markus Trippelsdorf markus@trippelsdorf.de writes:

...

Here are a couple of patches that get kexec working with radeon devices. I've tested this on my RS780. Comments or flames are welcome. Thanks.

A couple of high level comments.

This looks promising for the usual case.

Removing the printk at the end of the kexec path seems a little dubious, what of other cpus, interrupt handlers, etc. Basically estabilishing a new rule on when printk is allowed seems a little dubious at this point, even if it is a useful debugging trick.

Having a clean shutdown of the radeon definitely seems worth doing, because the cases where we care abouty video are when a person is in front of the system.

I don't know if you want to remove the sanity checks. They seem cheap and safe regardless. Are they expensive or ineffective? Moreover if they work a reasonable amount of the time that means that the kexec on panic case (where we don't shut anything down) can actually use the video, and that in general the driver will be more robust. I don't expect anyone much cares as kexec on panic is mostly used to just write a core file to the network, or the local disk. But if it is easy to keep that case working most of the time, why not.

Eric

...

Markus Trippelsdorf (3): kexec: get rid of late printk drm/radeon: Implement radeon_pci_shutdown drm/radeon: get rid of r100_restore_sanity hack

drivers/gpu/drm/radeon/r100.c | 27 --------------------------- drivers/gpu/drm/radeon/r300.c | 2 -- drivers/gpu/drm/radeon/r420.c | 2 -- drivers/gpu/drm/radeon/r520.c | 2 -- drivers/gpu/drm/radeon/radeon_asic.h | 1 - drivers/gpu/drm/radeon/radeon_drv.c | 10 ++++++++++ drivers/gpu/drm/radeon/rs400.c | 2 -- drivers/gpu/drm/radeon/rs600.c | 2 -- drivers/gpu/drm/radeon/rs690.c | 2 -- drivers/gpu/drm/radeon/rv515.c | 2 -- kernel/kexec.c | 1 - 11 files changed, 10 insertions(+), 43 deletions(-)

Markus Trippelsdorf

9:21 a.m.

On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:

...

Markus Trippelsdorf markus@trippelsdorf.de writes:

...
Here are a couple of patches that get kexec working with radeon devices. I've tested this on my RS780. Comments or flames are welcome. Thanks.

A couple of high level comments.

This looks promising for the usual case.

Removing the printk at the end of the kexec path seems a little dubious, what of other cpus, interrupt handlers, etc. Basically estabilishing a new rule on when printk is allowed seems a little dubious at this point, even if it is a useful debugging trick.

OK. I will drop this patch. It doesn't seem to be necessary, because I cannot reproduce the printk related hang anymore.

...

Having a clean shutdown of the radeon definitely seems worth doing, because the cases where we care abouty video are when a person is in front of the system.

Yes. But please note that even with radeon_pci_shutdown implemented, I still get ring test failures on roughly every eighth kexec boot:

[drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD) radeon 0000:01:05.0: disabling GPU acceleration

That's definitely better than the current state of affairs, with ring test failures on every second boot. But I haven't figured out the reason for these failures yet. It's curious that once a ring test failure occurs, it will reliably fail after each kexec invocation, no matter how often repeated. Only a reboot brings the machine back to normal.

...

I don't know if you want to remove the sanity checks. They seem cheap and safe regardless. Are they expensive or ineffective? Moreover if they work a reasonable amount of the time that means that the kexec on panic case (where we don't shut anything down) can actually use the video, and that in general the driver will be more robust. I don't expect anyone much cares as kexec on panic is mostly used to just write a core file to the network, or the local disk. But if it is easy to keep that case working most of the time, why not.

IIRC Alex said the sanity checks are expensive and boot-time could be improved by dropping them. Maybe he can chime in?

-- Markus

Christian König

9:38 a.m.

Am 09.09.2013 11:21, schrieb Markus Trippelsdorf:

...

On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:

...
Markus Trippelsdorf markus@trippelsdorf.de writes:

...
Here are a couple of patches that get kexec working with radeon devices. I've tested this on my RS780. Comments or flames are welcome. Thanks.

A couple of high level comments.

This looks promising for the usual case.

Removing the printk at the end of the kexec path seems a little dubious, what of other cpus, interrupt handlers, etc. Basically estabilishing a new rule on when printk is allowed seems a little dubious at this point, even if it is a useful debugging trick.

OK. I will drop this patch. It doesn't seem to be necessary, because I cannot reproduce the printk related hang anymore.

...
Having a clean shutdown of the radeon definitely seems worth doing, because the cases where we care abouty video are when a person is in front of the system.

Yes. But please note that even with radeon_pci_shutdown implemented, I still get ring test failures on roughly every eighth kexec boot:

[drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD) radeon 0000:01:05.0: disabling GPU acceleration

That's definitely better than the current state of affairs, with ring test failures on every second boot. But I haven't figured out the reason for these failures yet. It's curious that once a ring test failure occurs, it will reliably fail after each kexec invocation, no matter how often repeated. Only a reboot brings the machine back to normal.

The main problem here is that the AMD gfx hardware doesn't really support being reinitialized once booted (for various reasons). It's a (intended) limitation of the hardware design that you can only initialize certain blocks once every power cycle, so the whole approach actually will never work 100% reliable.

All you can hope for is that stopping the hardware while shutting down the old kernel and starting it again results in exactly the same hardware parameters (offsets, clock etc...) otherwise starting the blocks will just fail and you end up with disabled acceleration like above.

Sorry, but there isn't much we can do about this, Christian.

...

...
I don't know if you want to remove the sanity checks. They seem cheap and safe regardless. Are they expensive or ineffective? Moreover if they work a reasonable amount of the time that means that the kexec on panic case (where we don't shut anything down) can actually use the video, and that in general the driver will be more robust. I don't expect anyone much cares as kexec on panic is mostly used to just write a core file to the network, or the local disk. But if it is easy to keep that case working most of the time, why not.

IIRC Alex said the sanity checks are expensive and boot-time could be improved by dropping them. Maybe he can chime in?

Markus Trippelsdorf

11 Sep 11 Sep

9:01 a.m.

On 2013.09.09 at 11:38 +0200, Christian König wrote:

...

Am 09.09.2013 11:21, schrieb Markus Trippelsdorf:

...
On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:

...
Markus Trippelsdorf markus@trippelsdorf.de writes:

...
Here are a couple of patches that get kexec working with radeon devices. I've tested this on my RS780. Comments or flames are welcome. Thanks.

A couple of high level comments.

This looks promising for the usual case.

Removing the printk at the end of the kexec path seems a little dubious, what of other cpus, interrupt handlers, etc. Basically estabilishing a new rule on when printk is allowed seems a little dubious at this point, even if it is a useful debugging trick.

OK. I will drop this patch. It doesn't seem to be necessary, because I cannot reproduce the printk related hang anymore.

...
Having a clean shutdown of the radeon definitely seems worth doing, because the cases where we care abouty video are when a person is in front of the system.

Yes. But please note that even with radeon_pci_shutdown implemented, I still get ring test failures on roughly every eighth kexec boot:

[drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD) radeon 0000:01:05.0: disabling GPU acceleration

That's definitely better than the current state of affairs, with ring test failures on every second boot. But I haven't figured out the reason for these failures yet. It's curious that once a ring test failure occurs, it will reliably fail after each kexec invocation, no matter how often repeated. Only a reboot brings the machine back to normal.

The main problem here is that the AMD gfx hardware doesn't really support being reinitialized once booted (for various reasons). It's a (intended) limitation of the hardware design that you can only initialize certain blocks once every power cycle, so the whole approach actually will never work 100% reliable.

All you can hope for is that stopping the hardware while shutting down the old kernel and starting it again results in exactly the same hardware parameters (offsets, clock etc...) otherwise starting the blocks will just fail and you end up with disabled acceleration like above.

Sorry, but there isn't much we can do about this,

I've tested this further and it turned out that if I revert commit f5d9b7f0f9 on top of my "drm/radeon: Implement radeon_pci_shutdown" patch, the initialization failures seem to go away completely.

Any idea what's going on?

Here's the patch:

diff --git a/drivers/gpu/drm/radeon/r600_dpm.c b/drivers/gpu/drm/radeon/r600_dpm.c index fa0de46..4e8c1988 100644 --- a/drivers/gpu/drm/radeon/r600_dpm.c +++ b/drivers/gpu/drm/radeon/r600_dpm.c @@ -296,9 +296,9 @@ bool r600_dynamicpm_enabled(struct radeon_device *rdev) void r600_enable_sclk_control(struct radeon_device *rdev, bool enable) { if (enable) - WREG32_P(SCLK_PWRMGT_CNTL, 0, ~SCLK_PWRMGT_OFF); + WREG32_P(GENERAL_PWRMGT, 0, ~SCLK_PWRMGT_OFF); else - WREG32_P(SCLK_PWRMGT_CNTL, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF); + WREG32_P(GENERAL_PWRMGT, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF); }

void r600_enable_mclk_control(struct radeon_device *rdev, bool enable)

-- Markus

Christian König

9:10 a.m.

Am 11.09.2013 11:01, schrieb Markus Trippelsdorf:

...

On 2013.09.09 at 11:38 +0200, Christian König wrote:

...
Am 09.09.2013 11:21, schrieb Markus Trippelsdorf:

...
On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:

...
Markus Trippelsdorf markus@trippelsdorf.de writes:

...
Here are a couple of patches that get kexec working with radeon devices. I've tested this on my RS780. Comments or flames are welcome. Thanks.

A couple of high level comments.

This looks promising for the usual case.

Removing the printk at the end of the kexec path seems a little dubious, what of other cpus, interrupt handlers, etc. Basically estabilishing a new rule on when printk is allowed seems a little dubious at this point, even if it is a useful debugging trick.

OK. I will drop this patch. It doesn't seem to be necessary, because I cannot reproduce the printk related hang anymore.

...
Having a clean shutdown of the radeon definitely seems worth doing, because the cases where we care abouty video are when a person is in front of the system.

Yes. But please note that even with radeon_pci_shutdown implemented, I still get ring test failures on roughly every eighth kexec boot:

[drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD) radeon 0000:01:05.0: disabling GPU acceleration

That's definitely better than the current state of affairs, with ring test failures on every second boot. But I haven't figured out the reason for these failures yet. It's curious that once a ring test failure occurs, it will reliably fail after each kexec invocation, no matter how often repeated. Only a reboot brings the machine back to normal.

The main problem here is that the AMD gfx hardware doesn't really support being reinitialized once booted (for various reasons). It's a (intended) limitation of the hardware design that you can only initialize certain blocks once every power cycle, so the whole approach actually will never work 100% reliable.

All you can hope for is that stopping the hardware while shutting down the old kernel and starting it again results in exactly the same hardware parameters (offsets, clock etc...) otherwise starting the blocks will just fail and you end up with disabled acceleration like above.

Sorry, but there isn't much we can do about this,

I've tested this further and it turned out that if I revert commit f5d9b7f0f9 on top of my "drm/radeon: Implement radeon_pci_shutdown" patch, the initialization failures seem to go away completely.

Any idea what's going on?

Well DPM is mostly Alex domain, but if I have to guess I would say that the SCLK is gated by the hardware when the driver is unloaded and since DPM initialized only later not ungated when the driver loads again.

...

Here's the patch:

diff --git a/drivers/gpu/drm/radeon/r600_dpm.c b/drivers/gpu/drm/radeon/r600_dpm.c index fa0de46..4e8c1988 100644 --- a/drivers/gpu/drm/radeon/r600_dpm.c +++ b/drivers/gpu/drm/radeon/r600_dpm.c @@ -296,9 +296,9 @@ bool r600_dynamicpm_enabled(struct radeon_device *rdev) void r600_enable_sclk_control(struct radeon_device *rdev, bool enable) { if (enable)
WREG32_P(SCLK_PWRMGT_CNTL, 0, ~SCLK_PWRMGT_OFF);
WREG32_P(GENERAL_PWRMGT, 0, ~SCLK_PWRMGT_OFF);
else
WREG32_P(SCLK_PWRMGT_CNTL, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
WREG32_P(GENERAL_PWRMGT, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
}

void r600_enable_mclk_control(struct radeon_device *rdev, bool enable)

The patch just breaks SCLK gating on R6xx again, so no gain here.

Christian.

Alex Deucher

1:30 p.m.

On Wed, Sep 11, 2013 at 5:01 AM, Markus Trippelsdorf markus@trippelsdorf.de wrote:

...

On 2013.09.09 at 11:38 +0200, Christian König wrote:

...
Am 09.09.2013 11:21, schrieb Markus Trippelsdorf:

...
On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:

...
Markus Trippelsdorf markus@trippelsdorf.de writes:

...
Here are a couple of patches that get kexec working with radeon devices. I've tested this on my RS780. Comments or flames are welcome. Thanks.

A couple of high level comments.

This looks promising for the usual case.

Removing the printk at the end of the kexec path seems a little dubious, what of other cpus, interrupt handlers, etc. Basically estabilishing a new rule on when printk is allowed seems a little dubious at this point, even if it is a useful debugging trick.

OK. I will drop this patch. It doesn't seem to be necessary, because I cannot reproduce the printk related hang anymore.

...
Having a clean shutdown of the radeon definitely seems worth doing, because the cases where we care abouty video are when a person is in front of the system.

Yes. But please note that even with radeon_pci_shutdown implemented, I still get ring test failures on roughly every eighth kexec boot:

[drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD) radeon 0000:01:05.0: disabling GPU acceleration

That's definitely better than the current state of affairs, with ring test failures on every second boot. But I haven't figured out the reason for these failures yet. It's curious that once a ring test failure occurs, it will reliably fail after each kexec invocation, no matter how often repeated. Only a reboot brings the machine back to normal.

The main problem here is that the AMD gfx hardware doesn't really support being reinitialized once booted (for various reasons). It's a (intended) limitation of the hardware design that you can only initialize certain blocks once every power cycle, so the whole approach actually will never work 100% reliable.

All you can hope for is that stopping the hardware while shutting down the old kernel and starting it again results in exactly the same hardware parameters (offsets, clock etc...) otherwise starting the blocks will just fail and you end up with disabled acceleration like above.

Sorry, but there isn't much we can do about this,

I've tested this further and it turned out that if I revert commit f5d9b7f0f9 on top of my "drm/radeon: Implement radeon_pci_shutdown" patch, the initialization failures seem to go away completely.

Any idea what's going on?

You are disabling dynamic power management with that patch reverted. The patch fixed a copy paste typo in the register. Bit 0 (SCLK_PWRMGT_OFF) of register SCLK_PWRMGT_CNTL controls whether dynamic engine clock control is enabled. Bit 0 (GLOBAL_PWRMGT_EN) of register GENERAL_PWRMGT controls whether dynamic power management (dynamic engine/memory/voltage, controls etc.) is enabled at all.

Alex

...

Here's the patch:

diff --git a/drivers/gpu/drm/radeon/r600_dpm.c b/drivers/gpu/drm/radeon/r600_dpm.c index fa0de46..4e8c1988 100644 --- a/drivers/gpu/drm/radeon/r600_dpm.c +++ b/drivers/gpu/drm/radeon/r600_dpm.c @@ -296,9 +296,9 @@ bool r600_dynamicpm_enabled(struct radeon_device *rdev) void r600_enable_sclk_control(struct radeon_device *rdev, bool enable) { if (enable)
          WREG32_P(SCLK_PWRMGT_CNTL, 0, ~SCLK_PWRMGT_OFF);
          WREG32_P(GENERAL_PWRMGT, 0, ~SCLK_PWRMGT_OFF);
  else
          WREG32_P(SCLK_PWRMGT_CNTL, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
          WREG32_P(GENERAL_PWRMGT, SCLK_PWRMGT_OFF, ~SCLK_PWRMGT_OFF);
}

void r600_enable_mclk_control(struct radeon_device *rdev, bool enable)

-- Markus _______________________________________________ dri-devel mailing list dri-devel@lists.freedesktop.org http://lists.freedesktop.org/mailman/listinfo/dri-devel

Alex Deucher

9 Sep 9 Sep

1:04 p.m.

On Mon, Sep 9, 2013 at 5:21 AM, Markus Trippelsdorf markus@trippelsdorf.de wrote:

...

On 2013.09.08 at 17:32 -0700, Eric W. Biederman wrote:

...
Markus Trippelsdorf markus@trippelsdorf.de writes:

...
Here are a couple of patches that get kexec working with radeon devices. I've tested this on my RS780. Comments or flames are welcome. Thanks.

A couple of high level comments.

This looks promising for the usual case.

Removing the printk at the end of the kexec path seems a little dubious, what of other cpus, interrupt handlers, etc. Basically estabilishing a new rule on when printk is allowed seems a little dubious at this point, even if it is a useful debugging trick.

OK. I will drop this patch. It doesn't seem to be necessary, because I cannot reproduce the printk related hang anymore.

...
Having a clean shutdown of the radeon definitely seems worth doing, because the cases where we care abouty video are when a person is in front of the system.

Yes. But please note that even with radeon_pci_shutdown implemented, I still get ring test failures on roughly every eighth kexec boot:

[drm:r600_dma_ring_test] *ERROR* radeon: ring 3 test failed (0xCAFEDEAD) radeon 0000:01:05.0: disabling GPU acceleration

That's definitely better than the current state of affairs, with ring test failures on every second boot. But I haven't figured out the reason for these failures yet. It's curious that once a ring test failure occurs, it will reliably fail after each kexec invocation, no matter how often repeated. Only a reboot brings the machine back to normal.

...
I don't know if you want to remove the sanity checks. They seem cheap and safe regardless. Are they expensive or ineffective? Moreover if they work a reasonable amount of the time that means that the kexec on panic case (where we don't shut anything down) can actually use the video, and that in general the driver will be more robust. I don't expect anyone much cares as kexec on panic is mostly used to just write a core file to the network, or the local disk. But if it is easy to keep that case working most of the time, why not.

IIRC Alex said the sanity checks are expensive and boot-time could be improved by dropping them. Maybe he can chime in?

They shouldn't be necessary with a proper shutdown, but in this particular case, they are not very expensive. What is expensive is having a separate sanity check functions for all the various hw blocks to teardown everything on startup prior to starting it up in case kexec, etc. left the system in a bad state. It ends up amounting to a full tear down sequence followed by a full start up sequence every time you load the driver.

I can't really comment on the first patch, but the rest seem fine.

Alex

ebiederm＠xmission.com

10 Sep 10 Sep

6:27 p.m.

Alex Deucher alexdeucher@gmail.com writes:

...

On Mon, Sep 9, 2013 at 5:21 AM, Markus Trippelsdorf markus@trippelsdorf.de wrote:

...
IIRC Alex said the sanity checks are expensive and boot-time could be improved by dropping them. Maybe he can chime in?

They shouldn't be necessary with a proper shutdown, but in this particular case, they are not very expensive. What is expensive is having a separate sanity check functions for all the various hw blocks to teardown everything on startup prior to starting it up in case kexec, etc. left the system in a bad state. It ends up amounting to a full tear down sequence followed by a full start up sequence every time you load the driver.

I can't really comment on the first patch, but the rest seem fine.

Let me reask the question just a little bit.

Is it the sanity checks that are expensive? Or is it the reinitialization that is triggered by the sanity checks that is expensive?

...

From what Christian said in the other reply it sounds like this is a

game we will never completely win, but it would be nice to have half a chance in the kexec on panic case to have video. So I am curious to know if the checks are expensive when we are coming at hardware in a clean state.

Eric

Alex Deucher

8:40 p.m.

On Tue, Sep 10, 2013 at 2:27 PM, Eric W. Biederman ebiederm@xmission.com wrote:

...

Alex Deucher alexdeucher@gmail.com writes:

...
On Mon, Sep 9, 2013 at 5:21 AM, Markus Trippelsdorf markus@trippelsdorf.de wrote:

...
IIRC Alex said the sanity checks are expensive and boot-time could be improved by dropping them. Maybe he can chime in?

They shouldn't be necessary with a proper shutdown, but in this particular case, they are not very expensive. What is expensive is having a separate sanity check functions for all the various hw blocks to teardown everything on startup prior to starting it up in case kexec, etc. left the system in a bad state. It ends up amounting to a full tear down sequence followed by a full start up sequence every time you load the driver.

I can't really comment on the first patch, but the rest seem fine.

Let me reask the question just a little bit.

Is it the sanity checks that are expensive? Or is it the reinitialization that is triggered by the sanity checks that is expensive?

From what Christian said in the other reply it sounds like this is a game we will never completely win, but it would be nice to have half a chance in the kexec on panic case to have video. So I am curious to know if the checks are expensive when we are coming at hardware in a clean state.

The particular sanity checks from this patch set are not expensive, but we had previously discussed more extensive sanity checks for other aspects of the chips in prior conversations. Prior to this patch set, the hw is not torn down properly (might have been in the middle of DMA for example) when kexec happens. That's why the sanity checks were added in the first place. With this patch set, the sanity checks shouldn't be necessary.

Alex

Markus Trippelsdorf

11 Sep 11 Sep

8:53 a.m.

On 2013.09.10 at 16:40 -0400, Alex Deucher wrote:

...

On Tue, Sep 10, 2013 at 2:27 PM, Eric W. Biederman ebiederm@xmission.com wrote:

...
Alex Deucher alexdeucher@gmail.com writes:

...
On Mon, Sep 9, 2013 at 5:21 AM, Markus Trippelsdorf markus@trippelsdorf.de wrote:

...
IIRC Alex said the sanity checks are expensive and boot-time could be improved by dropping them. Maybe he can chime in?

They shouldn't be necessary with a proper shutdown, but in this particular case, they are not very expensive. What is expensive is having a separate sanity check functions for all the various hw blocks to teardown everything on startup prior to starting it up in case kexec, etc. left the system in a bad state. It ends up amounting to a full tear down sequence followed by a full start up sequence every time you load the driver.

I can't really comment on the first patch, but the rest seem fine.

Let me reask the question just a little bit.

Is it the sanity checks that are expensive? Or is it the reinitialization that is triggered by the sanity checks that is expensive?

From what Christian said in the other reply it sounds like this is a game we will never completely win, but it would be nice to have half a chance in the kexec on panic case to have video. So I am curious to know if the checks are expensive when we are coming at hardware in a clean state.

The particular sanity checks from this patch set are not expensive, but we had previously discussed more extensive sanity checks for other aspects of the chips in prior conversations. Prior to this patch set, the hw is not torn down properly (might have been in the middle of DMA for example) when kexec happens. That's why the sanity checks were added in the first place. With this patch set, the sanity checks shouldn't be necessary.

I think you're talking past each other. What Eric worries about is the »kexec on panic« case, where the shutdown method *isn't* called. In this case the sanity checks, that are only superfluous when the hardware was shutdown normally during kexec (the default case), may actually help. And because the checks aren't expensive, it wouldn't hurt to just leave them in place.

-- Markus

Christian König

9:21 a.m.

Am 11.09.2013 10:53, schrieb Markus Trippelsdorf:

...

On 2013.09.10 at 16:40 -0400, Alex Deucher wrote:

...
On Tue, Sep 10, 2013 at 2:27 PM, Eric W. Biederman ebiederm@xmission.com wrote:

...
Alex Deucher alexdeucher@gmail.com writes:

...
On Mon, Sep 9, 2013 at 5:21 AM, Markus Trippelsdorf markus@trippelsdorf.de wrote:

...
IIRC Alex said the sanity checks are expensive and boot-time could be improved by dropping them. Maybe he can chime in?

They shouldn't be necessary with a proper shutdown, but in this particular case, they are not very expensive. What is expensive is having a separate sanity check functions for all the various hw blocks to teardown everything on startup prior to starting it up in case kexec, etc. left the system in a bad state. It ends up amounting to a full tear down sequence followed by a full start up sequence every time you load the driver.

I can't really comment on the first patch, but the rest seem fine.

Let me reask the question just a little bit.

Is it the sanity checks that are expensive? Or is it the reinitialization that is triggered by the sanity checks that is expensive?

From what Christian said in the other reply it sounds like this is a game we will never completely win, but it would be nice to have half a chance in the kexec on panic case to have video. So I am curious to know if the checks are expensive when we are coming at hardware in a clean state.

The particular sanity checks from this patch set are not expensive, but we had previously discussed more extensive sanity checks for other aspects of the chips in prior conversations. Prior to this patch set, the hw is not torn down properly (might have been in the middle of DMA for example) when kexec happens. That's why the sanity checks were added in the first place. With this patch set, the sanity checks shouldn't be necessary.

I think you're talking past each other. What Eric worries about is the »kexec on panic« case, where the shutdown method *isn't* called. In this case the sanity checks, that are only superfluous when the hardware was shutdown normally during kexec (the default case), may actually help. And because the checks aren't expensive, it wouldn't hurt to just leave them in place.

When we don't know the state the hardware is in it is really hard to get into a working configuration again, even with the sanity checks in place.

For example you can't just cut power or reset the UVD block without knowing it's state, cause then you have a good chance that you hit into the middle of a register or memory access of the VCPU. This usually results in the next few registers accesses only return either 0x0 or 0xffffffff and after that the system completely locks up.

Isn't it possible to know that we are in a "kexec after panic" case and only try to initialize the display hardware and not all the other blocks? That would at least allow us to get an error message of what happened on the screen and otherwise advise the user to do a clean reset.

Christian.

Alex Deucher

1:40 p.m.

On Wed, Sep 11, 2013 at 4:53 AM, Markus Trippelsdorf markus@trippelsdorf.de wrote:

...

On 2013.09.10 at 16:40 -0400, Alex Deucher wrote:

...
On Tue, Sep 10, 2013 at 2:27 PM, Eric W. Biederman ebiederm@xmission.com wrote:

...
Alex Deucher alexdeucher@gmail.com writes:

...
On Mon, Sep 9, 2013 at 5:21 AM, Markus Trippelsdorf markus@trippelsdorf.de wrote:

...
IIRC Alex said the sanity checks are expensive and boot-time could be improved by dropping them. Maybe he can chime in?

They shouldn't be necessary with a proper shutdown, but in this particular case, they are not very expensive. What is expensive is having a separate sanity check functions for all the various hw blocks to teardown everything on startup prior to starting it up in case kexec, etc. left the system in a bad state. It ends up amounting to a full tear down sequence followed by a full start up sequence every time you load the driver.

I can't really comment on the first patch, but the rest seem fine.

Let me reask the question just a little bit.

Is it the sanity checks that are expensive? Or is it the reinitialization that is triggered by the sanity checks that is expensive?

From what Christian said in the other reply it sounds like this is a game we will never completely win, but it would be nice to have half a chance in the kexec on panic case to have video. So I am curious to know if the checks are expensive when we are coming at hardware in a clean state.

The particular sanity checks from this patch set are not expensive, but we had previously discussed more extensive sanity checks for other aspects of the chips in prior conversations. Prior to this patch set, the hw is not torn down properly (might have been in the middle of DMA for example) when kexec happens. That's why the sanity checks were added in the first place. With this patch set, the sanity checks shouldn't be necessary.

I think you're talking past each other. What Eric worries about is the »kexec on panic« case, where the shutdown method *isn't* called. In this case the sanity checks, that are only superfluous when the hardware was shutdown normally during kexec (the default case), may actually help. And because the checks aren't expensive, it wouldn't hurt to just leave them in place.

For the particular sanity check in this patch set, that's fine. Even with that in place, it only covers a small subset of GPUs and it doesn't cover all possible problematic areas. In general kexec on panic seems like a recipe for trouble. You may end up in the middle of several dma transactions across multiple engines on the GPU. We could add code to tear down every block on the chip on start up in case such an event happened, but like Christian said, some blocks can't really be recovered depending on how things were ended, and then we've doubled the start up time for the driver. Additionally, most of the tear down is dependent on various driver state which we wouldn't have at start up, so most of code would need to re-architected so that it could be called independently of the relevant driver state.

Alex

4262

Age (days ago)

4265

Last active (days ago)

dri-devel@lists.freedesktop.org

18 comments

7 participants

tags (0)

participants (7)

Alex Deucher
Bruno Prémont
Christian König
Daniel Vetter
ebiederm＠xmission.com
Konrad Rzeszutek Wilk
Markus Trippelsdorf