Hi All,
Our last night's test on rpi4 had a nasty trace. The test was with 7c636d4d20f8 ("Merge tag 'dt-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc"). Previous test was on 9e9fb7655ed5 ("Merge tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") and it did not have this trace.
[ 40.975161] Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000348 [ 40.986187] Mem abort info: [ 40.989062] ESR = 0x96000004 [ 40.992233] EC = 0x25: DABT (current EL), IL = 32 bits [ 40.997699] SET = 0, FnV = 0 [ 41.001205] EA = 0, S1PTW = 0 [ 41.004428] FSC = 0x04: level 0 translation fault [ 41.009468] Data abort info: [ 41.012410] ISV = 0, ISS = 0x00000004 [ 41.016325] CM = 0, WnR = 0 [ 41.019358] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000042ae1000 [ 41.025926] [0000000000000348] pgd=0000000000000000, p4d=0000000000000000 [ 41.032845] Internal error: Oops: 96000004 [#1] PREEMPT SMP [ 41.038495] Modules linked in: overlay algif_hash algif_skcipher af_alg bnep sch_fq_codel ppdev lp parport ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage snd_soc_hdmi_codec btsdio brcmfmac brcmutil hci_uart btqca btrtl bcm2835_v4l2(C) btbcm crct10dif_ce bcm2835_mmal_vchiq(C) btintel raspberrypi_hwmon videobuf2_vmalloc videobuf2_memops bluetooth videobuf2_v4l2 videobuf2_common cfg80211 ecdh_generic ecc vc4 drm_kms_helper videodev dwc2 cec snd_bcm2835(C) i2c_brcmstb udc_core roles drm xhci_pci mc pwm_bcm2835 xhci_pci_renesas snd_soc_core ac97_bus snd_pcm_dmaengine snd_pcm phy_generic snd_timer uio_pdrv_genirq snd fb_sys_fops syscopyarea sysfillrect sysimgblt uio aes_neon_bs aes_neon_blk crypto_simd cryptd [ 41.116584] CPU: 0 PID: 1569 Comm: pulseaudio Tainted: G C 5.14.0-7c636d4d20f8 #1 [ 41.125494] Hardware name: Raspberry Pi 4 Model B (DT) [ 41.130699] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 41.137756] pc : vc4_hdmi_audio_prepare+0x3c0/0xba4 [vc4] [ 41.143256] lr : vc4_hdmi_audio_prepare+0x308/0xba4 [vc4] [ 41.148747] sp : ffff800012f73a50 [ 41.152099] x29: ffff800012f73a50 x28: ffff0000562ecc00 x27: 0000000000000000 [ 41.159338] x26: 0000000000000000 x25: 000000000000ac44 x24: 0000000021002003 [ 41.166574] x23: ffff800012f73b40 x22: 0000000000000003 x21: ffff000059400080 [ 41.173811] x20: ffff0000594004c8 x19: 0005833333380600 x18: 0000000000000000 [ 41.181047] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 41.188283] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000991 [ 41.195520] x11: 0000000000000001 x10: 000000000001d4c0 x9 : ffff800009047838 [ 41.202757] x8 : 0000000000000031 x7 : 000000000001d4c0 x6 : 0000000000000030 [ 41.209993] x5 : ffff800012f73a98 x4 : ffff80000905bb60 x3 : 0000000010624dd3 [ 41.217230] x2 : 00000000000003e8 x1 : 0000000000000000 x0 : 0000000000562200 [ 41.224466] Call trace: [ 41.226939] vc4_hdmi_audio_prepare+0x3c0/0xba4 [vc4] [ 41.232080] hdmi_codec_prepare+0xe8/0x1b0 [snd_soc_hdmi_codec] [ 41.238083] snd_soc_pcm_dai_prepare+0x5c/0x10c [snd_soc_core] [ 41.244038] soc_pcm_prepare+0x5c/0x130 [snd_soc_core] [ 41.249276] snd_pcm_prepare+0x150/0x1f0 [snd_pcm] [ 41.254149] snd_pcm_common_ioctl+0x1644/0x1d14 [snd_pcm] [ 41.259635] snd_pcm_ioctl+0x3c/0x5c [snd_pcm] [ 41.264152] __arm64_sys_ioctl+0xb4/0x100 [ 41.268216] invoke_syscall+0x50/0x120 [ 41.272014] el0_svc_common+0x18c/0x1a4 [ 41.275899] do_el0_svc+0x34/0x9c [ 41.279254] el0_svc+0x2c/0xc0 [ 41.282348] el0t_64_sync_handler+0xa4/0x12c [ 41.286673] el0t_64_sync+0x1a4/0x1a8 [ 41.290385] Code: 52807d02 72a20c43 f9400421 9ba37c13 (f941a423) [ 41.296563] ---[ end trace dcfe08f10aaf6873 ]---
You can see the complete dmesg at https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
Hi Sudip,
On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
Hi All,
Our last night's test on rpi4 had a nasty trace. The test was with 7c636d4d20f8 ("Merge tag 'dt-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc"). Previous test was on 9e9fb7655ed5 ("Merge tag 'net-next-5.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next") and it did not have this trace.
[ 40.975161] Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000348 [ 40.986187] Mem abort info: [ 40.989062] ESR = 0x96000004 [ 40.992233] EC = 0x25: DABT (current EL), IL = 32 bits [ 40.997699] SET = 0, FnV = 0 [ 41.001205] EA = 0, S1PTW = 0 [ 41.004428] FSC = 0x04: level 0 translation fault [ 41.009468] Data abort info: [ 41.012410] ISV = 0, ISS = 0x00000004 [ 41.016325] CM = 0, WnR = 0 [ 41.019358] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000042ae1000 [ 41.025926] [0000000000000348] pgd=0000000000000000, p4d=0000000000000000 [ 41.032845] Internal error: Oops: 96000004 [#1] PREEMPT SMP [ 41.038495] Modules linked in: overlay algif_hash algif_skcipher af_alg bnep sch_fq_codel ppdev lp parport ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear uas usb_storage snd_soc_hdmi_codec btsdio brcmfmac brcmutil hci_uart btqca btrtl bcm2835_v4l2(C) btbcm crct10dif_ce bcm2835_mmal_vchiq(C) btintel raspberrypi_hwmon videobuf2_vmalloc videobuf2_memops bluetooth videobuf2_v4l2 videobuf2_common cfg80211 ecdh_generic ecc vc4 drm_kms_helper videodev dwc2 cec snd_bcm2835(C) i2c_brcmstb udc_core roles drm xhci_pci mc pwm_bcm2835 xhci_pci_renesas snd_soc_core ac97_bus snd_pcm_dmaengine snd_pcm phy_generic snd_timer uio_pdrv_genirq snd fb_sys_fops syscopyarea sysfillrect sysimgblt uio aes_neon_bs aes_neon_blk crypto_simd cryptd [ 41.116584] CPU: 0 PID: 1569 Comm: pulseaudio Tainted: G C 5.14.0-7c636d4d20f8 #1 [ 41.125494] Hardware name: Raspberry Pi 4 Model B (DT) [ 41.130699] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 41.137756] pc : vc4_hdmi_audio_prepare+0x3c0/0xba4 [vc4] [ 41.143256] lr : vc4_hdmi_audio_prepare+0x308/0xba4 [vc4] [ 41.148747] sp : ffff800012f73a50 [ 41.152099] x29: ffff800012f73a50 x28: ffff0000562ecc00 x27: 0000000000000000 [ 41.159338] x26: 0000000000000000 x25: 000000000000ac44 x24: 0000000021002003 [ 41.166574] x23: ffff800012f73b40 x22: 0000000000000003 x21: ffff000059400080 [ 41.173811] x20: ffff0000594004c8 x19: 0005833333380600 x18: 0000000000000000 [ 41.181047] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000 [ 41.188283] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000991 [ 41.195520] x11: 0000000000000001 x10: 000000000001d4c0 x9 : ffff800009047838 [ 41.202757] x8 : 0000000000000031 x7 : 000000000001d4c0 x6 : 0000000000000030 [ 41.209993] x5 : ffff800012f73a98 x4 : ffff80000905bb60 x3 : 0000000010624dd3 [ 41.217230] x2 : 00000000000003e8 x1 : 0000000000000000 x0 : 0000000000562200 [ 41.224466] Call trace: [ 41.226939] vc4_hdmi_audio_prepare+0x3c0/0xba4 [vc4] [ 41.232080] hdmi_codec_prepare+0xe8/0x1b0 [snd_soc_hdmi_codec] [ 41.238083] snd_soc_pcm_dai_prepare+0x5c/0x10c [snd_soc_core] [ 41.244038] soc_pcm_prepare+0x5c/0x130 [snd_soc_core] [ 41.249276] snd_pcm_prepare+0x150/0x1f0 [snd_pcm] [ 41.254149] snd_pcm_common_ioctl+0x1644/0x1d14 [snd_pcm] [ 41.259635] snd_pcm_ioctl+0x3c/0x5c [snd_pcm] [ 41.264152] __arm64_sys_ioctl+0xb4/0x100 [ 41.268216] invoke_syscall+0x50/0x120 [ 41.272014] el0_svc_common+0x18c/0x1a4 [ 41.275899] do_el0_svc+0x34/0x9c [ 41.279254] el0_svc+0x2c/0xc0 [ 41.282348] el0t_64_sync_handler+0xa4/0x12c [ 41.286673] el0t_64_sync+0x1a4/0x1a8 [ 41.290385] Code: 52807d02 72a20c43 f9400421 9ba37c13 (f941a423) [ 41.296563] ---[ end trace dcfe08f10aaf6873 ]---
You can see the complete dmesg at https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
What test were you running?
Maxime
Hi Maxime,
On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard maxime@cerno.tech wrote:
Hi Sudip,
On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
Hi All,
<snip>
You can see the complete dmesg at https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
What test were you running?
Nothing particular, its just a boot test that we do every night after pulling from torvalds/linux.git
On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
Hi Maxime,
On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard maxime@cerno.tech wrote:
Hi Sudip,
On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
Hi All,
<snip>
You can see the complete dmesg at https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
What test were you running?
Nothing particular, its just a boot test that we do every night after pulling from torvalds/linux.git
What are you booting to then?
Maxime
Hi Maxime,
On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard maxime@cerno.tech wrote:
On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
Hi Maxime,
On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard maxime@cerno.tech wrote:
Hi Sudip,
On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
Hi All,
<snip>
You can see the complete dmesg at https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
What test were you running?
Nothing particular, its just a boot test that we do every night after pulling from torvalds/linux.git
What are you booting to then?
I am not sure I understood the question. Its an Ubuntu image. The desktop environment is gnome. And as mentioned earlier, we use the HEAD of linus tree every night to boot the rpi4 and test that we can login via desktop environment and that there is no regression in dmesg.
On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
Hi Maxime,
On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard maxime@cerno.tech wrote:
On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
Hi Maxime,
On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard maxime@cerno.tech wrote:
Hi Sudip,
On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
Hi All,
<snip>
You can see the complete dmesg at https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
What test were you running?
Nothing particular, its just a boot test that we do every night after pulling from torvalds/linux.git
What are you booting to then?
I am not sure I understood the question. Its an Ubuntu image. The desktop environment is gnome. And as mentioned earlier, we use the HEAD of linus tree every night to boot the rpi4 and test that we can login via desktop environment and that there is no regression in dmesg.
Looking at the CI, this isn't from a RPi but from qemu?
What defconfig are you using? How did you generate the Ubuntu image? Through debootstrap? Any additional package?
Maxime
On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
Hi Maxime,
On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard maxime@cerno.tech wrote:
On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
Hi Maxime,
On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard maxime@cerno.tech wrote:
Hi Sudip,
On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote:
Hi All,
<snip>
You can see the complete dmesg at https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
What test were you running?
Nothing particular, its just a boot test that we do every night after pulling from torvalds/linux.git
What are you booting to then?
I am not sure I understood the question. Its an Ubuntu image. The desktop environment is gnome. And as mentioned earlier, we use the HEAD of linus tree every night to boot the rpi4 and test that we can login via desktop environment and that there is no regression in dmesg.
Looking at the CI, this isn't from a RPi but from qemu?
What defconfig are you using? How did you generate the Ubuntu image? Through debootstrap? Any additional package?
So qemu (at least on Fedora 34) doesn't seem to have an RPI4 target, nor upstream: https://github.com/qemu/qemu/blob/079b1252e9de384385c9da910262312ec2e574c8/h...
I've tested an Ubuntu 21.04 arm64 build (since it seems like it's what you've been using), built using debootstrap + ubuntu-desktop, both with and without a monitor attached, and up to the desktop once logged in.
I don't see any crash.
Maxime
On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote:
On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
Hi Maxime,
On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard maxime@cerno.tech wrote:
On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
Hi Maxime,
On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard maxime@cerno.tech wrote:
Hi Sudip,
On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote: > Hi All, >
<snip>
> > You can see the complete dmesg at > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8
What test were you running?
Nothing particular, its just a boot test that we do every night after pulling from torvalds/linux.git
What are you booting to then?
I am not sure I understood the question. Its an Ubuntu image. The desktop environment is gnome. And as mentioned earlier, we use the HEAD of linus tree every night to boot the rpi4 and test that we can login via desktop environment and that there is no regression in dmesg.
Looking at the CI, this isn't from a RPi but from qemu?
What defconfig are you using? How did you generate the Ubuntu image? Through debootstrap? Any additional package?
So qemu (at least on Fedora 34) doesn't seem to have an RPI4 target, nor upstream: https://github.com/qemu/qemu/blob/079b1252e9de384385c9da910262312ec2e574c8/h...
I've tested an Ubuntu 21.04 arm64 build (since it seems like it's what you've been using), built using debootstrap + ubuntu-desktop, both with and without a monitor attached, and up to the desktop once logged in.
I don't see any crash.
That was with arm64's defconfig.
Maxime
On Mon, Sep 20, 2021 at 4:53 PM Maxime Ripard maxime@cerno.tech wrote:
On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote:
On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
Hi Maxime,
On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard maxime@cerno.tech wrote:
On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote:
Hi Maxime,
On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard maxime@cerno.tech wrote: > > Hi Sudip, > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote: > > Hi All, > > >
<snip>
> > > > > You can see the complete dmesg at > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8 > > What test were you running?
Nothing particular, its just a boot test that we do every night after pulling from torvalds/linux.git
What are you booting to then?
I am not sure I understood the question. Its an Ubuntu image. The desktop environment is gnome. And as mentioned earlier, we use the HEAD of linus tree every night to boot the rpi4 and test that we can login via desktop environment and that there is no regression in dmesg.
Looking at the CI, this isn't from a RPi but from qemu?
No, this is from rpi4 board (4GB), not qemu. The CI is a little complicated here, lava boots the board with the new kernel and will then trigger the openQA job. openQA will then connect to the board using vnc to test the desktop. This is the last link that I sent to Linus when he asked for it. https://lava.qa.codethink.co.uk/scheduler/job/164#L1356
And this is the lava job for today - https://lava.qa.codethink.co.uk/scheduler/job/173
What defconfig are you using? How did you generate the Ubuntu image? Through debootstrap? Any additional package?
Its the default ubuntu config. I can send you the config if you want.
So qemu (at least on Fedora 34) doesn't seem to have an RPI4 target, nor upstream: https://github.com/qemu/qemu/blob/079b1252e9de384385c9da910262312ec2e574c8/h...
Like I said, its not on qemu.
I've tested an Ubuntu 21.04 arm64 build (since it seems like it's what you've been using), built using debootstrap + ubuntu-desktop, both with and without a monitor attached, and up to the desktop once logged in.
I am using the official ubuntu image downloaded from https://cdimage.ubuntu.com/releases/21.04/release/ubuntu-21.04-preinstalled-...
On Mon, Sep 20, 2021 at 05:35:00PM +0100, Sudip Mukherjee wrote:
On Mon, Sep 20, 2021 at 4:53 PM Maxime Ripard maxime@cerno.tech wrote:
On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote:
On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
Hi Maxime,
On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard maxime@cerno.tech wrote:
On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote: > Hi Maxime, > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard maxime@cerno.tech wrote: > > > > Hi Sudip, > > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote: > > > Hi All, > > > > > > > <snip> > > > > > > > > > You can see the complete dmesg at > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8 > > > > What test were you running? > > Nothing particular, its just a boot test that we do every night after > pulling from torvalds/linux.git
What are you booting to then?
I am not sure I understood the question. Its an Ubuntu image. The desktop environment is gnome. And as mentioned earlier, we use the HEAD of linus tree every night to boot the rpi4 and test that we can login via desktop environment and that there is no regression in dmesg.
Looking at the CI, this isn't from a RPi but from qemu?
No, this is from rpi4 board (4GB), not qemu. The CI is a little complicated here, lava boots the board with the new kernel and will then trigger the openQA job. openQA will then connect to the board using vnc to test the desktop. This is the last link that I sent to Linus when he asked for it. https://lava.qa.codethink.co.uk/scheduler/job/164#L1356
And this is the lava job for today - https://lava.qa.codethink.co.uk/scheduler/job/173
Is it connected to a monitor then?
What defconfig are you using? How did you generate the Ubuntu image? Through debootstrap? Any additional package?
Its the default ubuntu config. I can send you the config if you want.
Yes, please.
Thanks Maxime
On Mon, Sep 20, 2021 at 6:10 PM Maxime Ripard maxime@cerno.tech wrote:
On Mon, Sep 20, 2021 at 05:35:00PM +0100, Sudip Mukherjee wrote:
On Mon, Sep 20, 2021 at 4:53 PM Maxime Ripard maxime@cerno.tech wrote:
On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote:
On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote:
Hi Maxime,
On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard maxime@cerno.tech wrote: > > On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote: > > Hi Maxime, > > > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard maxime@cerno.tech wrote: > > > > > > Hi Sudip, > > > > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote: > > > > Hi All, > > > > > > > > > > > <snip> > > > > > > > > > > > > > You can see the complete dmesg at > > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8 > > > > > > What test were you running? > > > > Nothing particular, its just a boot test that we do every night after > > pulling from torvalds/linux.git > > What are you booting to then?
I am not sure I understood the question. Its an Ubuntu image. The desktop environment is gnome. And as mentioned earlier, we use the HEAD of linus tree every night to boot the rpi4 and test that we can login via desktop environment and that there is no regression in dmesg.
Looking at the CI, this isn't from a RPi but from qemu?
No, this is from rpi4 board (4GB), not qemu. The CI is a little complicated here, lava boots the board with the new kernel and will then trigger the openQA job. openQA will then connect to the board using vnc to test the desktop. This is the last link that I sent to Linus when he asked for it. https://lava.qa.codethink.co.uk/scheduler/job/164#L1356
And this is the lava job for today - https://lava.qa.codethink.co.uk/scheduler/job/173
Is it connected to a monitor then?
Missed replying to this one earlier. I have a hdmi lilliput monitor connected to it.
What defconfig are you using? How did you generate the Ubuntu image? Through debootstrap? Any additional package?
Its the default ubuntu config. I can send you the config if you want.
Yes, please.
Attached. My build script will copy this config as .config, do olddefconfig and then build.
On Mon, Sep 20, 2021 at 06:21:42PM +0100, Sudip Mukherjee wrote:
On Mon, Sep 20, 2021 at 6:10 PM Maxime Ripard maxime@cerno.tech wrote:
On Mon, Sep 20, 2021 at 05:35:00PM +0100, Sudip Mukherjee wrote:
On Mon, Sep 20, 2021 at 4:53 PM Maxime Ripard maxime@cerno.tech wrote:
On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote:
On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote:
On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote: > Hi Maxime, > > On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard maxime@cerno.tech wrote: > > > > On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote: > > > Hi Maxime, > > > > > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard maxime@cerno.tech wrote: > > > > > > > > Hi Sudip, > > > > > > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote: > > > > > Hi All, > > > > > > > > > > > > > > > <snip> > > > > > > > > > > > > > > > > > You can see the complete dmesg at > > > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8 > > > > > > > > What test were you running? > > > > > > Nothing particular, its just a boot test that we do every night after > > > pulling from torvalds/linux.git > > > > What are you booting to then? > > I am not sure I understood the question. > Its an Ubuntu image. The desktop environment is gnome. And as > mentioned earlier, we use the HEAD of linus tree every night to boot > the rpi4 and test that we can login via desktop environment and that > there is no regression in dmesg.
Looking at the CI, this isn't from a RPi but from qemu?
No, this is from rpi4 board (4GB), not qemu. The CI is a little complicated here, lava boots the board with the new kernel and will then trigger the openQA job. openQA will then connect to the board using vnc to test the desktop. This is the last link that I sent to Linus when he asked for it. https://lava.qa.codethink.co.uk/scheduler/job/164#L1356
And this is the lava job for today - https://lava.qa.codethink.co.uk/scheduler/job/173
Is it connected to a monitor then?
Missed replying to this one earlier. I have a hdmi lilliput monitor connected to it.
What defconfig are you using? How did you generate the Ubuntu image? Through debootstrap? Any additional package?
Its the default ubuntu config. I can send you the config if you want.
Yes, please.
Attached. My build script will copy this config as .config, do olddefconfig and then build.
I still can't reproduce it.
What other customisations did you do to that image? It looks like there's some customs scripts in there (test-mainline.sh), what are they doing?
Maxime
On Wed, Sep 22, 2021 at 10:57 AM Maxime Ripard maxime@cerno.tech wrote:
On Mon, Sep 20, 2021 at 06:21:42PM +0100, Sudip Mukherjee wrote:
On Mon, Sep 20, 2021 at 6:10 PM Maxime Ripard maxime@cerno.tech wrote:
On Mon, Sep 20, 2021 at 05:35:00PM +0100, Sudip Mukherjee wrote:
On Mon, Sep 20, 2021 at 4:53 PM Maxime Ripard maxime@cerno.tech wrote:
On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote:
On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote: > On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote: > > Hi Maxime, > > > > On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard maxime@cerno.tech wrote: > > > > > > On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote: > > > > Hi Maxime, > > > > > > > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard maxime@cerno.tech wrote: > > > > > > > > > > Hi Sudip, > > > > > > > > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote: > > > > > > Hi All, > > > > > > > > > > > > > > > > > > > <snip> > > > > > > > > > > > > > > > > > > > > > You can see the complete dmesg at > > > > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8 > > > > > > > > > > What test were you running? > > > > > > > > Nothing particular, its just a boot test that we do every night after > > > > pulling from torvalds/linux.git > > > > > > What are you booting to then? > > > > I am not sure I understood the question. > > Its an Ubuntu image. The desktop environment is gnome. And as > > mentioned earlier, we use the HEAD of linus tree every night to boot > > the rpi4 and test that we can login via desktop environment and that > > there is no regression in dmesg. > > Looking at the CI, this isn't from a RPi but from qemu?
No, this is from rpi4 board (4GB), not qemu. The CI is a little complicated here, lava boots the board with the new kernel and will then trigger the openQA job. openQA will then connect to the board using vnc to test the desktop. This is the last link that I sent to Linus when he asked for it. https://lava.qa.codethink.co.uk/scheduler/job/164#L1356
And this is the lava job for today - https://lava.qa.codethink.co.uk/scheduler/job/173
Is it connected to a monitor then?
Missed replying to this one earlier. I have a hdmi lilliput monitor connected to it.
> > What defconfig are you using? How did you generate the Ubuntu image? > Through debootstrap? Any additional package?
Its the default ubuntu config. I can send you the config if you want.
Yes, please.
Attached. My build script will copy this config as .config, do olddefconfig and then build.
I still can't reproduce it.
What other customisations did you do to that image? It looks like there's some customs scripts in there (test-mainline.sh), what are they doing?
That test script is triggering the openqa job, but its running only after lava is able to login. The trace is appearing before the login prompt even, so test_mainline.sh should not matter here. The only customization done to the default ubuntu image is the initrd. Can you please try with the initrd from https://elisa-builder-00.iol.unh.edu/kernel/mainline/rpi4/initrd.gz.. I will check with another board also.
On Wed, Sep 22, 2021 at 11:10:34AM +0100, Sudip Mukherjee wrote:
On Wed, Sep 22, 2021 at 10:57 AM Maxime Ripard maxime@cerno.tech wrote:
On Mon, Sep 20, 2021 at 06:21:42PM +0100, Sudip Mukherjee wrote:
On Mon, Sep 20, 2021 at 6:10 PM Maxime Ripard maxime@cerno.tech wrote:
On Mon, Sep 20, 2021 at 05:35:00PM +0100, Sudip Mukherjee wrote:
On Mon, Sep 20, 2021 at 4:53 PM Maxime Ripard maxime@cerno.tech wrote:
On Mon, Sep 20, 2021 at 05:43:33PM +0200, Maxime Ripard wrote: > On Mon, Sep 20, 2021 at 04:47:31PM +0200, Maxime Ripard wrote: > > On Sat, Sep 04, 2021 at 10:40:29AM +0100, Sudip Mukherjee wrote: > > > Hi Maxime, > > > > > > On Sat, Sep 4, 2021 at 10:10 AM Maxime Ripard maxime@cerno.tech wrote: > > > > > > > > On Fri, Sep 03, 2021 at 09:09:50PM +0100, Sudip Mukherjee wrote: > > > > > Hi Maxime, > > > > > > > > > > On Fri, Sep 3, 2021 at 5:03 PM Maxime Ripard maxime@cerno.tech wrote: > > > > > > > > > > > > Hi Sudip, > > > > > > > > > > > > On Thu, Sep 02, 2021 at 10:08:19AM +0100, Sudip Mukherjee wrote: > > > > > > > Hi All, > > > > > > > > > > > > > > > > > > > > > > > <snip> > > > > > > > > > > > > > > > > > > > > > > > > > You can see the complete dmesg at > > > > > > > https://openqa.qa.codethink.co.uk/tests/76#step/dmesg/8 > > > > > > > > > > > > What test were you running? > > > > > > > > > > Nothing particular, its just a boot test that we do every night after > > > > > pulling from torvalds/linux.git > > > > > > > > What are you booting to then? > > > > > > I am not sure I understood the question. > > > Its an Ubuntu image. The desktop environment is gnome. And as > > > mentioned earlier, we use the HEAD of linus tree every night to boot > > > the rpi4 and test that we can login via desktop environment and that > > > there is no regression in dmesg. > > > > Looking at the CI, this isn't from a RPi but from qemu?
No, this is from rpi4 board (4GB), not qemu. The CI is a little complicated here, lava boots the board with the new kernel and will then trigger the openQA job. openQA will then connect to the board using vnc to test the desktop. This is the last link that I sent to Linus when he asked for it. https://lava.qa.codethink.co.uk/scheduler/job/164#L1356
And this is the lava job for today - https://lava.qa.codethink.co.uk/scheduler/job/173
Is it connected to a monitor then?
Missed replying to this one earlier. I have a hdmi lilliput monitor connected to it.
> > > > What defconfig are you using? How did you generate the Ubuntu image? > > Through debootstrap? Any additional package?
Its the default ubuntu config. I can send you the config if you want.
Yes, please.
Attached. My build script will copy this config as .config, do olddefconfig and then build.
I still can't reproduce it.
What other customisations did you do to that image? It looks like there's some customs scripts in there (test-mainline.sh), what are they doing?
That test script is triggering the openqa job, but its running only after lava is able to login. The trace is appearing before the login prompt even, so test_mainline.sh should not matter here. The only customization done to the default ubuntu image is the initrd. Can you please try with the initrd from https://elisa-builder-00.iol.unh.edu/kernel/mainline/rpi4/initrd.gz.. I will check with another board also.
Still works fine (and it required some mangling of the kernel command line).
If we summarize:
- You initially just dumped a panic and a link to your QA, without any more context:
- Then stating that you're not doing any test, really;
- Well, except for booting Ubuntu, but no other modification
- But you're not booting the standard image
- And with a custom initrd
- And that QA link states that you're booting from QEMU, but you're not.
Please provide a full documentation on what you're doing to generate that image, from scratch, in order to get that panic you reported previously.
I've spent an entire day trying to make sense of what you're doing exactly to get into that situation. I have other things to work on and I don't plan on figuring out any random CI system.
Maxime
On Wed, Sep 22, 2021 at 12:28 PM Maxime Ripard maxime@cerno.tech wrote:
On Wed, Sep 22, 2021 at 11:10:34AM +0100, Sudip Mukherjee wrote:
On Wed, Sep 22, 2021 at 10:57 AM Maxime Ripard maxime@cerno.tech wrote:
<snip>
Still works fine (and it required some mangling of the kernel command line).
If we summarize:
- You initially just dumped a panic and a link to your QA, without any more context:
The SHA was also given, and I didn't know what else you would need. The openQA link was given to show the dmesg.
- Then stating that you're not doing any test, really;
Yes, and I still say that, its just a boot test.
Well, except for booting Ubuntu, but no other modification
But you're not booting the standard image
And with a custom initrd
yes, something which has always worked in boot-testing LTS kernel or mainline kernel.
- And that QA link states that you're booting from QEMU, but you're not.
I only found that the "WORKER_CLASS" has the name "qemu_rpi4", that is a name which I choose to give as that worker laptop is connected to rpi4 and also running qemu tests. If you want I can change the name of the WORKER_CLASS. :) iiuc, dmesg shows if its booting in a qemu or on a real hardware.
Please provide a full documentation on what you're doing to generate that image, from scratch, in order to get that panic you reported previously.
I have now ordered another rpi4 board and will create the image from scratch and give you the steps.
I've spent an entire day trying to make sense of what you're doing exactly to get into that situation. I have other things to work on and I don't plan on figuring out any random CI system.
I am not really sure why you are trying to figure out a random CI system. I can reproduce the problem in our setup everytime I test with that reverted commit and I have already said I am happy to test with a debug patch or anything else.
On Wed, Sep 22, 2021 at 3:11 AM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
That test script is triggering the openqa job, but its running only after lava is able to login. The trace is appearing before the login prompt even, so test_mainline.sh should not matter here.
Side note: the traces might be more legible if you have debug info in the kernel, and run the dmesg through the script in
scripts/decode_stacktrace.sh
which should give line numbers and inlining information.
That often makes it much easier to see which access it is that hits a NULL pointer dereference.
On x86-64, generally just decode the instruction stream, and look at the instruction patterns and try to figure out where an oops is coming from, but that's much less useful on arm64 (partly because I'm not as used to it, but because the arm64 oopses don't print out much of the instructions so there's often little to go by).
Linus
On Wed, Sep 22, 2021 at 4:25 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Wed, Sep 22, 2021 at 3:11 AM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
That test script is triggering the openqa job, but its running only after lava is able to login. The trace is appearing before the login prompt even, so test_mainline.sh should not matter here.
Side note: the traces might be more legible if you have debug info in the kernel, and run the dmesg through the script in
scripts/decode_stacktrace.sh
which should give line numbers and inlining information.
Attached is a complete dmesg and also the decoded trace. This is done on 4357f03d6611 ("Merge tag 'pm-5.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm")
On Wed, Sep 22, 2021 at 10:02 AM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
Attached is a complete dmesg and also the decoded trace. This is done on 4357f03d6611 ("Merge tag 'pm-5.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm")
drivers/gpu/drm/vc4/vc4_hdmi.c:1214 is
tmp = (u64)(mode->clock * 1000) * n;
in vc4_hdmi_set_n_cts(), which has apparently been inlined from vc4_hdmi_audio_prepare() in vc4_hdmi.c:1398.
So it looks like 'mode' is some offset off a NULL pointer.
Which looks not impossible:
1207 struct drm_connector *connector = &vc4_hdmi->connector; 1208 struct drm_crtc *crtc = connector->state->crtc; 1209 const struct drm_display_mode *mode = &crtc->state->adjusted_mode;
looks like crtc->state perhaps might be NULL.
Although it's entirely possible that it's 'crtc' itself that is NULL or one of the earlier indirection accesses.
The exact line information from the debug info is very useful and mostly correct, but at the same time should always be taken with a small pinch of salt.
Compiler optimizations means that code gets munged and moved around, and since this is the first access to 'mode', I would not be surprised if some of the calculations and accesses to get 'mode' might be moved around to it.
Linus
On Wed, Sep 22, 2021 at 7:23 PM Linus Torvalds torvalds@linux-foundation.org wrote:
On Wed, Sep 22, 2021 at 10:02 AM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
Attached is a complete dmesg and also the decoded trace. This is done on 4357f03d6611 ("Merge tag 'pm-5.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm")
drivers/gpu/drm/vc4/vc4_hdmi.c:1214 is
tmp = (u64)(mode->clock * 1000) * n;
in vc4_hdmi_set_n_cts(), which has apparently been inlined from vc4_hdmi_audio_prepare() in vc4_hdmi.c:1398.
So it looks like 'mode' is some offset off a NULL pointer.
Which looks not impossible:
1207 struct drm_connector *connector = &vc4_hdmi->connector; 1208 struct drm_crtc *crtc = connector->state->crtc; 1209 const struct drm_display_mode *mode = &crtc->state->adjusted_mode;
looks like crtc->state perhaps might be NULL.
I added some debugs to print the addresses, and I am getting: [ 38.813809] sudip crtc 0000000000000000
This is from struct drm_crtc *crtc = connector->state->crtc;
connector and connector->state had valid addresses. [ 38.805302] sudip connector ffff000040bac578 [ 38.809779] sudip state ffff000057eb5400
This is the diff of the debug I added: diff --git a/drivers/gpu/drm/vc4/vc4_hdmi.c b/drivers/gpu/drm/vc4/vc4_hdmi.c index 4a1115043114..2a8f06948094 100644 --- a/drivers/gpu/drm/vc4/vc4_hdmi.c +++ b/drivers/gpu/drm/vc4/vc4_hdmi.c @@ -1205,11 +1205,20 @@ static void vc4_hdmi_audio_set_mai_clock(struct vc4_hdmi *vc4_hdmi, static void vc4_hdmi_set_n_cts(struct vc4_hdmi *vc4_hdmi, unsigned int samplerate) { struct drm_connector *connector = &vc4_hdmi->connector; - struct drm_crtc *crtc = connector->state->crtc; - const struct drm_display_mode *mode = &crtc->state->adjusted_mode; + struct drm_crtc *crtc; + struct drm_display_mode *mode; u32 n, cts; u64 tmp;
+ + pr_err("sudip connector %px\n", connector); + pr_err("sudip state %px\n", connector->state); + crtc = connector->state->crtc; + + pr_err("sudip crtc %px\n", crtc); + pr_err("sudip state %px\n", crtc->state); + pr_err("state mode %px\n", &crtc->state->adjusted_mode); + mode = &crtc->state->adjusted_mode; n = 128 * samplerate / 1000; tmp = (u64)(mode->clock * 1000) * n; do_div(tmp, 128 * samplerate);
On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
I added some debugs to print the addresses, and I am getting: [ 38.813809] sudip crtc 0000000000000000
This is from struct drm_crtc *crtc = connector->state->crtc;
Yeah, that was my personal suspicion, because while the line number implied "crtc->state" being NULL, the drm data structure documentation and other drivers both imply that "crtc" was the more likely one.
I suspect a simple
if (!crtc) return;
in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but I didn't check if there is possibly something else that needs to be done too.
Linus
On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote:
On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
I added some debugs to print the addresses, and I am getting: [ 38.813809] sudip crtc 0000000000000000
This is from struct drm_crtc *crtc = connector->state->crtc;
Yeah, that was my personal suspicion, because while the line number implied "crtc->state" being NULL, the drm data structure documentation and other drivers both imply that "crtc" was the more likely one.
I suspect a simple
if (!crtc) return;
in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but I didn't check if there is possibly something else that needs to be done too.
Thanks for the decode_stacktrace.sh and the follow-up
Yeah, it looks like we have several things wrong here:
* we only check that connector->state is set, and not connector->state->crtc indeed.
* We also check only in startup(), so at open() and not later on when the sound streaming actually start. This has been there for a while, so I guess it's never really been causing a practical issue before.
I'm still not entirely sure how we can end up in that situation though. The only case I could think of is that:
* The firmware enables the HDMI controller, then boots Linux
* The driver starts, registers its audio card. connector->state is NULL then, and if the HDMI monitor is actually an HDMI monitor (vs a DVI monitor), the VC4_HDMI_RAM_PACKET_ENABLE bit that we test in startup will be set.
* The driver will create the connector->state (through a call to drm_mode_config_reset in vc4_kms_load), connector->state isn't NULL anymore, VC4_HDMI_RAM_PACKET_ENABLE is still set.
* The driver then disables the HDMI controller (in vc4_crtc_disable_at_boot) but never clears the VC4_HDMI_RAM_PACKET_ENABLE bit.
* Pulseaudio opens the audio device, startup succeeds because both conditions we test succeed.
* However, since we either never enabled the HDMI connector (or if it was disabled at some point), connector->state->crtc is NULL and we get our NULL pointer dereference.
The Ubuntu configuration has the framebuffer emulation and the framebuffer console enabled, so it's likely to be enabled and something (X.org?) comes along and disables the connector right when pulseaudio calls prepare().
Maxime
On Fri, Sep 24, 2021 at 3:30 PM Maxime Ripard maxime@cerno.tech wrote:
On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote:
On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
I added some debugs to print the addresses, and I am getting: [ 38.813809] sudip crtc 0000000000000000
This is from struct drm_crtc *crtc = connector->state->crtc;
Yeah, that was my personal suspicion, because while the line number implied "crtc->state" being NULL, the drm data structure documentation and other drivers both imply that "crtc" was the more likely one.
I suspect a simple
if (!crtc) return;
in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but I didn't check if there is possibly something else that needs to be done too.
Thanks for the decode_stacktrace.sh and the follow-up
Yeah, it looks like we have several things wrong here:
we only check that connector->state is set, and not connector->state->crtc indeed.
We also check only in startup(), so at open() and not later on when the sound streaming actually start. This has been there for a while, so I guess it's never really been causing a practical issue before.
You also have no locking, plus looking at ->state objects outside of atomic commit machinery makes no sense because you're not actually in sync with the hw state. Relevant bits need to be copied over at commit time, protected by some spinlock (and that spinlock also needs to be held over whatever other stuff you're setting to make sure we don't get a funny out-of-sync state anywhere).
Liberally sprinkling a few NULL checks here doesn't fix much at all, it only papers over design bugs in the code. -Daniel
I'm still not entirely sure how we can end up in that situation though. The only case I could think of is that:
The firmware enables the HDMI controller, then boots Linux
The driver starts, registers its audio card. connector->state is NULL then, and if the HDMI monitor is actually an HDMI monitor (vs a DVI monitor), the VC4_HDMI_RAM_PACKET_ENABLE bit that we test in startup will be set.
The driver will create the connector->state (through a call to drm_mode_config_reset in vc4_kms_load), connector->state isn't NULL anymore, VC4_HDMI_RAM_PACKET_ENABLE is still set.
The driver then disables the HDMI controller (in vc4_crtc_disable_at_boot) but never clears the VC4_HDMI_RAM_PACKET_ENABLE bit.
Pulseaudio opens the audio device, startup succeeds because both conditions we test succeed.
However, since we either never enabled the HDMI connector (or if it was disabled at some point), connector->state->crtc is NULL and we get our NULL pointer dereference.
The Ubuntu configuration has the framebuffer emulation and the framebuffer console enabled, so it's likely to be enabled and something (X.org?) comes along and disables the connector right when pulseaudio calls prepare().
Maxime
Hi Daniel,
On Sat, Sep 25, 2021 at 12:50:17AM +0200, Daniel Vetter wrote:
On Fri, Sep 24, 2021 at 3:30 PM Maxime Ripard maxime@cerno.tech wrote:
On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote:
On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
I added some debugs to print the addresses, and I am getting: [ 38.813809] sudip crtc 0000000000000000
This is from struct drm_crtc *crtc = connector->state->crtc;
Yeah, that was my personal suspicion, because while the line number implied "crtc->state" being NULL, the drm data structure documentation and other drivers both imply that "crtc" was the more likely one.
I suspect a simple
if (!crtc) return;
in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but I didn't check if there is possibly something else that needs to be done too.
Thanks for the decode_stacktrace.sh and the follow-up
Yeah, it looks like we have several things wrong here:
we only check that connector->state is set, and not connector->state->crtc indeed.
We also check only in startup(), so at open() and not later on when the sound streaming actually start. This has been there for a while, so I guess it's never really been causing a practical issue before.
You also have no locking
Indeed. Do we just need locking to prevent a concurrent audio setup and modeset, or do you have another corner case in mind?
Also, generally, what locks should we make sure we have locked when accessing the connector and CRTC state? drm_mode_config.connection_mutex and drm_mode_config.mutex, respectively?
plus looking at ->state objects outside of atomic commit machinery makes no sense because you're not actually in sync with the hw state. Relevant bits need to be copied over at commit time, protected by some spinlock (and that spinlock also needs to be held over whatever other stuff you're setting to make sure we don't get a funny out-of-sync state anywhere).
If we already have a lock protecting against having both an ASoC and KMS function running, it's not clear to me what the spinlock would prevent here?
Maxime
On Tue, Sep 28, 2021 at 10:34:46AM +0200, Maxime Ripard wrote:
Hi Daniel,
On Sat, Sep 25, 2021 at 12:50:17AM +0200, Daniel Vetter wrote:
On Fri, Sep 24, 2021 at 3:30 PM Maxime Ripard maxime@cerno.tech wrote:
On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote:
On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
I added some debugs to print the addresses, and I am getting: [ 38.813809] sudip crtc 0000000000000000
This is from struct drm_crtc *crtc = connector->state->crtc;
Yeah, that was my personal suspicion, because while the line number implied "crtc->state" being NULL, the drm data structure documentation and other drivers both imply that "crtc" was the more likely one.
I suspect a simple
if (!crtc) return;
in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but I didn't check if there is possibly something else that needs to be done too.
Thanks for the decode_stacktrace.sh and the follow-up
Yeah, it looks like we have several things wrong here:
we only check that connector->state is set, and not connector->state->crtc indeed.
We also check only in startup(), so at open() and not later on when the sound streaming actually start. This has been there for a while, so I guess it's never really been causing a practical issue before.
You also have no locking
Indeed. Do we just need locking to prevent a concurrent audio setup and modeset, or do you have another corner case in mind?
Also, generally, what locks should we make sure we have locked when accessing the connector and CRTC state? drm_mode_config.connection_mutex and drm_mode_config.mutex, respectively?
plus looking at ->state objects outside of atomic commit machinery makes no sense because you're not actually in sync with the hw state. Relevant bits need to be copied over at commit time, protected by some spinlock (and that spinlock also needs to be held over whatever other stuff you're setting to make sure we don't get a funny out-of-sync state anywhere).
If we already have a lock protecting against having both an ASoC and KMS function running, it's not clear to me what the spinlock would prevent here?
Replicating the irc chat here. With
commit 6c5ed5ae353cdf156f9ac4db17e15db56b4de880 Author: Maarten Lankhorst maarten.lankhorst@linux.intel.com Date: Thu Apr 6 20:55:20 2017 +0200
drm/atomic: Acquire connection_mutex lock in drm_helper_probe_single_connector_modes, v4.
this is already taken care of for drivers and should be all good from a locking pov. -Daniel
On Thu, Sep 30, 2021 at 11:19:59AM +0200, Daniel Vetter wrote:
On Tue, Sep 28, 2021 at 10:34:46AM +0200, Maxime Ripard wrote:
Hi Daniel,
On Sat, Sep 25, 2021 at 12:50:17AM +0200, Daniel Vetter wrote:
On Fri, Sep 24, 2021 at 3:30 PM Maxime Ripard maxime@cerno.tech wrote:
On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote:
On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote:
I added some debugs to print the addresses, and I am getting: [ 38.813809] sudip crtc 0000000000000000
This is from struct drm_crtc *crtc = connector->state->crtc;
Yeah, that was my personal suspicion, because while the line number implied "crtc->state" being NULL, the drm data structure documentation and other drivers both imply that "crtc" was the more likely one.
I suspect a simple
if (!crtc) return;
in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but I didn't check if there is possibly something else that needs to be done too.
Thanks for the decode_stacktrace.sh and the follow-up
Yeah, it looks like we have several things wrong here:
we only check that connector->state is set, and not connector->state->crtc indeed.
We also check only in startup(), so at open() and not later on when the sound streaming actually start. This has been there for a while, so I guess it's never really been causing a practical issue before.
You also have no locking
Indeed. Do we just need locking to prevent a concurrent audio setup and modeset, or do you have another corner case in mind?
Also, generally, what locks should we make sure we have locked when accessing the connector and CRTC state? drm_mode_config.connection_mutex and drm_mode_config.mutex, respectively?
plus looking at ->state objects outside of atomic commit machinery makes no sense because you're not actually in sync with the hw state. Relevant bits need to be copied over at commit time, protected by some spinlock (and that spinlock also needs to be held over whatever other stuff you're setting to make sure we don't get a funny out-of-sync state anywhere).
If we already have a lock protecting against having both an ASoC and KMS function running, it's not clear to me what the spinlock would prevent here?
Replicating the irc chat here. With
commit 6c5ed5ae353cdf156f9ac4db17e15db56b4de880 Author: Maarten Lankhorst maarten.lankhorst@linux.intel.com Date: Thu Apr 6 20:55:20 2017 +0200
drm/atomic: Acquire connection_mutex lock in drm_helper_probe_single_connector_modes, v4.
this is already taken care of for drivers and should be all good from a locking pov.
So, if I understand this properly, this superseeds your comment on the spinlock for the hw state, but not the comment that we need some locking to synchronize between the audio and KMS path (and CEC?). Right?
Maxime
On Wed, Oct 13, 2021 at 05:01:03PM +0200, Maxime Ripard wrote:
On Thu, Sep 30, 2021 at 11:19:59AM +0200, Daniel Vetter wrote:
On Tue, Sep 28, 2021 at 10:34:46AM +0200, Maxime Ripard wrote:
Hi Daniel,
On Sat, Sep 25, 2021 at 12:50:17AM +0200, Daniel Vetter wrote:
On Fri, Sep 24, 2021 at 3:30 PM Maxime Ripard maxime@cerno.tech wrote:
On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote:
On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee sudipm.mukherjee@gmail.com wrote: > > I added some debugs to print the addresses, and I am getting: > [ 38.813809] sudip crtc 0000000000000000 > > This is from struct drm_crtc *crtc = connector->state->crtc;
Yeah, that was my personal suspicion, because while the line number implied "crtc->state" being NULL, the drm data structure documentation and other drivers both imply that "crtc" was the more likely one.
I suspect a simple
if (!crtc) return;
in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but I didn't check if there is possibly something else that needs to be done too.
Thanks for the decode_stacktrace.sh and the follow-up
Yeah, it looks like we have several things wrong here:
we only check that connector->state is set, and not connector->state->crtc indeed.
We also check only in startup(), so at open() and not later on when the sound streaming actually start. This has been there for a while, so I guess it's never really been causing a practical issue before.
You also have no locking
Indeed. Do we just need locking to prevent a concurrent audio setup and modeset, or do you have another corner case in mind?
Also, generally, what locks should we make sure we have locked when accessing the connector and CRTC state? drm_mode_config.connection_mutex and drm_mode_config.mutex, respectively?
plus looking at ->state objects outside of atomic commit machinery makes no sense because you're not actually in sync with the hw state. Relevant bits need to be copied over at commit time, protected by some spinlock (and that spinlock also needs to be held over whatever other stuff you're setting to make sure we don't get a funny out-of-sync state anywhere).
If we already have a lock protecting against having both an ASoC and KMS function running, it's not clear to me what the spinlock would prevent here?
Replicating the irc chat here. With
commit 6c5ed5ae353cdf156f9ac4db17e15db56b4de880 Author: Maarten Lankhorst maarten.lankhorst@linux.intel.com Date: Thu Apr 6 20:55:20 2017 +0200
drm/atomic: Acquire connection_mutex lock in drm_helper_probe_single_connector_modes, v4.
this is already taken care of for drivers and should be all good from a locking pov.
So, if I understand this properly, this superseeds your comment on the spinlock for the hw state, but not the comment that we need some locking to synchronize between the audio and KMS path (and CEC?). Right?
Other way round. There's 3 things involved here: 1. kms output probe code 2. kms atomic commit code 3. calls from asoc side
The above referenced commit makes sure 1&2 are synchronized. The problem is that 2&3 are not synchonronized, and from 3, no matter how much locking you have, you cannot look at kms state. I.e. not allowed to look at crtc->state for example, irrespective of whether you're holding drm_modeset_lock or not. This is because the atomic nonblocking commit is done without holding any locks, protection is purely down to ownership rules of state structures and ordering (through drm_crtc_commit) of in-flight nonblocking atomic commits.
That's why you need a sperate lock _and_ copy state, so taht 2&3 stay in sync.
In practice you only care about modeset changes from 2 vs anything from 3, and most userspace does modeset atomic commits as blocking commits, which means you won't notice that your locking has gaps.
btw same problem exists between atomic and (vblank) irq handler. There you need a irqsafe spinlock and you also have to copy (because the irq handler just cannot access ->state in any safe way, because it doesn't own that structure).
This is maybe a bit the confusing thing with atomic commit: ->state isn't protected by locks, but through ownership rules. Only for atomic check is ->state protected by locks, but once we're committed we switch over to ownership rules for protection. swap_states() is that point of no return.
Cheers, Daniel
Hi,
On Thu, Oct 14, 2021 at 03:15:36PM +0200, Daniel Vetter wrote:
On Wed, Oct 13, 2021 at 05:01:03PM +0200, Maxime Ripard wrote:
On Thu, Sep 30, 2021 at 11:19:59AM +0200, Daniel Vetter wrote:
On Tue, Sep 28, 2021 at 10:34:46AM +0200, Maxime Ripard wrote:
Hi Daniel,
On Sat, Sep 25, 2021 at 12:50:17AM +0200, Daniel Vetter wrote:
On Fri, Sep 24, 2021 at 3:30 PM Maxime Ripard maxime@cerno.tech wrote:
On Wed, Sep 22, 2021 at 01:25:21PM -0700, Linus Torvalds wrote: > On Wed, Sep 22, 2021 at 1:19 PM Sudip Mukherjee > sudipm.mukherjee@gmail.com wrote: > > > > I added some debugs to print the addresses, and I am getting: > > [ 38.813809] sudip crtc 0000000000000000 > > > > This is from struct drm_crtc *crtc = connector->state->crtc; > > Yeah, that was my personal suspicion, because while the line number > implied "crtc->state" being NULL, the drm data structure documentation > and other drivers both imply that "crtc" was the more likely one. > > I suspect a simple > > if (!crtc) > return; > > in vc4_hdmi_set_n_cts() is at least part of the fix for this all, but > I didn't check if there is possibly something else that needs to be > done too.
Thanks for the decode_stacktrace.sh and the follow-up
Yeah, it looks like we have several things wrong here:
we only check that connector->state is set, and not connector->state->crtc indeed.
We also check only in startup(), so at open() and not later on when the sound streaming actually start. This has been there for a while, so I guess it's never really been causing a practical issue before.
You also have no locking
Indeed. Do we just need locking to prevent a concurrent audio setup and modeset, or do you have another corner case in mind?
Also, generally, what locks should we make sure we have locked when accessing the connector and CRTC state? drm_mode_config.connection_mutex and drm_mode_config.mutex, respectively?
plus looking at ->state objects outside of atomic commit machinery makes no sense because you're not actually in sync with the hw state. Relevant bits need to be copied over at commit time, protected by some spinlock (and that spinlock also needs to be held over whatever other stuff you're setting to make sure we don't get a funny out-of-sync state anywhere).
If we already have a lock protecting against having both an ASoC and KMS function running, it's not clear to me what the spinlock would prevent here?
Replicating the irc chat here. With
commit 6c5ed5ae353cdf156f9ac4db17e15db56b4de880 Author: Maarten Lankhorst maarten.lankhorst@linux.intel.com Date: Thu Apr 6 20:55:20 2017 +0200
drm/atomic: Acquire connection_mutex lock in drm_helper_probe_single_connector_modes, v4.
this is already taken care of for drivers and should be all good from a locking pov.
So, if I understand this properly, this superseeds your comment on the spinlock for the hw state, but not the comment that we need some locking to synchronize between the audio and KMS path (and CEC?). Right?
Other way round. There's 3 things involved here:
- kms output probe code
- kms atomic commit code
- calls from asoc side
The above referenced commit makes sure 1&2 are synchronized. The problem is that 2&3 are not synchonronized, and from 3, no matter how much locking you have, you cannot look at kms state. I.e. not allowed to look at crtc->state for example, irrespective of whether you're holding drm_modeset_lock or not. This is because the atomic nonblocking commit is done without holding any locks, protection is purely down to ownership rules of state structures and ordering (through drm_crtc_commit) of in-flight nonblocking atomic commits.
That's why you need a sperate lock _and_ copy state, so taht 2&3 stay in sync.
In practice you only care about modeset changes from 2 vs anything from 3, and most userspace does modeset atomic commits as blocking commits, which means you won't notice that your locking has gaps.
btw same problem exists between atomic and (vblank) irq handler. There you need a irqsafe spinlock and you also have to copy (because the irq handler just cannot access ->state in any safe way, because it doesn't own that structure).
This is maybe a bit the confusing thing with atomic commit: ->state isn't protected by locks, but through ownership rules. Only for atomic check is ->state protected by locks, but once we're committed we switch over to ownership rules for protection. swap_states() is that point of no return.
Thanks for the clarifications, I just posted a series that should be implementing this here: https://lore.kernel.org/dri-devel/20211025141113.702757-1-maxime@cerno.tech/
Maxime
dri-devel@lists.freedesktop.org