Easily hit the below list corruption:
==
list_add corruption. prev->next should be next (ffffffffc0ceb090), but was ffffec604507edc8. (prev=ffffec604507edc8).
WARNING: CPU: 65 PID: 3959 at lib/list_debug.c:26 __list_add_valid+0x53/0x80
CPU: 65 PID: 3959 Comm: fbdev Tainted: G U
RIP: 0010:__list_add_valid+0x53/0x80
Call Trace:
 <TASK>
 fb_deferred_io_mkwrite+0xea/0x150
 do_page_mkwrite+0x57/0xc0
 do_wp_page+0x278/0x2f0
 __handle_mm_fault+0xdc2/0x1590
 handle_mm_fault+0xdd/0x2c0
 do_user_addr_fault+0x1d3/0x650
 exc_page_fault+0x77/0x180
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fd98fc8fad1
==
The race happens when one process is adding &page->lru to the tail of the pagelist in fb_deferred_io_mkwrite() while another process is re-initializing the same &page->lru in fb_deferred_io_fault(), which is not protected by the lock.
This fix initializes all the page lists once during initialization. It not only fixes the list corruption, but also avoids redundant INIT_LIST_HEAD() calls.
Fixes: 105a940416fc ("fbdev/defio: Early-out if page is already enlisted")
Cc: Thomas Zimmermann <tzimmermann@suse.de>
Signed-off-by: Chuansheng Liu <chuansheng.liu@intel.com>
---
 drivers/video/fbdev/core/fb_defio.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/video/fbdev/core/fb_defio.c b/drivers/video/fbdev/core/fb_defio.c
index 98b0f23bf5e2..eafb66ca4f28 100644
--- a/drivers/video/fbdev/core/fb_defio.c
+++ b/drivers/video/fbdev/core/fb_defio.c
@@ -59,7 +59,6 @@ static vm_fault_t fb_deferred_io_fault(struct vm_fault *vmf)
 		printk(KERN_ERR "no mapping available\n");
 
 	BUG_ON(!page->mapping);
-	INIT_LIST_HEAD(&page->lru);
 	page->index = vmf->pgoff;
 
 	vmf->page = page;
@@ -220,6 +219,8 @@ static void fb_deferred_io_work(struct work_struct *work)
 void fb_deferred_io_init(struct fb_info *info)
 {
 	struct fb_deferred_io *fbdefio = info->fbdefio;
+	struct page *page;
+	int i;
 
 	BUG_ON(!fbdefio);
 	mutex_init(&fbdefio->lock);
@@ -227,6 +228,12 @@ void fb_deferred_io_init(struct fb_info *info)
 	INIT_LIST_HEAD(&fbdefio->pagelist);
 	if (fbdefio->delay == 0) /* set a default of 1 s */
 		fbdefio->delay = HZ;
+
+	/* initialize all the page lists one time */
+	for (i = 0; i < info->fix.smem_len; i += PAGE_SIZE) {
+		page = fb_deferred_io_page(info, i);
+		INIT_LIST_HEAD(&page->lru);
+	}
 }
 EXPORT_SYMBOL_GPL(fb_deferred_io_init);
Hi Chuansheng,
On Thu, Mar 17, 2022 at 7:17 AM Chuansheng Liu chuansheng.liu@intel.com wrote:
Thanks for your patch!
--- a/drivers/video/fbdev/core/fb_defio.c
+++ b/drivers/video/fbdev/core/fb_defio.c
@@ -220,6 +219,8 @@ static void fb_deferred_io_work(struct work_struct *work)
 void fb_deferred_io_init(struct fb_info *info)
 {
 	struct fb_deferred_io *fbdefio = info->fbdefio;
+	struct page *page;
+	int i;

unsigned int i;

 	BUG_ON(!fbdefio);
 	mutex_init(&fbdefio->lock);
@@ -227,6 +228,12 @@ void fb_deferred_io_init(struct fb_info *info)
 	INIT_LIST_HEAD(&fbdefio->pagelist);
 	if (fbdefio->delay == 0) /* set a default of 1 s */
 		fbdefio->delay = HZ;
+
+	/* initialize all the page lists one time */
+	for (i = 0; i < info->fix.smem_len; i += PAGE_SIZE) {
+		page = fb_deferred_io_page(info, i);
+		INIT_LIST_HEAD(&page->lru);
+	}
 }
 EXPORT_SYMBOL_GPL(fb_deferred_io_init);
Gr{oetje,eeting}s,
Geert
-- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
Hello Chuansheng,
On 3/17/22 06:46, Chuansheng Liu wrote:
This makes sense to me. If you address Geert's comment and post a v2, feel free to add:
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Hi
On 17.03.22 at 06:46, Chuansheng Liu wrote:
If you fix Geert's comment, feel free to add
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Best regards
Thomas
Dear Chuansheng,
On 17.03.22 at 06:46, Chuansheng Liu wrote:
Applying your patch on top of current Linus’ master branch, tty0 is unusable and looks frozen. Sometimes network card still works, sometimes not.
$ git log --oneline -nodecorate -2
1b351a77ed33 (HEAD -> linus) fbdev: defio: fix the pagelist corruption
52d543b5497c (origin/master, origin/HEAD) Merge tag 'for-linus-5.17-1' of https://github.com/cminyard/linux-ipmi
```
[    5.256996] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[    5.269582] page dumped because: VM_BUG_ON_PAGE(compound && compound_order(page) != order)
[    5.279507] ------------[ cut here ]------------
[    5.286406] kernel BUG at mm/page_alloc.c:1326!
[    5.291814] invalid opcode: 0000 [#1] PREEMPT SMP
[    5.296350] CPU: 0 PID: 167 Comm: systemd-udevd Not tainted 5.17.0-10753-g1b351a77ed33 #300
[    5.304670] Hardware name: ASUS F2A85-M_PRO/F2A85-M_PRO, BIOS 4.16-337-gb87986e67b 03/25/2022
[    5.313163] RIP: 0010:free_pcp_prepare+0x295/0x400
[    5.317930] Code: 00 01 00 75 0b 48 8b 45 08 45 31 ff a8 01 74 4b 48 8b 45 00 a9 00 00 01 00 75 22 48 c7 c6 68 30 11 96 48 89 ef e8 cb 29 fd ff <0f> 0b 48 89 ef 41 83 c6 01 e8 bd f5 ff ff e9 2e fe ff ff 0f 1f 44
[    5.336650] RSP: 0018:ffffa6634062f9c0 EFLAGS: 00010246
[    5.341849] RAX: 000000000000004e RBX: ffffe4be80000000 RCX: 0000000000000000
[    5.348957] RDX: 0000000000000000 RSI: ffffffff96136a37 RDI: 00000000ffffffff
[    5.356063] RBP: ffffe4be840c0000 R08: 0000000000000000 R09: 00000000ffffdfff
[    5.363170] R10: ffffa6634062f7f0 R11: ffffffff9652c4a8 R12: 0000000000000000
[    5.370277] R13: 0000000000000009 R14: ffff91fd02ebc640 R15: ffffe4be840c0000
[    5.377384] FS: 0000000000000000(0000) GS:ffff91fd7b400000(0063) knlGS:00000000f7eea800
[    5.385443] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[    5.391164] CR2: 00000000f6f0e840 CR3: 0000000106b60000 CR4: 00000000000406f0
[    5.398272] Call Trace:
[    5.400697] <TASK>
[    5.402778] free_unref_page+0x1b/0xf0
[    5.406505] __vunmap+0x216/0x2c0
[    5.409798] drm_fbdev_cleanup+0x5f/0xb0
[    5.413698] drm_fbdev_fb_destroy+0x15/0x30
[    5.417857] unregister_framebuffer+0x2c/0x40
[    5.422191] drm_client_dev_unregister+0x69/0xe0
[    5.422962] usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 5.17
[    5.426784] drm_dev_unregister+0x2e/0x80
[    5.439005] drm_dev_unplug+0x21/0x40
[    5.442645] simpledrm_remove+0x11/0x20
[    5.446458] platform_remove+0x1f/0x40
[    5.450185] __device_release_driver+0x17a/0x250
[    5.454779] device_release_driver+0x24/0x30
[    5.459024] bus_remove_device+0xd8/0x140
[    5.463012] device_del+0x18b/0x3f0
[    5.466478] ? idr_alloc_cyclic+0x50/0xb0
[    5.470466] platform_device_del.part.0+0x13/0x70
[    5.475146] platform_device_unregister+0x1c/0x30
[    5.479824] drm_aperture_detach_drivers+0xa1/0xd0
[    5.484593] drm_aperture_remove_conflicting_pci_framebuffers+0x3f/0x60
[    5.491179] radeon_pci_probe+0x54/0xf0 [radeon]
[    5.495773] local_pci_probe+0x45/0x80
[    5.499499] ? pci_match_device+0xd7/0x130
[    5.503572] pci_device_probe+0xc2/0x1e0
[    5.507474] really_probe+0x1f5/0x3d0
[    5.511112] __driver_probe_device+0xfe/0x180
[    5.515446] driver_probe_device+0x1e/0x90
[    5.519518] __driver_attach+0xc0/0x1c0
[    5.523332] ? __device_attach_driver+0xe0/0xe0
[    5.527839] ? __device_attach_driver+0xe0/0xe0
[    5.532346] bus_for_each_dev+0x78/0xc0
[    5.536159] bus_add_driver+0x149/0x1e0
[    5.539973] driver_register+0x8f/0xe0
[    5.543699] ? 0xffffffffc0741000
[    5.546992] do_one_initcall+0x44/0x200
[    5.550806] ? kmem_cache_alloc_trace+0x170/0x2c0
[    5.555487] do_init_module+0x4c/0x240
[    5.559213] __do_sys_finit_module+0xb4/0x120
[    5.563547] __do_fast_syscall_32+0x6b/0xe0
[    5.567706] do_fast_syscall_32+0x2f/0x70
[    5.571693] entry_SYSCALL_compat_after_hwframe+0x45/0x4d
[    5.577067] RIP: 0023:0xf7efa549
[    5.580273] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 cd 0f 05 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
[    5.582805] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[    5.598992] RSP: 002b:00000000ff831c0c EFLAGS: 00200296 ORIG_RAX: 000000000000015e
[    5.598996] RAX: ffffffffffffffda RBX: 0000000000000011 RCX: 00000000f7ed9e09
[    5.598998] RDX: 0000000000000000 RSI: 0000000056a5c940 RDI: 0000000056a5c4c0
[    5.598999] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[    5.635047] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[    5.642154] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    5.649264] </TASK>
[    5.651427] Modules linked in: crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi radeon(+) r8169 xhci_pci(+) realtek snd_hda_intel drm_ttm_helper snd_intel_dspcfg k10temp snd_hda_codec ttm snd_hda_core xhci_hcd snd_pcm sg ohci_hcd ehci_pci(+) snd_timer drm_dp_helper snd ehci_hcd soundcore i2c_piix4 acpi_cpufreq coreboot_table fuse ipv6 autofs4
[    5.690975] r8169 0000:04:00.0 enp4s0: renamed from eth0
[    5.691589] ---[ end trace 0000000000000000 ]---
[    5.704791] RIP: 0010:free_pcp_prepare+0x295/0x400
[    5.709784] Code: 00 01 00 75 0b 48 8b 45 08 45 31 ff a8 01 74 4b 48 8b 45 00 a9 00 00 01 00 75 22 48 c7 c6 68 30 11 96 48 89 ef e8 cb 29 fd ff <0f> 0b 48 89 ef 41 83 c6 01 e8 bd f5 ff ff e9 2e fe ff ff 0f 1f 44
[    5.731535] RSP: 0018:ffffa6634062f9c0 EFLAGS: 00010246
[    5.752988] usb usb4: Product: xHCI Host Controller
[    5.758571] usb usb4: Manufacturer: Linux 5.17.0-10753-g1b351a77ed33 xhci-hcd
[    5.767096] usb usb4: SerialNumber: 0000:03:00.0
[    5.772213] hub 4-0:1.0: USB hub found
[    5.782383] hub 4-0:1.0: 2 ports detected
[    5.799251] RAX: 000000000000004e RBX: ffffe4be80000000 RCX: 0000000000000000
[    5.810470] RDX: 0000000000000000 RSI: ffffffff96136a37 RDI: 00000000ffffffff
[    5.817561] RBP: ffffe4be840c0000 R08: 0000000000000000 R09: 00000000ffffdfff
[    5.824680] R10: ffffa6634062f7f0 R11: ffffffff9652c4a8 R12: 0000000000000000
[    5.831739] R13: 0000000000000009 R14: ffff91fd02ebc640 R15: ffffe4be840c0000
[    5.839445] FS: 0000000000000000(0000) GS:ffff91fd7b500000(0063) knlGS:00000000f7eea800
[    5.847905] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[    5.854025] CR2: 000000005664d26c CR3: 0000000106b60000 CR4: 00000000000406e0
```
Kind regards,
Paul
PS: For some reason, lore.kernel.org lists most messages twice [1].
PPS: I actually wanted to analyze the new regression and thought your patch might help, but it made things worse. ;-) (The log excerpt is from Linux master.)
```
[    1.738965] BUG: Bad page state in process systemd-udevd pfn:103003
[    1.738974] fbcon: Taking over console
[    1.740459] page:00000000c3b5c591 refcount:0 mapcount:0 mapping:0000000000000000 index:0x3 pfn:0x103003
[    1.740466] head:000000009b49a8e9 order:9 compound_mapcount:0 compound_pincount:0
[    1.740468] flags: 0x2fffc000010000(head|node=0|zone=2|lastcpupid=0x3fff)
[    1.740473] raw: 002fffc000000000 fffff139840c0001 fffff139840c00c8 0000000000000000
[    1.740475] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[    1.740477] head: 002fffc000010000 0000000000000000 dead000000000122 0000000000000000
[    1.740479] head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[    1.740480] page dumped because: corrupted mapping in tail page
```
I am going to do that in another thread.
[1]: https://lore.kernel.org/all/20220317054602.28846-1-chuansheng.liu@intel.com/
Hi Paul,
-----Original Message-----
From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of Paul Menzel
Sent: Saturday, March 26, 2022 4:11 PM
To: Liu, Chuansheng <chuansheng.liu@intel.com>
Cc: linux-fbdev@vger.kernel.org; deller@gmx.de; dri-devel@lists.freedesktop.org; tzimmermann@suse.de; jayalk@intworks.biz
Subject: Re: [PATCH] fbdev: defio: fix the pagelist corruption
Applying your patch on top of current Linus’ master branch, tty0 is unusable and looks frozen. Sometimes network card still works, sometimes not.
I don't see how the patch would cause the BUG call stack below; I need some time to debug. Just a few comments:
1. Does the system work well without this patch?
2. If you are sure the patch causes the regression you saw, please feel free to submit a revert patch, thanks :)
Dear Chuansheng,
On 28.03.22 at 02:58, Liu, Chuansheng wrote:
Applying your patch on top of current Linus’ master branch, tty0 is unusable and looks frozen. Sometimes network card still works, sometimes not.
I don't see how the patch would cause the BUG call stack below; I need some time to debug. Just a few comments:
- Does the system work well without this patch?
Yes, the framebuffer works well without the patch.
- If you are sure the patch causes the regression you saw, please feel free to submit a revert patch, thanks :)
I think your patch wasn’t submitted yet, or at least not pulled by Linus.
$ git log --oneline -nodecorate -2 1b351a77ed33 (HEAD -> linus) fbdev: defio: fix the pagelist corruption 52d543b5497c (origin/master, origin/HEAD) Merge tag 'for-linus-5.17-1' of https://github.com/cminyard/linux-ipmi
[ 5.256996] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 5.269582] page dumped because: VM_BUG_ON_PAGE(compound && compound_order(page) != order)
[ 5.279507] ------------[ cut here ]------------
[ 5.286406] kernel BUG at mm/page_alloc.c:1326!
[ 5.291814] invalid opcode: 0000 [#1] PREEMPT SMP
[ 5.296350] CPU: 0 PID: 167 Comm: systemd-udevd Not tainted 5.17.0-10753-g1b351a77ed33 #300
[ 5.304670] Hardware name: ASUS F2A85-M_PRO/F2A85-M_PRO, BIOS 4.16-337-gb87986e67b 03/25/2022
[ 5.313163] RIP: 0010:free_pcp_prepare+0x295/0x400
[ 5.317930] Code: 00 01 00 75 0b 48 8b 45 08 45 31 ff a8 01 74 4b 48 8b 45 00 a9 00 00 01 00 75 22 48 c7 c6 68 30 11 96 48 89 ef e8 cb 29 fd ff <0f> 0b 48 89 ef 41 83 c6 01 e8 bd f5 ff ff e9 2e fe ff ff 0f 1f 44
[ 5.336650] RSP: 0018:ffffa6634062f9c0 EFLAGS: 00010246
[ 5.341849] RAX: 000000000000004e RBX: ffffe4be80000000 RCX: 0000000000000000
[ 5.348957] RDX: 0000000000000000 RSI: ffffffff96136a37 RDI: 00000000ffffffff
[ 5.356063] RBP: ffffe4be840c0000 R08: 0000000000000000 R09: 00000000ffffdfff
[ 5.363170] R10: ffffa6634062f7f0 R11: ffffffff9652c4a8 R12: 0000000000000000
[ 5.370277] R13: 0000000000000009 R14: ffff91fd02ebc640 R15: ffffe4be840c0000
[ 5.377384] FS: 0000000000000000(0000) GS:ffff91fd7b400000(0063) knlGS:00000000f7eea800
[ 5.385443] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 5.391164] CR2: 00000000f6f0e840 CR3: 0000000106b60000 CR4: 00000000000406f0
[ 5.398272] Call Trace:
[ 5.400697] <TASK>
[ 5.402778] free_unref_page+0x1b/0xf0
[ 5.406505] __vunmap+0x216/0x2c0
[ 5.409798] drm_fbdev_cleanup+0x5f/0xb0
[ 5.413698] drm_fbdev_fb_destroy+0x15/0x30
[ 5.417857] unregister_framebuffer+0x2c/0x40
[ 5.422191] drm_client_dev_unregister+0x69/0xe0
[ 5.422962] usb usb4: New USB device found, idVendor=1d6b, idProduct=0003, bcdDevice= 5.17
[ 5.426784] drm_dev_unregister+0x2e/0x80
[ 5.439005] drm_dev_unplug+0x21/0x40
[ 5.442645] simpledrm_remove+0x11/0x20
[ 5.446458] platform_remove+0x1f/0x40
[ 5.450185] __device_release_driver+0x17a/0x250
[ 5.454779] device_release_driver+0x24/0x30
[ 5.459024] bus_remove_device+0xd8/0x140
[ 5.463012] device_del+0x18b/0x3f0
[ 5.466478] ? idr_alloc_cyclic+0x50/0xb0
[ 5.470466] platform_device_del.part.0+0x13/0x70
[ 5.475146] platform_device_unregister+0x1c/0x30
[ 5.479824] drm_aperture_detach_drivers+0xa1/0xd0
[ 5.484593] drm_aperture_remove_conflicting_pci_framebuffers+0x3f/0x60
[ 5.491179] radeon_pci_probe+0x54/0xf0 [radeon]
[ 5.495773] local_pci_probe+0x45/0x80
[ 5.499499] ? pci_match_device+0xd7/0x130
[ 5.503572] pci_device_probe+0xc2/0x1e0
[ 5.507474] really_probe+0x1f5/0x3d0
[ 5.511112] __driver_probe_device+0xfe/0x180
[ 5.515446] driver_probe_device+0x1e/0x90
[ 5.519518] __driver_attach+0xc0/0x1c0
[ 5.523332] ? __device_attach_driver+0xe0/0xe0
[ 5.527839] ? __device_attach_driver+0xe0/0xe0
[ 5.532346] bus_for_each_dev+0x78/0xc0
[ 5.536159] bus_add_driver+0x149/0x1e0
[ 5.539973] driver_register+0x8f/0xe0
[ 5.543699] ? 0xffffffffc0741000
[ 5.546992] do_one_initcall+0x44/0x200
[ 5.550806] ? kmem_cache_alloc_trace+0x170/0x2c0
[ 5.555487] do_init_module+0x4c/0x240
[ 5.559213] __do_sys_finit_module+0xb4/0x120
[ 5.563547] __do_fast_syscall_32+0x6b/0xe0
[ 5.567706] do_fast_syscall_32+0x2f/0x70
[ 5.571693] entry_SYSCALL_compat_after_hwframe+0x45/0x4d
[ 5.577067] RIP: 0023:0xf7efa549
[ 5.580273] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 cd 0f 05 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
[ 5.582805] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1
[ 5.598992] RSP: 002b:00000000ff831c0c EFLAGS: 00200296 ORIG_RAX: 000000000000015e
[ 5.598996] RAX: ffffffffffffffda RBX: 0000000000000011 RCX: 00000000f7ed9e09
[ 5.598998] RDX: 0000000000000000 RSI: 0000000056a5c940 RDI: 0000000056a5c4c0
[ 5.598999] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 5.635047] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 5.642154] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 5.649264] </TASK>
[ 5.651427] Modules linked in: crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi radeon(+) r8169 xhci_pci(+) realtek snd_hda_intel drm_ttm_helper snd_intel_dspcfg k10temp snd_hda_codec ttm snd_hda_core xhci_hcd snd_pcm sg ohci_hcd ehci_pci(+) snd_timer drm_dp_helper snd ehci_hcd soundcore i2c_piix4 acpi_cpufreq coreboot_table fuse ipv6 autofs4
[ 5.690975] r8169 0000:04:00.0 enp4s0: renamed from eth0
[ 5.691589] ---[ end trace 0000000000000000 ]---
[ 5.704791] RIP: 0010:free_pcp_prepare+0x295/0x400
[ 5.709784] Code: 00 01 00 75 0b 48 8b 45 08 45 31 ff a8 01 74 4b 48 8b 45 00 a9 00 00 01 00 75 22 48 c7 c6 68 30 11 96 48 89 ef e8 cb 29 fd ff <0f> 0b 48 89 ef 41 83 c6 01 e8 bd f5 ff ff e9 2e fe ff ff 0f 1f 44
[ 5.731535] RSP: 0018:ffffa6634062f9c0 EFLAGS: 00010246
[ 5.752988] usb usb4: Product: xHCI Host Controller
[ 5.758571] usb usb4: Manufacturer: Linux 5.17.0-10753-g1b351a77ed33 xhci-hcd
[ 5.767096] usb usb4: SerialNumber: 0000:03:00.0
[ 5.772213] hub 4-0:1.0: USB hub found
[ 5.782383] hub 4-0:1.0: 2 ports detected
[ 5.799251] RAX: 000000000000004e RBX: ffffe4be80000000 RCX: 0000000000000000
[ 5.810470] RDX: 0000000000000000 RSI: ffffffff96136a37 RDI: 00000000ffffffff
[ 5.817561] RBP: ffffe4be840c0000 R08: 0000000000000000 R09: 00000000ffffdfff
[ 5.824680] R10: ffffa6634062f7f0 R11: ffffffff9652c4a8 R12: 0000000000000000
[ 5.831739] R13: 0000000000000009 R14: ffff91fd02ebc640 R15: ffffe4be840c0000
[ 5.839445] FS: 0000000000000000(0000) GS:ffff91fd7b500000(0063) knlGS:00000000f7eea800
[ 5.847905] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
[ 5.854025] CR2: 000000005664d26c CR3: 0000000106b60000 CR4: 00000000000406e0
PS: For some reason, the lore.kernel.org lists most messages twice [1].
PPS: I actually wanted to analyze the new regression, and thought your patch might help, but it made things worse. ;-) (The log excerpt is from Linux master.)
[ 1.738965] BUG: Bad page state in process systemd-udevd pfn:103003
[ 1.738974] fbcon: Taking over console
[ 1.740459] page:00000000c3b5c591 refcount:0 mapcount:0 mapping:0000000000000000 index:0x3 pfn:0x103003
[ 1.740466] head:000000009b49a8e9 order:9 compound_mapcount:0 compound_pincount:0
[ 1.740468] flags: 0x2fffc000010000(head|node=0|zone=2|lastcpupid=0x3fff)
[ 1.740473] raw: 002fffc000000000 fffff139840c0001 fffff139840c00c8 0000000000000000
[ 1.740475] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 1.740477] head: 002fffc000010000 0000000000000000 dead000000000122 0000000000000000
[ 1.740479] head: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
[ 1.740480] page dumped because: corrupted mapping in tail page
I am going to do that in another thread.
This is [2].
Kind regards,
Paul
[2]: https://lore.kernel.org/bpf/7edcd673-decf-7b4e-1f6e-f2e0e26f757a@molgen.mpg....
Hi Paul,
> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of Paul Menzel
> Sent: Monday, March 28, 2022 2:15 PM
> To: Liu, Chuansheng <chuansheng.liu@intel.com>
> Cc: tzimmermann@suse.de; linux-fbdev@vger.kernel.org; deller@gmx.de; dri-devel@lists.freedesktop.org; jayalk@intworks.biz
> Subject: Re: [PATCH] fbdev: defio: fix the pagelist corruption
> I think your patch wasn’t submitted yet – at least not pulled by Linus.
The patch is already in drm-tip. Could you try the latest drm-tip to see whether the framebuffer works well there? Depending on the result, we could then revert it in drm-tip.
Best Regards
Chuansheng
[Cc: -jayalk@intworks.biz as it bounces]
Dear Chuansheng,
Am 29.03.22 um 01:58 schrieb Liu, Chuansheng:
> The patch is already in drm-tip. Could you try the latest drm-tip to see whether the framebuffer works well there? Depending on the result, we could then revert it in drm-tip.
With drm-tip (drm-tip: 2022y-03m-29d-13h-14m-35s UTC integration manifest) everything works fine. (I had to disable amdgpu driver, as it failed to build.) Is anyone able to explain that?
Kind regards,
Paul
Hi Paul,
> -----Original Message-----
> From: Paul Menzel <pmenzel@molgen.mpg.de>
> Sent: Thursday, March 31, 2022 12:47 AM
> To: Liu, Chuansheng <chuansheng.liu@intel.com>
> Cc: tzimmermann@suse.de; linux-fbdev@vger.kernel.org; deller@gmx.de; dri-devel@lists.freedesktop.org
> Subject: Re: [PATCH] fbdev: defio: fix the pagelist corruption
> With drm-tip (drm-tip: 2022y-03m-29d-13h-14m-35s UTC integration manifest) everything works fine. (I had to disable amdgpu driver, as it failed to build.) Is anyone able to explain that?
My patch fixes another patch that is in drm-tip at least, so I assume applying my patch directly to Linus’ tree is not entirely proper. That is why I asked for your help retesting drm-tip.
By “everything works fine”, do you mean that the other issue you hit is also gone?
Best Regards
Chuansheng
Dear Chuansheng,
Am 31.03.22 um 02:06 schrieb Liu, Chuansheng:
>> With drm-tip (drm-tip: 2022y-03m-29d-13h-14m-35s UTC integration manifest) everything works fine. (I had to disable amdgpu driver, as it failed to build.) Is anyone able to explain that?
> My patch fixes another patch that is in drm-tip at least,
The referenced commit 105a940416fc in the Fixes tag is also in Linus’ master branch.
> so I assume applying my patch directly to Linus’ tree is not entirely proper. That is why I asked for your help retesting drm-tip.
If there were such a relation, that would need to be documented in the commit message.
> By “everything works fine”, do you mean that the other issue you hit is also gone?
No, I just mean the hang when applying your patch.
Anyway, after figuring out that drm-tip is actually not behind Linus’ master branch, I tried to figure out the differences, and it turns out the hang is also related to commit fac54e2bfb5b (x86/Kconfig: Select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP) [1], which is in Linus’ master branch, but not in drm-tip. Note that I am using a 32-bit user space and a 64-bit Linux kernel. With commit fac54e2bfb5b reverted and your patch applied, the hang is gone.
I am adding the people involved in the other discussion to make them aware of this failure case.
Kind regards,
Paul
[1]: https://linux-regtracking.leemhuis.info/regzbot/mainline/
Hi Paul,
> -----Original Message-----
> From: dri-devel <dri-devel-bounces@lists.freedesktop.org> On Behalf Of Paul Menzel
> Sent: Thursday, March 31, 2022 4:22 PM
> To: Liu, Chuansheng <chuansheng.liu@intel.com>
> Cc: linux-fbdev@vger.kernel.org; Dave Hansen <dave.hansen@linux.intel.com>; akpm@linux-foundation.org; daniel@iogearbox.net; linux-mm@kvack.org; netdev@vger.kernel.org; deller@gmx.de; x86@kernel.org; ast@kernel.org; dri-devel@lists.freedesktop.org; andrii@kernel.org; Song Liu <song@kernel.org>; Ingo Molnar <mingo@redhat.com>; Thomas Gleixner <tglx@linutronix.de>; tzimmermann@suse.de; Borislav Petkov <bp@alien8.de>; bpf@vger.kernel.org; Edgecombe, Rick P <rick.p.edgecombe@intel.com>; kernel-team@fb.com
> Subject: Re: [PATCH] fbdev: defio: fix the pagelist corruption
>> My patch fixes another patch that is in drm-tip at least,
> The referenced commit 105a940416fc in the Fixes tag is also in Linus’ master branch.
>> so I assume applying my patch directly to Linus’ tree is not entirely proper. That is why I asked for your help retesting drm-tip.
> If there were such a relation, that would need to be documented in the commit message.
You should have seen it : )
>> By “everything works fine”, do you mean that the other issue you hit is also gone?
> No, I just mean the hang when applying your patch.
> Anyway, after figuring out that drm-tip is actually not behind Linus’ master branch, I tried to figure out the differences, and it turns out the hang is also related to commit fac54e2bfb5b (x86/Kconfig: Select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP) [1], which is in Linus’ master branch, but not in drm-tip. Note that I am using a 32-bit user space and a 64-bit Linux kernel. With commit fac54e2bfb5b reverted and your patch applied, the hang is gone.
Good to know that you have figured it out and that the issue you hit is not related to my patch : )
> I am adding the people involved in the other discussion to make them aware of this failure case.
> Kind regards,
> Paul