On Tue, Nov 09, 2021 at 10:38:47PM +0100, Peter Zijlstra wrote:
On Tue, Nov 09, 2021 at 12:06:48PM -0800, Sultan Alsawaf wrote:
Hi,
I encountered a printk deadlock on 5.13 which appears to still affect the latest kernel. The deadlock occurs due to printk being used while having the current CPU's runqueue locked, and the underlying framebuffer console attempting to lock the same runqueue when printk tries to flush the log buffer.
Yes, that's a known 'feature' of some consoles. printk() is in the process of being reworked to not call con->write() from the printk() calling context, which would go a long way towards fixing this.
I'm a bit out of the loop but from lwn articles my understanding is that part of upstreaming from -rt we no longer have the explicit "I'm a safe console for direct printing" opt-in. Which I get from a backwards compat pov, but I still think for at least fbcon we really should never attempt a direct printk con->write, it's just all around terrible.
And it's getting worse by the year: - direct scanout displays (i.e. just a few mmio writes and it will show up) are on the way out at least in laptops, everyone gets self-refresh (dp psr) under software control, so without being able to kick a kthread off nothing shows up except more oopses
- because of the impendence mismatch between fbdev and drm-kms we even go ever more this direction for dumb framebuffers, including the firmware boot-up framebuffer simpledrm. This could perhaps be fixed with a new dedicate console driver directly on top of drm-kms, but that's on the wishlist for years and I don't see anyone typing that.
So yeah for fbcon at least I think we really should throw out direct con->write from printk completely.
Also adding John Ogness. -Daniel
#27 [ffffc900005b8e28] enqueue_task_fair at ffffffff8129774a <-- SCHED_WARN_ON(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list); #28 [ffffc900005b8ec0] activate_task at ffffffff8125625d #29 [ffffc900005b8ef0] ttwu_do_activate at ffffffff81257943 #30 [ffffc900005b8f28] sched_ttwu_pending at ffffffff8125c71f <-- locks this CPU's runqueue #31 [ffffc900005b8fa0] flush_smp_call_function_queue at ffffffff813c6833 #32 [ffffc900005b8fd8] generic_smp_call_function_single_interrupt at ffffffff813c7f58 #33 [ffffc900005b8fe0] __sysvec_call_function_single at ffffffff810f1456 #34 [ffffc900005b8ff0] sysvec_call_function_single at ffffffff831ec1bc --- <IRQ stack> --- #35 [ffffc9000019fda8] sysvec_call_function_single at ffffffff831ec1bc RIP: ffffffff831ed06e RSP: ffffed10438a6a49 RFLAGS: 00000001 RAX: ffff888100d832c0 RBX: 0000000000000000 RCX: 1ffff92000033fd7 RDX: 0000000000000000 RSI: ffff888100d832c0 RDI: ffffed10438a6a49 RBP: ffffffff831ec166 R8: dffffc0000000000 R9: 0000000000000000 R10: ffffffff83400e22 R11: 0000000000000000 R12: ffffffff831ed83e R13: 0000000000000000 R14: ffffc9000019fde8 R15: ffffffff814d4d9d ORIG_RAX: ffff88821c53524b CS: 0001 SS: ef073a2 WARNING: possibly bogus exception frame ----------------------->8-----------------------
The catalyst is that CONFIG_SCHED_DEBUG is enabled and the tmp_alone_branch assertion fails (Peter, is this bad?).
Yes, that's not good. IIRC Vincent and Michal were looking at that code recently.
I'm not sure what the *correct* solution is here (don't use printk while having a runqueue locked? don't use schedule_work() from the fbcon path? tell printk to use one of its lock-less backends?), so I've cc'd all the relevant folks.
I'm a firm believer in early_printk serial consoles.