Hello,
Just took a look out of curiosity.
On Thu, May 12, 2022 at 02:25:57PM +0900, Byungchul Park wrote:
   PROCESS A           PROCESS B           WORKER C

   __do_sys_reboot()
                       __do_sys_reboot()
   mutex_lock(&system_transition_mutex)
   ...                 mutex_lock(&system_transition_mutex) <- stuck
                       ...
                                           request_firmware_work_func()
                                           _request_firmware()
                                           firmware_fallback_sysfs()
                                           usermodehelper_read_lock_wait()
                                           down_read(&umhelper_sem)
                                           ...
                                           fw_load_sysfs_fallback()
                                           fw_sysfs_wait_timeout()
                                           wait_for_completion_killable_timeout(&fw_st->completion) <- stuck
   kernel_halt()
   __usermodehelper_disable()
   down_write(&umhelper_sem) <- stuck
All three contexts are stuck at this point.
   PROCESS A           PROCESS B           WORKER C

   ...
   up_write(&umhelper_sem)
   ...
   mutex_unlock(&system_transition_mutex) <- cannot wake up B
                       ...
                       kernel_halt()
                       notifier_call_chain()
                       hw_shutdown_notify()
                       kill_pending_fw_fallback_reqs()
                       __fw_load_abort()
                       complete_all(&fw_st->completion) <- cannot wake up C
                                           ...
                                           usermodehelper_read_unlock()
                                           up_read(&umhelper_sem) <- cannot wake up A
I'm not sure I'm reading it correctly, but it looks like the "PROCESS B" column is superfluous given that B is waiting on the same lock to do the same thing A is already doing (besides, you can't really halt the machine twice). What it's reporting seems to be an ABBA deadlock between A waiting on umhelper_sem and C waiting on fw_st->completion. The report seems spurious:
1. wait_for_completion_killable_timeout() doesn't need anyone to wake it up to make forward progress because it unsticks itself once the timeout expires.
2. complete_all() from __fw_load_abort() isn't the only source of wakeup. The fw loader can be, and mainly should be, woken up by the firmware load actually completing rather than being aborted.
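To make points 1 and 2 concrete, here is a rough user-space sketch in C with pthreads. It is not kernel code: struct completion_sim, complete_all_sim(), wait_timeout_sim() and the demo functions are hypothetical stand-ins for struct completion, complete_all() and wait_for_completion_killable_timeout(), and loader_thread() stands in for a firmware load that simply succeeds. The first demo never completes the waiter yet still returns once the timeout expires; the second is woken by the success path without any abort.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for struct completion (not the kernel API). */
struct completion_sim {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void completion_sim_init(struct completion_sim *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

/* Wake every waiter; later waiters also return immediately. */
static void complete_all_sim(struct completion_sim *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/*
 * Rough analogue of wait_for_completion_killable_timeout(): returns true
 * if completed, false if the timeout expired first. Nobody has to call
 * complete_all_sim() for this to return.
 */
static bool wait_timeout_sim(struct completion_sim *c, long timeout_ms)
{
	struct timespec deadline;
	int rc = 0;
	bool done;

	clock_gettime(CLOCK_REALTIME, &deadline);  /* condvar default clock */
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec += 1;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&c->lock);
	while (!c->done && rc == 0)  /* loop absorbs spurious wakeups */
		rc = pthread_cond_timedwait(&c->cond, &c->lock, &deadline);
	done = c->done;
	pthread_mutex_unlock(&c->lock);
	return done;
}

/* Hypothetical loader whose firmware load simply succeeds after 20 ms. */
static void *loader_thread(void *arg)
{
	struct timespec delay = { .tv_sec = 0, .tv_nsec = 20000000L };

	nanosleep(&delay, NULL);
	complete_all_sim(arg);  /* success path, no abort involved */
	return NULL;
}

/* Point 1: no one ever completes, yet the waiter returns after 50 ms. */
static bool demo_nobody_wakes(void)
{
	struct completion_sim c;

	completion_sim_init(&c);
	return wait_timeout_sim(&c, 50);
}

/* Point 2: the normal "load finished" path wakes the waiter. */
static bool demo_load_completes(void)
{
	struct completion_sim c;
	pthread_t t;
	bool done;
	int rc;

	completion_sim_init(&c);
	rc = pthread_create(&t, NULL, loader_thread, &c);
	assert(rc == 0);
	done = wait_timeout_sim(&c, 5000);
	pthread_join(t, NULL);
	return done;
}
```

demo_nobody_wakes() returns false after roughly 50 ms, and demo_load_completes() returns true well before its 5 s timeout.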
I guess the reason B shows up there is that, between A and C alone, the operation order is such that complete_all() takes place before __usermodehelper_disable() (kernel_halt() runs the reboot notifier chain, and thus __fw_load_abort(), before it gets to __usermodehelper_disable()). So the whole thing kinda doesn't make sense: you can't block a past operation with a future one. Inserting process B introduces the reverse ordering.
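The A/C ordering argument can be sketched the same way. This is again a hypothetical user-space model, not the kernel code: a pthread rwlock stands in for umhelper_sem, fw_done for fw_st->completion, and c_has_read is extra scaffolding that makes C take the read lock first, as in the report. Process A does the abort (complete_all) before taking the write lock, mirroring the notifier-before-disable order in kernel_halt(). C's wait is deliberately untimed, so the run finishing shows that the A/C pair alone has no cycle.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct completion (not the kernel API). */
struct completion_sim {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

static void completion_sim_init(struct completion_sim *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = false;
}

static void complete_all_sim(struct completion_sim *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = true;
	pthread_cond_broadcast(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

/* Untimed wait: returns only after complete_all_sim() has run. */
static void wait_sim(struct completion_sim *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}

struct scenario {
	pthread_rwlock_t umhelper_sem;    /* models umhelper_sem */
	struct completion_sim fw_done;    /* models fw_st->completion */
	struct completion_sim c_has_read; /* scaffolding: C holds the read lock */
};

/* Worker C: waits for the firmware while holding the read side. */
static void *worker_c(void *arg)
{
	struct scenario *s = arg;

	pthread_rwlock_rdlock(&s->umhelper_sem);  /* down_read() */
	complete_all_sim(&s->c_has_read);
	wait_sim(&s->fw_done);                    /* untimed on purpose */
	pthread_rwlock_unlock(&s->umhelper_sem);  /* up_read() */
	return NULL;
}

/* Process A: abort pending loads first, then take the write side. */
static void *process_a(void *arg)
{
	struct scenario *s = arg;

	wait_sim(&s->c_has_read);
	complete_all_sim(&s->fw_done);            /* __fw_load_abort() */
	pthread_rwlock_wrlock(&s->umhelper_sem);  /* down_write() */
	pthread_rwlock_unlock(&s->umhelper_sem);
	return NULL;
}

/* Returns true once both threads have finished, i.e. nothing deadlocked. */
static bool run_scenario(void)
{
	struct scenario s;
	pthread_t a, c;
	int rc;

	pthread_rwlock_init(&s.umhelper_sem, NULL);
	completion_sim_init(&s.fw_done);
	completion_sim_init(&s.c_has_read);

	rc = pthread_create(&c, NULL, worker_c, &s);
	assert(rc == 0);
	rc = pthread_create(&a, NULL, process_a, &s);
	assert(rc == 0);

	pthread_join(a, NULL);
	pthread_join(c, NULL);
	pthread_rwlock_destroy(&s.umhelper_sem);
	return true;
}
```

run_scenario() returns true once both threads have exited. If the two steps in process_a() were swapped, A would block on the write lock while C waits forever for fw_done, which is exactly the cycle the report needs B to manufacture.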
Thanks.