On Mon 28-02-22 18:28:26, Byungchul Park wrote:
On Thu, Feb 24, 2022 at 11:22:39AM +0100, Jan Kara wrote:
On Thu 24-02-22 10:11:02, Byungchul Park wrote:
On Wed, Feb 23, 2022 at 03:48:59PM +0100, Jan Kara wrote:
KJOURNALD2(kthread) TASK1(ksys_write) TASK2(ksys_write)
wait A --- stuck wait B --- stuck wait C --- stuck
wake up B wake up C wake up A
where: A is a wait_queue, j_wait_commit B is a wait_queue, j_wait_transaction_locked C is a rwsem, mapping.invalidate_lock
I see. But a situation like this is not necessarily a guarantee of a deadlock, is it? I mean there can be task D that will eventually call say 'wake up B' and unblock everything and this is how things were designed to work? Multiple sources of wakeups are quite common I'd say... What does
Yes. At the very beginning when I desgined Dept, I was thinking whether to support multiple wakeup sources or not for a quite long time. Supporting it would be a better option to aovid non-critical reports. However, I thought anyway we'd better fix it - not urgent tho - if there's any single circle dependency. That's why I decided not to support it for now and wanted to gather the kernel guys' opinions. Thing is which policy we should go with.
I see. So supporting only a single wakeup source is fine for locks I guess. But for general wait queues or other synchronization mechanisms, I'm afraid it will lead to quite some false positive reports. Just my 2c.
Thank you for your feedback.
I realized we've been using "false positive" differently. There exist the three types of code in terms of dependency and deadlock. It's worth noting that dependencies are built from between waits and events in Dept.
case 1. Code with an actual circular dependency, but not deadlock.
A circular dependency can be broken by a rescue wakeup source e.g. timeout. It's not a deadlock. If it's okay that the contexts participating in the circular dependency and others waiting for the events in the circle are stuck until it gets broken. Otherwise, say, if it's not meant, then it's anyway problematic.
1-1. What if we judge this code is problematic? 1-2. What if we judge this code is good?
case 2. Code with an actual circular dependency, and deadlock.
There's no other wakeup source than those within the circular dependency. Literally deadlock. It's problematic and critical.
2-1. What if we judge this code is problematic? 2-2. What if we judge this code is good?
case 3. Code with no actual circular dependency, and not deadlock.
Must be good.
3-1. What if we judge this code is problematic? 3-2. What if we judge this code is good?
I call only 3-1 "false positive" circular dependency. And you call 1-1 and 3-1 "false positive" deadlock.
I've been wondering if the kernel guys esp. Linus considers code with any circular dependency is problematic or not, even if it won't lead to a deadlock, say, case 1. Even though I designed Dept based on what I believe is right, of course, I'm willing to change the design according to the majority opinion.
However, I would never allow case 1 if I were the owner of the kernel for better stability, even though the code works anyway okay for now.
So yes, I call a report for the situation "There is circular dependency but deadlock is not possible." a false positive. And that is because in my opinion your definition of circular dependency includes schemes that are useful and used in the kernel.
Your example in case 1 is kind of borderline (I personally would consider that bug as well) but there are other more valid schemes with multiple wakeup sources like:
We have a queue of work to do Q protected by lock L. Consumer process has code like:
while (1) { lock L prepare_to_wait(work_queued); if (no work) { unlock L sleep } else { unlock L do work wake_up(work_done) } }
AFAIU Dept will create dependency here that 'wakeup work_done' is after 'wait for work_queued'. Producer has code like:
while (1) { lock L prepare_to_wait(work_done) if (too much work queued) { unlock L sleep } else { queue work unlock L wake_up(work_queued) } }
And Dept will create dependency here that 'wakeup work_queued' is after 'wait for work_done'. And thus we have a trivial cycle in the dependencies despite the code being perfectly valid and safe.
Honza