On Fri, Mar 13, 2020 at 02:04:46PM -0700, Matthew Wilcox wrote:
On Fri, Mar 13, 2020 at 04:55:50PM -0300, Jason Gunthorpe wrote:
On Thu, Mar 12, 2020 at 05:02:18PM +0000, Steven Price wrote:
On 12/03/2020 16:37, Jason Gunthorpe wrote:
On Thu, Mar 12, 2020 at 04:16:33PM +0000, Steven Price wrote:
Actually, while you are looking at this, do you think we should be adding at least READ_ONCE in the pagewalk.c walk_* functions? The multiple references of pmd, pud, etc without locking seems sketchy to me.
I agree it seems worrying. I'm not entirely sure whether the holding of mmap_sem is sufficient,
I looked at this question, and at least for PMD, mmap_sem is not sufficient. I didn't easily figure it out for the other ones.
I'm guessing if PMD is not safe then none of them are.
this isn't something that I changed so I've just been hoping that it's sufficient since it seems to have been working (whether that's by chance because the compiler didn't generate multiple reads I've no idea). For walking the kernel's page tables the lack of READ_ONCE is also not great, but at least for PTDUMP we don't care too much about accuracy and it should be crash proof because there's no RCU grace period. And again the code I was replacing didn't have any special protection.
I can't see any harm in updating the code to include READ_ONCE and I'm happy to review a patch.
The reason I ask is because hmm's walkers often have this pattern where they get the pointer and then de-ref it (again) then immediately have to recheck the 'again' conditions of the walker itself because the re-read may have given a different value.
Having the walker deref the pointer and pass the value into the ops for use, rather than repeatedly de-refing an unlocked value, seems like a much safer design to me.
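To make the proposed design concrete, here is a minimal userspace sketch, not the actual mm/pagewalk.c code. It assumes a hypothetical entry_t in place of pmd_t/pud_t, a hypothetical ENTRY_PRESENT bit, and a volatile load standing in for the kernel's READ_ONCE(). The walker loads each entry exactly once and hands that snapshot to the callback, so the callback never re-reads the unlocked slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for pmd_t/pud_t; not the kernel type. */
typedef uint64_t entry_t;

/* Hypothetical "present" bit, chosen only for this sketch. */
#define ENTRY_PRESENT 0x1u

/* Userspace approximation of READ_ONCE(): a single volatile load. */
static inline entry_t entry_read_once(const entry_t *p)
{
    return *(const volatile entry_t *)p;
}

/* The callback receives the snapshot value, never the table pointer. */
typedef int (*entry_fn)(entry_t val, unsigned long addr, void *priv);

/*
 * Walk n entries. Each slot is loaded once; all later decisions and the
 * callback use that one value, so a concurrent writer cannot make the
 * check and the use disagree.
 */
static int walk_entries(const entry_t *table, size_t n,
                        unsigned long start, unsigned long step,
                        entry_fn fn, void *priv)
{
    assert(fn != NULL);
    for (size_t i = 0; i < n; i++) {
        entry_t val = entry_read_once(&table[i]);

        if (!(val & ENTRY_PRESENT))
            continue;               /* skip holes and non-present entries */

        int ret = fn(val, start + i * step, priv);
        if (ret)
            return ret;             /* callback asked to stop */
    }
    return 0;
}

/* Example callback: count the present entries it is handed. */
static int count_present(entry_t val, unsigned long addr, void *priv)
{
    (void)val;
    (void)addr;
    ++*(unsigned *)priv;
    return 0;
}
```

With this shape, a callback has no pointer to de-ref again, so the "recheck after re-read" pattern goes away by construction.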
Yeah that sounds like a good idea.
I'm looking at this now.. The PUD is also changing under the read mmap_sem - and I was able to think up some race conditiony bugs related to this. Have some patches now..
However, I haven't been able to understand why walk_page_range() doesn't check pud_present() or pmd_present() before calling pmd_offset_map() or pte_offset_map().
As far as I can see a non-present entry has a swap entry encoded in it, and thus it seems like it is a bad idea to pass a non-present entry to the two map functions. I think those should only be called when the entry points to the next level in the page table (so there is something to map?)
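As a toy illustration of the concern (a userspace model, not kernel code): assume a hypothetical upper_entry_t where an all-zero value is a hole, a set UPPER_PRESENT bit means the entry points at a next-level table, and anything else is a non-present encoding such as a swap-style entry. The names and the bit layout are assumptions made only for this sketch. The idea is that the "map the next level" step is reached only for the first kind.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical upper-level entry; the bit layout is invented for this sketch. */
typedef uint64_t upper_entry_t;
#define UPPER_PRESENT 0x1u

enum upper_kind {
    UPPER_HOLE,        /* nothing here (analogous to a *_none() entry)      */
    UPPER_NONPRESENT,  /* non-zero but not present, e.g. swap-style payload */
    UPPER_DESCEND      /* present: points at a next-level table             */
};

/* Classify a snapshot of an entry; takes a value, not a pointer. */
static enum upper_kind classify_upper_entry(upper_entry_t val)
{
    if (val == 0)
        return UPPER_HOLE;
    if (!(val & UPPER_PRESENT))
        return UPPER_NONPRESENT;
    return UPPER_DESCEND;
}

/* Only a present entry has a next level that is meaningful to map. */
static int safe_to_map(upper_entry_t val)
{
    return classify_upper_entry(val) == UPPER_DESCEND;
}

/* Small self-check showing the intended use. */
static void upper_entry_selfcheck(void)
{
    assert(!safe_to_map(0));        /* hole */
    assert(!safe_to_map(0x4000));   /* non-present payload */
    assert(safe_to_map(0x1001));    /* present table pointer */
}
```

Whether the real walker needs an explicit check like this at every level is exactly the open question here; the sketch only shows what such a guard would look like.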
I see you added !present tests for the !vma case, but why only there?
Is this a bug? Do you know how it works?
Is it something that was missed when people added non-present PUDs and PMDs?
... I'm sorry, I did what now?
No, no, just widening to see if someone knows
As far as I can tell, you're talking about mm/pagewalk.c, and the only commit I have in that file is a00cc7d9dd93d66a3fb83fc52aa57a4bec51c517 ("mm, x86: add support for PUD-sized transparent hugepages"), which, as I think was pretty clear from the commit message, is basically copy-and-paste from the PMD code.
Right, which added the split_huge_pud() which seems maybe related to pud_present, or maybe not, I don't know.
I have no clue why most of the decisions in the MM were made.
Fun!
Jason