On Thu, Sep 26, 2019 at 1:55 PM Thomas Hellström (VMware) <thomas_os@shipmail.org> wrote:
> Well, we're working on supporting huge puds and pmds in the graphics VMAs, although in the write-notify cases we're looking at here, we would probably want to split them down to PTE level.
Well, that's what the existing walker code does if you don't have that "pud_entry()" callback.
That said, I assume you would *not* want to do that if the huge pud/pmd is already clean and read-only, but just continue.
So you may want to have a special pud_entry() that handles that case. Eventually. Maybe. Although honestly, if you're doing dirty tracking, I doubt it makes much sense to use largepages.
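A rough userspace sketch of the check such a pud_entry() would make, purely as a model: `struct model_huge_entry`, `model_pud_entry()` and the two action values are hypothetical names, not kernel types or the real callback signature. It only shows the decision: leave a clean, read-only huge entry alone, and let everything else fall through to the normal split-to-PTE path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a huge pud/pmd entry; not a kernel type. */
struct model_huge_entry {
    bool dirty;     /* entry has been written through */
    bool writable;  /* entry currently allows writes */
};

enum model_walk_action {
    MODEL_WALK_SKIP,     /* nothing to track: keep the huge entry intact */
    MODEL_WALK_DESCEND,  /* let the walker split and visit the PTEs */
};

/*
 * Model of a special pud_entry() for dirty tracking: an entry that is
 * already clean and read-only needs no write-protect or cleaning, so the
 * walk can just continue past it.
 */
static enum model_walk_action model_pud_entry(const struct model_huge_entry *e)
{
    assert(e);
    if (!e->dirty && !e->writable)
        return MODEL_WALK_SKIP;
    return MODEL_WALK_DESCEND;
}
```

With this model, only an entry that is dirty or still writable pays the cost of being split.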
> Looking at zap_pud_range() which when called from unmap_mapping_pages() uses identical locking (no mmap_sem), it seems we should be able to get away with i_mmap_lock(), making sure the whole page table doesn't disappear under us. So it's not clear to me why the mmap_sem is strictly needed here. Better to sort those restrictions out now rather than when huge entries start appearing.
zap_pud_range() actually does have that
	VM_BUG_ON_VMA(!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
exactly for the case where it might have to split the pud entry.
Zapping the whole thing it does do without the assert.
I'm not going to swear the mmap_sem is absolutely required, since a shared vma should be stable due to the i_mmap_lock, but splitting the hugepage really is a fairly big deal.
It can't happen if you zap the *whole* mapping, but it can happen if you have a start/end range. Like you do.
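To make the whole-versus-partial distinction concrete, here is a small userspace model of that branch. It is a sketch, not the kernel code: `MODEL_HPAGE_PUD_SIZE` assumes a 1 GiB huge pud (as on x86-64 with 4 KiB pages), and the `model_*` names are hypothetical. `[addr, next)` stands for the part of the requested range that falls inside one huge pud.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 1 GiB huge pud, as on x86-64 with 4 KiB base pages. */
#define MODEL_HPAGE_PUD_SIZE (1UL << 30)

enum model_pud_zap_action {
    MODEL_PUD_ZAP_WHOLE,  /* range covers the entire huge pud: no split */
    MODEL_PUD_ZAP_SPLIT,  /* partial range: the huge pud must be split */
};

/* Decide what a zap of [addr, next) has to do with one huge pud. */
static enum model_pud_zap_action
model_pud_zap_action(unsigned long addr, unsigned long next)
{
    assert(addr < next);
    assert(next - addr <= MODEL_HPAGE_PUD_SIZE);
    if (next - addr != MODEL_HPAGE_PUD_SIZE)
        return MODEL_PUD_ZAP_SPLIT;
    return MODEL_PUD_ZAP_WHOLE;
}

/* Only the split path is the one guarded by the mmap_sem assertion. */
static bool model_needs_mmap_sem_assert(unsigned long addr, unsigned long next)
{
    return model_pud_zap_action(addr, next) == MODEL_PUD_ZAP_SPLIT;
}
```

A walker that is handed an arbitrary start/end range can land in the split case whenever either end is not aligned to the huge pud size, which is why the range-based callers are the ones that have to care.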
Also, in general it's probably not a great idea to look at zap_page_range() (and copy_page_range()) for ideas.
They are kind of special, since they tend to be used for fundamental whole-address-space operations (ie fork/exit) and so as a result they get to do special things that a normal page walker generally shouldn't do.
It's why they've never gotten translated to use the generic walker code.
Linus