From: Jason Gunthorpe jgg@mellanox.com
The hmm_range_fault() flow is fairly complicated. The scheme allows the caller to specify if it needs a usable result for each page, or if it only needs the current page table status filled in. This mixture of behavior is useful for a caller that wants to build a 'prefetch around fault' algorithm.
Although we have no in-tree users of this capability, I am working on having RDMA ODP work in this manner, and removing these bugs from hmm_range_fault() is a necessary step.
The basic principles are:
- If the caller did not ask for a VA to be valid then hmm_range_fault() should not fail because of that VA
- If 0 is returned from hmm_range_fault() then the entire pfns array contains valid data
- HMM_PFN_ERROR is set if faulting fails, or if asking for faulting would fail
- A 0 return from hmm_range_fault() does not have HMM_PFN_ERROR in any VA's the caller asked to be valid
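To make the prefetch-around-fault idea concrete, here is a rough caller-side sketch. This is not code from the series: the helper name is invented, and it assumes the flags[]/values[] style of hmm_range setup that drivers currently use, with default_flags left at 0 so only the per-page input flags request faulting.

#include <linux/hmm.h>

/*
 * Rough sketch only (invented helper): fault one page, snapshot the
 * pages around it.  Assumes default_flags == 0 and pfn_flags_mask set
 * so the per-page input flags decide what must be made valid.
 */
static long demo_prefetch_around_fault(struct hmm_range *range,
                                       unsigned long fault_addr)
{
        unsigned long npages = (range->end - range->start) >> PAGE_SHIFT;
        unsigned long i, fault_idx = (fault_addr - range->start) >> PAGE_SHIFT;
        long ret;

        range->default_flags = 0;       /* nothing is required by default */
        range->pfn_flags_mask = -1ULL;  /* honor the per-page input flags */
        for (i = 0; i < npages; i++)
                range->pfns[i] = 0;     /* snapshot only, must not fail */
        /* Only the faulting page is required to end up valid. */
        range->pfns[fault_idx] = range->flags[HMM_PFN_VALID];

        ret = hmm_range_fault(range, 0);
        if (ret < 0)
                return ret;
        /* Per the rules above, success fills the whole pfns array and the
         * requested page is never HMM_PFN_ERROR. */
        return ret;
}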
This series does not get there completely, I have a followup series closing some more complex cases.
I tested this series using the hmm tester Ralph posted a while back; other testing would be appreciated.
Jason Gunthorpe (8):
  mm/hmm: add missing unmaps of the ptep during hmm_vma_handle_pte()
  mm/hmm: don't free the cached pgmap while scanning
  mm/hmm: do not call hmm_vma_walk_hole() while holding a spinlock
  mm/hmm: add missing pfns set to hmm_vma_walk_pmd()
  mm/hmm: add missing call to hmm_range_need_fault() before returning EFAULT
  mm/hmm: reorganize how !pte_present is handled in hmm_vma_handle_pte()
  mm/hmm: return -EFAULT when setting HMM_PFN_ERROR on requested valid pages
  mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling
 mm/hmm.c | 166 ++++++++++++++++++++++++++-----------------------------
 1 file changed, 79 insertions(+), 87 deletions(-)
From: Jason Gunthorpe jgg@mellanox.com
Many of the direct error returns skipped doing the pte_unmap(). All non-zero exit paths must unmap the pte.
The pte_unmap() is split unnaturally like this because some of the error exit paths trigger a sleep and must release the lock before sleeping.
Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem")
Fixes: 53f5c3f489ec ("mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()")
Signed-off-by: Jason Gunthorpe jgg@mellanox.com
---
 mm/hmm.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 72e5a6d9a41756..35f85424176d14 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -325,6 +325,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
         }
 
         /* Report error for everything else */
+        pte_unmap(ptep);
         *pfn = range->values[HMM_PFN_ERROR];
         return -EFAULT;
     } else {
@@ -339,10 +340,13 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
     if (pte_devmap(pte)) {
         hmm_vma_walk->pgmap = get_dev_pagemap(pte_pfn(pte),
                           hmm_vma_walk->pgmap);
-        if (unlikely(!hmm_vma_walk->pgmap))
+        if (unlikely(!hmm_vma_walk->pgmap)) {
+            pte_unmap(ptep);
             return -EBUSY;
+        }
     } else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
         if (!is_zero_pfn(pte_pfn(pte))) {
+            pte_unmap(ptep);
             *pfn = range->values[HMM_PFN_SPECIAL];
             return -EFAULT;
         }
@@ -437,7 +441,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
         r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, &pfns[i]);
         if (r) {
-            /* hmm_vma_handle_pte() did unmap pte directory */
+            /* hmm_vma_handle_pte() did pte_unmap() */
             hmm_vma_walk->last = addr;
             return r;
         }
On 3/11/20 11:34 AM, Jason Gunthorpe wrote:
From: Jason Gunthorpe jgg@mellanox.com
Many of the direct error returns skipped doing the pte_unmap(). All non-zero exit paths must unmap the pte.
The pte_unmap() is split unnaturally like this because some of the error exit paths trigger a sleep and must release the lock before sleeping.
Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Fixes: 53f5c3f489ec ("mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()") Signed-off-by: Jason Gunthorpe jgg@mellanox.com
The changes look OK to me, but note one issue below. In any case, you can add:
Reviewed-by: Ralph Campbell rcampbell@nvidia.com
I think there is a case where hmm_vma_handle_pte() is called, a fault is requested, pte_unmap() and hmm_vma_walk_hole_() are called, the latter returns zero (the fault was handled OK), but now the page table is unmapped for successive loop iterations and pte_unmap(ptep - 1) is called at the loop end. Since this isn't an issue on x86_64 but is on x86_32, I can see how it could be missed.
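For readers unfamiliar with the 32-bit case: with CONFIG_HIGHPTE the pte page can live in highmem, so pte_offset_map() is a kmap_atomic() of that page and every exit path must pair it with exactly one pte_unmap(). A minimal illustration of the pattern the hmm walker uses follows; this is a sketch only, not mm/hmm.c, and it assumes addr < end on entry.

#include <linux/mm.h>

/* Illustration only: one map on entry, exactly one unmap on every exit. */
static int walk_one_pmd(pmd_t *pmdp, unsigned long addr, unsigned long end)
{
        pte_t *ptep = pte_offset_map(pmdp, addr);

        for (; addr < end; addr += PAGE_SIZE, ptep++) {
                if (pte_none(*ptep)) {
                        pte_unmap(ptep);        /* unmap before the early return */
                        return -EFAULT;
                }
        }
        pte_unmap(ptep - 1);                    /* balances the initial map */
        return 0;
}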
On Wed, Mar 11, 2020 at 06:28:30PM -0700, Ralph Campbell wrote:
I think there is a case where hmm_vma_handle_pte() is called, a fault is requested, pte_unmap() and hmm_vma_walk_hole_() are called, the latter returns zero (the fault was handled OK)
Not quite, hmm_vma_walk_hole_() never returns 0 if called with fault:
return (fault || write_fault) ? -EBUSY : 0;
And all the call sites of hmm_vma_walk_hole_() in hmm_vma_handle_pte() are structured as:
	if (fault || write_fault)
		goto fault;

fault:
	return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
So, it never returns 0.
I already made a patch making this less twisty while fixing something else:
https://github.com/jgunthorpe/linux/commit/078e10ca5919f2c263c245784fb5fe63d...
Thanks, Jason
From: Jason Gunthorpe jgg@mellanox.com
The pgmap is held in the hmm_vma_walk variable in hope of speeding up future get_dev_pagemap() calls by hitting the same pointer. The algorithm doesn't actually care about how long the pgmap is held for.
Move the put of the cached pgmap to after the walk is completed and delete all the other now redundant puts.
This solves a possible leak of the reference in hmm_vma_walk_pmd() if a hmm_vma_handle_pte() fails while looping.
Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem")
Signed-off-by: Jason Gunthorpe jgg@mellanox.com
---
 mm/hmm.c | 31 +++++++++----------------------
 1 file changed, 9 insertions(+), 22 deletions(-)
We talked about just deleting this stuff, but I think it makes a lot of sense for hmm_range_fault() to trigger a fault on devmap pages that are not compatible with the caller - so let's just fix the leak on the error path for now.
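For context, a minimal sketch of the caching pattern being kept here - the helper name is invented and the pfns are plain page frame numbers; the point is that one dev_pagemap reference is carried across the loop and dropped once at the end, which is what this patch moves up into hmm_range_fault().

#include <linux/memremap.h>

/* Sketch only (invented helper), error handling trimmed. */
static int walk_devmap_pfns(const unsigned long *pfns, unsigned long npages)
{
        struct dev_pagemap *pgmap = NULL;
        unsigned long i;

        for (i = 0; i < npages; i++) {
                /* Reuses the cached reference while the pgmap is unchanged. */
                pgmap = get_dev_pagemap(pfns[i], pgmap);
                if (!pgmap)
                        return -EBUSY;
        }
        /* A single put for the one reference held across the whole walk. */
        put_dev_pagemap(pgmap);
        return 0;
}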
diff --git a/mm/hmm.c b/mm/hmm.c
index 35f85424176d14..9e8f68eb83287a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -239,10 +239,6 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
         }
         pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags;
     }
-    if (hmm_vma_walk->pgmap) {
-        put_dev_pagemap(hmm_vma_walk->pgmap);
-        hmm_vma_walk->pgmap = NULL;
-    }
     hmm_vma_walk->last = end;
     return 0;
 }
@@ -360,10 +356,6 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
     return 0;
 
 fault:
-    if (hmm_vma_walk->pgmap) {
-        put_dev_pagemap(hmm_vma_walk->pgmap);
-        hmm_vma_walk->pgmap = NULL;
-    }
     pte_unmap(ptep);
     /* Fault any virtual address we were asked to fault */
     return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
@@ -446,16 +438,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
             return r;
         }
     }
-    if (hmm_vma_walk->pgmap) {
-        /*
-         * We do put_dev_pagemap() here and not in hmm_vma_handle_pte()
-         * so that we can leverage get_dev_pagemap() optimization which
-         * will not re-take a reference on a pgmap if we already have
-         * one.
-         */
-        put_dev_pagemap(hmm_vma_walk->pgmap);
-        hmm_vma_walk->pgmap = NULL;
-    }
     pte_unmap(ptep - 1);
 
     hmm_vma_walk->last = addr;
@@ -529,10 +511,6 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
             pfns[i] = hmm_device_entry_from_pfn(range, pfn) |
                   cpu_flags;
         }
-        if (hmm_vma_walk->pgmap) {
-            put_dev_pagemap(hmm_vma_walk->pgmap);
-            hmm_vma_walk->pgmap = NULL;
-        }
         hmm_vma_walk->last = end;
         goto out_unlock;
     }
@@ -694,6 +672,15 @@ long hmm_range_fault(struct hmm_range *range, unsigned int flags)
             return -EBUSY;
         ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
                       &hmm_walk_ops, &hmm_vma_walk);
+        /*
+         * A pgmap is kept cached in the hmm_vma_walk to avoid expensive
+         * searching in the probably common case that the pgmap is the
+         * same for the entire requested range.
+         */
+        if (hmm_vma_walk.pgmap) {
+            put_dev_pagemap(hmm_vma_walk.pgmap);
+            hmm_vma_walk.pgmap = NULL;
+        }
     } while (ret == -EBUSY);
 
     if (ret)
On 3/11/20 11:35 AM, Jason Gunthorpe wrote:
From: Jason Gunthorpe jgg@mellanox.com
The pgmap is held in the hmm_vma_walk variable in hope of speeding up future get_dev_pagemap() calls by hitting the same pointer. The algorithm doesn't actually care about how long the pgmap is held for.
Move the put of the cached pgmap to after the walk is completed and delete all the other now redundant puts.
This solves a possible leak of the reference in hmm_vma_walk_pmd() if a hmm_vma_handle_pte() fails while looping.
Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Signed-off-by: Jason Gunthorpe jgg@mellanox.com
Reviewed-by: Ralph Campbell rcampbell@nvidia.com
From: Jason Gunthorpe jgg@mellanox.com
This eventually calls into handle_mm_fault() which is a sleeping function. Release the lock first.
hmm_vma_walk_hole() does not touch the contents of the PUD, so it does not need the lock.
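A minimal sketch of the constraint being fixed - not the hmm code, the helper name is invented: anything that can reach handle_mm_fault() may sleep (page I/O, reclaim), so the PUD spinlock has to be dropped first, otherwise CONFIG_DEBUG_ATOMIC_SLEEP reports a "sleeping function called from invalid context" splat.

#include <linux/mm.h>
#include <linux/spinlock.h>

/* Sketch only: the caller holds ptl from pud_trans_huge_lock(). */
static vm_fault_t fault_one_page(struct vm_area_struct *vma,
                                 unsigned long addr, spinlock_t *ptl)
{
        spin_unlock(ptl);       /* never call a sleeping function under ptl */
        return handle_mm_fault(vma, addr,
                               FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE);
}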
Fixes: 3afc423632a1 ("mm: pagewalk: add p4d_entry() and pgd_entry()")
Cc: Steven Price steven.price@arm.com
Signed-off-by: Jason Gunthorpe jgg@mellanox.com
---
 mm/hmm.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 9e8f68eb83287a..32dcbfd3908315 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -473,8 +473,8 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 
     pud = READ_ONCE(*pudp);
     if (pud_none(pud)) {
-        ret = hmm_vma_walk_hole(start, end, -1, walk);
-        goto out_unlock;
+        spin_unlock(ptl);
+        return hmm_vma_walk_hole(start, end, -1, walk);
     }
 
     if (pud_huge(pud) && pud_devmap(pud)) {
@@ -483,8 +483,8 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
         bool fault, write_fault;
 
         if (!pud_present(pud)) {
-            ret = hmm_vma_walk_hole(start, end, -1, walk);
-            goto out_unlock;
+            spin_unlock(ptl);
+            return hmm_vma_walk_hole(start, end, -1, walk);
         }
 
         i = (addr - range->start) >> PAGE_SHIFT;
@@ -495,9 +495,9 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
         hmm_range_need_fault(hmm_vma_walk, pfns, npages,
                      cpu_flags, &fault, &write_fault);
         if (fault || write_fault) {
-            ret = hmm_vma_walk_hole_(addr, end, fault,
-                         write_fault, walk);
-            goto out_unlock;
+            spin_unlock(ptl);
+            return hmm_vma_walk_hole_(addr, end, fault, write_fault,
+                          walk);
         }
 
         pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
On 3/11/20 11:35 AM, Jason Gunthorpe wrote:
From: Jason Gunthorpe jgg@mellanox.com
This eventually calls into handle_mm_fault() which is a sleeping function. Release the lock first.
hmm_vma_walk_hole() does not touch the contents of the PUD, so it does not need the lock.
Fixes: 3afc423632a1 ("mm: pagewalk: add p4d_entry() and pgd_entry()") Cc: Steven Price steven.price@arm.com Signed-off-by: Jason Gunthorpe jgg@mellanox.com
Reviewed-by: Ralph Campbell rcampbell@nvidia.com
On 11/03/2020 18:35, Jason Gunthorpe wrote:
From: Jason Gunthorpe jgg@mellanox.com
This eventually calls into handle_mm_fault() which is a sleeping function. Release the lock first.
hmm_vma_walk_hole() does not touch the contents of the PUD, so it does not need the lock.
Fixes: 3afc423632a1 ("mm: pagewalk: add p4d_entry() and pgd_entry()") Cc: Steven Price steven.price@arm.com Signed-off-by: Jason Gunthorpe jgg@mellanox.com
Sorry about that, thanks for fixing.
Reviewed-by: Steven Price steven.price@arm.com
By refactoring to deal with the !pud_huge(pud) || !pud_devmap(pud) condition early it's possible to remove the 'ret' variable and remove a level of indentation from half the function making the code easier to read.
No functional change.
Signed-off-by: Steven Price steven.price@arm.com
---
Thanks to Jason's changes there were only two code paths left using the
out_unlock label so it seemed like a good opportunity to refactor.
---
 mm/hmm.c | 69 ++++++++++++++++++++++++++------------------------------
 1 file changed, 32 insertions(+), 37 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index ca33d086bdc1..0117c86426d1 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -466,8 +466,10 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
     struct hmm_range *range = hmm_vma_walk->range;
     unsigned long addr = start;
     pud_t pud;
-    int ret = 0;
     spinlock_t *ptl = pud_trans_huge_lock(pudp, walk->vma);
+    unsigned long i, npages, pfn;
+    uint64_t *pfns, cpu_flags;
+    bool fault, write_fault;
 
     if (!ptl)
         return 0;
@@ -481,50 +483,43 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
         return hmm_vma_walk_hole(start, end, -1, walk);
     }
 
-    if (pud_huge(pud) && pud_devmap(pud)) {
-        unsigned long i, npages, pfn;
-        uint64_t *pfns, cpu_flags;
-        bool fault, write_fault;
+    if (!pud_huge(pud) || !pud_devmap(pud)) {
+        /* Ask for the PUD to be split */
+        walk->action = ACTION_SUBTREE;
+        spin_unlock(ptl);
+        return 0;
+    }
 
-        if (!pud_present(pud)) {
-            spin_unlock(ptl);
-            return hmm_vma_walk_hole(start, end, -1, walk);
-        }
+    if (!pud_present(pud)) {
+        spin_unlock(ptl);
+        return hmm_vma_walk_hole(start, end, -1, walk);
+    }
 
-        i = (addr - range->start) >> PAGE_SHIFT;
-        npages = (end - addr) >> PAGE_SHIFT;
-        pfns = &range->pfns[i];
+    i = (addr - range->start) >> PAGE_SHIFT;
+    npages = (end - addr) >> PAGE_SHIFT;
+    pfns = &range->pfns[i];
 
-        cpu_flags = pud_to_hmm_pfn_flags(range, pud);
-        hmm_range_need_fault(hmm_vma_walk, pfns, npages,
-                     cpu_flags, &fault, &write_fault);
-        if (fault || write_fault) {
-            spin_unlock(ptl);
-            return hmm_vma_walk_hole_(addr, end, fault, write_fault,
-                          walk);
-        }
+    cpu_flags = pud_to_hmm_pfn_flags(range, pud);
+    hmm_range_need_fault(hmm_vma_walk, pfns, npages,
+                 cpu_flags, &fault, &write_fault);
+    if (fault || write_fault) {
+        spin_unlock(ptl);
+        return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
+    }
 
-        pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-        for (i = 0; i < npages; ++i, ++pfn) {
-            hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
-                          hmm_vma_walk->pgmap);
-            if (unlikely(!hmm_vma_walk->pgmap)) {
-                ret = -EBUSY;
-                goto out_unlock;
-            }
-            pfns[i] = hmm_device_entry_from_pfn(range, pfn) |
-                  cpu_flags;
+    pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
+    for (i = 0; i < npages; ++i, ++pfn) {
+        hmm_vma_walk->pgmap = get_dev_pagemap(pfn, hmm_vma_walk->pgmap);
+        if (unlikely(!hmm_vma_walk->pgmap)) {
+            spin_unlock(ptl);
+            return -EBUSY;
         }
-        hmm_vma_walk->last = end;
-        goto out_unlock;
+        pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags;
     }
+    hmm_vma_walk->last = end;
 
-    /* Ask for the PUD to be split */
-    walk->action = ACTION_SUBTREE;
-
-out_unlock:
     spin_unlock(ptl);
-    return ret;
+    return 0;
 }
 #else
 #define hmm_vma_walk_pud    NULL
On Thu, Mar 12, 2020 at 10:28:13AM +0000, Steven Price wrote:
By refactoring to deal with the !pud_huge(pud) || !pud_devmap(pud) condition early it's possible to remove the 'ret' variable and remove a level of indentation from half the function making the code easier to read.
No functional change.
Signed-off-by: Steven Price steven.price@arm.com
Thanks to Jason's changes there were only two code paths left using the out_unlock label so it seemed like a good opportunity to refactor.
Yes, I made something very similar, what do you think of this:
https://github.com/jgunthorpe/linux/commit/93f0ed42ab3f9ceb27b58fb7c7c3ecaf6...
From 93f0ed42ab3f9ceb27b58fb7c7c3ecaf60f16b36 Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe jgg@mellanox.com
Date: Wed, 4 Mar 2020 17:11:10 -0400
Subject: [PATCH] mm/hmm: rework hmm_vma_walk_pud()
At least since commit 3afc423632a1 ("mm: pagewalk: add p4d_entry() and pgd_entry()") this code has developed a number of strange control flows.
The purpose of the routine is to copy the pfns of a huge devmap PUD into the pfns output array, without splitting the PUD. Everything that is not a huge devmap PUD should go back to the walker for splitting.
Rework the logic to show this goal and remove redundant stuff:
- If pud_trans_huge_lock returns !NULL then this is already 'pud_trans_huge() || pud_devmap()' and 'pud_huge() || pud_devmap()' so some of the logic is redundant.
- Hitting pud_none() is a race, treat it as such and return back to the walker using ACTION_AGAIN
- !pud_present() gives 0 cpu_flags, so the extra checks are redundant
- Once the *pudp is read there is no need to continue holding the pud lock, so drop it. The only thing the following code cares about is the pfn from the devmap, and if there is racing then the notifiers will resolve everything. Perhaps the unlocked READ_ONCE in an earlier version was correct
Signed-off-by: Jason Gunthorpe jgg@mellanox.com
---
 mm/hmm.c | 79 +++++++++++++++++++++++---------------------------------
 1 file changed, 33 insertions(+), 46 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 8fec801a33c9e2..87a376659b5ad4 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -455,66 +455,53 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
 {
     struct hmm_vma_walk *hmm_vma_walk = walk->private;
     struct hmm_range *range = hmm_vma_walk->range;
-    unsigned long addr = start;
+    unsigned long i, npages, pfn;
+    unsigned int required_fault;
+    uint64_t cpu_flags;
+    uint64_t *pfns;
+    spinlock_t *ptl;
     pud_t pud;
-    int ret = 0;
-    spinlock_t *ptl = pud_trans_huge_lock(pudp, walk->vma);
 
+    /*
+     * This only handles huge devmap pages, the default return is
+     * ACTION_SUBTREE, so everything else is split by the walker and passed
+     * to the other routines.
+     */
+    ptl = pud_trans_huge_lock(pudp, walk->vma);
     if (!ptl)
         return 0;
+    pud = *pudp;
+    spin_unlock(ptl);
 
-    /* Normally we don't want to split the huge page */
-    walk->action = ACTION_CONTINUE;
-
-    pud = READ_ONCE(*pudp);
     if (pud_none(pud)) {
-        spin_unlock(ptl);
-        return hmm_vma_walk_hole(start, end, -1, walk);
+        walk->action = ACTION_AGAIN;
+        return 0;
     }
 
-    if (pud_huge(pud) && pud_devmap(pud)) {
-        unsigned long i, npages, pfn;
-        unsigned int required_flags;
-        uint64_t *pfns, cpu_flags;
-
-        if (!pud_present(pud)) {
-            spin_unlock(ptl);
-            return hmm_vma_walk_hole(start, end, -1, walk);
-        }
-
-        i = (addr - range->start) >> PAGE_SHIFT;
-        npages = (end - addr) >> PAGE_SHIFT;
-        pfns = &range->pfns[i];
+    if (!pud_devmap(pud))
+        return 0;
 
-        cpu_flags = pud_to_hmm_pfn_flags(range, pud);
+    pfns = &range->pfns[(start - range->start) >> PAGE_SHIFT];
+    cpu_flags = pud_to_hmm_pfn_flags(range, pud);
+    required_fault = hmm_range_need_fault(hmm_vma_walk, pfns, npages, cpu_flags);
-    if (required_flags) {
-        spin_unlock(ptl);
-        return hmm_vma_walk_hole_(addr, end, required_flags,
-                      walk);
-    }
+    if (required_fault)
+        return hmm_vma_walk_hole_(start, end, required_fault, walk);
 
-    pfn = pud_pfn(pud) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
-    for (i = 0; i < npages; ++i, ++pfn) {
-        hmm_vma_walk->pgmap = get_dev_pagemap(pfn,
-                      hmm_vma_walk->pgmap);
-        if (unlikely(!hmm_vma_walk->pgmap)) {
-            ret = -EBUSY;
-            goto out_unlock;
-        }
-        pfns[i] = hmm_device_entry_from_pfn(range, pfn) |
-              cpu_flags;
-    }
-    hmm_vma_walk->last = end;
-    goto out_unlock;
+    pfn = pud_pfn(pud) + ((start & ~PUD_MASK) >> PAGE_SHIFT);
+    npages = (end - start) >> PAGE_SHIFT;
+    for (i = 0; i < npages; ++i, ++pfn) {
+        hmm_vma_walk->pgmap = get_dev_pagemap(pfn, hmm_vma_walk->pgmap);
+        if (unlikely(!hmm_vma_walk->pgmap))
+            return -EBUSY;
+        pfns[i] = hmm_device_entry_from_pfn(range, pfn) | cpu_flags;
     }
 
-    /* Ask for the PUD to be split */
-    walk->action = ACTION_SUBTREE;
+    hmm_vma_walk->last = end;
 
-out_unlock:
-    spin_unlock(ptl);
-    return ret;
+    /* Do not split the pud */
+    walk->action = ACTION_CONTINUE;
+    return 0;
 }
 #else
 #define hmm_vma_walk_pud    NULL
On 12/03/2020 14:27, Jason Gunthorpe wrote:
On Thu, Mar 12, 2020 at 10:28:13AM +0000, Steven Price wrote:
By refactoring to deal with the !pud_huge(pud) || !pud_devmap(pud) condition early it's possible to remove the 'ret' variable and remove a level of indentation from half the function making the code easier to read.
No functional change.
Signed-off-by: Steven Price steven.price@arm.com
Thanks to Jason's changes there were only two code paths left using the out_unlock label so it seemed like a good opportunity to refactor.
Yes, I made something very similar, what do you think of this:
https://github.com/jgunthorpe/linux/commit/93f0ed42ab3f9ceb27b58fb7c7c3ecaf6...
Even better! Sorry I didn't realise you'd already done this. I just saw that the function was needlessly complicated after your fix, so I thought I'd do a drive-by cleanup since part of the mess was my fault! :)
Thanks,
Steve
On Thu, Mar 12, 2020 at 02:40:08PM +0000, Steven Price wrote:
On 12/03/2020 14:27, Jason Gunthorpe wrote:
On Thu, Mar 12, 2020 at 10:28:13AM +0000, Steven Price wrote:
By refactoring to deal with the !pud_huge(pud) || !pud_devmap(pud) condition early it's possible to remove the 'ret' variable and remove a level of indentation from half the function making the code easier to read.
No functional change.
Signed-off-by: Steven Price steven.price@arm.com Thanks to Jason's changes there were only two code paths left using the out_unlock label so it seemed like a good opportunity to refactor.
Yes, I made something very similar, what do you think of this:
https://github.com/jgunthorpe/linux/commit/93f0ed42ab3f9ceb27b58fb7c7c3ecaf6...
Even better! Sorry I didn't realise you'd already done this. I just saw that the function was needlessly complicated after your fix, so I thought I'd do a drive-by cleanup since part of the mess was my fault! :)
No worries, I've got a lot of patches for hmm_range_fault right now, just trying to organize them, test them and post them. Haven't posted that one yet.
Actually, while you are looking at this, do you think we should be adding at least READ_ONCE in the pagewalk.c walk_* functions? The multiple references of pmd, pud, etc without locking seems sketchy to me.
Jason
On 12/03/2020 15:11, Jason Gunthorpe wrote:
On Thu, Mar 12, 2020 at 02:40:08PM +0000, Steven Price wrote:
On 12/03/2020 14:27, Jason Gunthorpe wrote:
On Thu, Mar 12, 2020 at 10:28:13AM +0000, Steven Price wrote:
By refactoring to deal with the !pud_huge(pud) || !pud_devmap(pud) condition early it's possible to remove the 'ret' variable and remove a level of indentation from half the function making the code easier to read.
No functional change.
Signed-off-by: Steven Price steven.price@arm.com Thanks to Jason's changes there were only two code paths left using the out_unlock label so it seemed like a good opportunity to refactor.
Yes, I made something very similar, what do you think of this:
https://github.com/jgunthorpe/linux/commit/93f0ed42ab3f9ceb27b58fb7c7c3ecaf6...
Even better! Sorry I didn't realise you'd already done this. I just saw that the function was needlessly complicated after your fix, so I thought I'd do a drive-by cleanup since part of the mess was my fault! :)
No worries, I've got a lot of patches for hmm_range_fault right now, just trying to organize them, test them and post them. Haven't posted that one yet.
Actually, while you are looking at this, do you think we should be adding at least READ_ONCE in the pagewalk.c walk_* functions? The multiple references of pmd, pud, etc without locking seems sketchy to me.
I agree it seems worrying. I'm not entirely sure whether the holding of mmap_sem is sufficient, this isn't something that I changed so I've just been hoping that it's sufficient since it seems to have been working (whether that's by chance because the compiler didn't generate multiple reads I've no idea). For walking the kernel's page tables the lack of READ_ONCE is also not great, but at least for PTDUMP we don't care too much about accuracy and it should be crash proof because there's no RCU grace period. And again the code I was replacing didn't have any special protection.
I can't see any harm in updating the code to include READ_ONCE and I'm happy to review a patch.
Thanks,
Steve
On Thu, Mar 12, 2020 at 04:16:33PM +0000, Steven Price wrote:
Actually, while you are looking at this, do you think we should be adding at least READ_ONCE in the pagewalk.c walk_* functions? The multiple references of pmd, pud, etc without locking seems sketchy to me.
I agree it seems worrying. I'm not entirely sure whether the holding of mmap_sem is sufficient,
I looked at this question, and at least for PMD, mmap_sem is not sufficient. I didn't easily figure it out for the other ones.
I'm guessing if PMD is not safe then none of them are.
this isn't something that I changed so I've just been hoping that it's sufficient since it seems to have been working (whether that's by chance because the compiler didn't generate multiple reads I've no idea). For walking the kernel's page tables the lack of READ_ONCE is also not great, but at least for PTDUMP we don't care too much about accuracy and it should be crash proof because there's no RCU grace period. And again the code I was replacing didn't have any special protection.
I can't see any harm in updating the code to include READ_ONCE and I'm happy to review a patch.
The reason I ask is because hmm's walkers often have this pattern where they get the pointer and then de-ref it (again) then immediately have to recheck the 'again' conditions of the walker itself because the re-read may have given a different value.
Having the walker deref the pointer and pass the value it into the ops for use rather than repeatedly de-refing an unlocked value seems like a much safer design to me.
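A minimal sketch of that "sample once, pass the value down" idea - the walk_pmd_once() name and the pmd_op callback are invented for illustration and are not a pagewalk.c API:

#include <linux/mm.h>
#include <linux/pagewalk.h>

/* Sketch only: the op never re-reads *pmdp, it only sees the sampled value. */
static int walk_pmd_once(pmd_t *pmdp, unsigned long addr, unsigned long end,
                         struct mm_walk *walk,
                         int (*pmd_op)(pmd_t pmd, unsigned long addr,
                                       unsigned long end, struct mm_walk *walk))
{
        pmd_t pmd = READ_ONCE(*pmdp);   /* single unlocked read */

        if (pmd_none(pmd))
                return 0;               /* treat a racing clear as a hole */
        return pmd_op(pmd, addr, end, walk);
}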
If this also implicitly relies on a RCU grace period then it is also missing RCU locking...
I also didn't quite understand why walk_pte_range() skipped locking the pte in the no_vma case - I don't get why vma would be related to locking here.
I also saw that hmm open coded the pte walk, presumably for performance, so I was thinking of adding some kind of pte_range() callback to avoid the expensive indirect function call per pte, but hmm also can't have the pmd locked...
Jason
On 12/03/2020 16:37, Jason Gunthorpe wrote:
On Thu, Mar 12, 2020 at 04:16:33PM +0000, Steven Price wrote:
Actually, while you are looking at this, do you think we should be adding at least READ_ONCE in the pagewalk.c walk_* functions? The multiple references of pmd, pud, etc without locking seems sketchy to me.
I agree it seems worrying. I'm not entirely sure whether the holding of mmap_sem is sufficient,
I looked at this question, and at least for PMD, mmap_sem is not sufficient. I didn't easily figure it out for the other ones.
I'm guessing if PMD is not safe then none of them are.
this isn't something that I changed so I've just been hoping that it's sufficient since it seems to have been working (whether that's by chance because the compiler didn't generate multiple reads I've no idea). For walking the kernel's page tables the lack of READ_ONCE is also not great, but at least for PTDUMP we don't care too much about accuracy and it should be crash proof because there's no RCU grace period. And again the code I was replacing didn't have any special protection.
I can't see any harm in updating the code to include READ_ONCE and I'm happy to review a patch.
The reason I ask is because hmm's walkers often have this pattern where they get the pointer and then de-ref it (again) then immediately have to recheck the 'again' conditions of the walker itself because the re-read may have given a different value.
Having the walker deref the pointer and pass the value it into the ops for use rather than repeatedly de-refing an unlocked value seems like a much safer design to me.
Yeah that sounds like a good idea.
If this also implicitly relies on a RCU grace period then it is also missing RCU locking...
True - I'm not 100% sure in what situations a page table entry can be freed. Anshuman has added locking to deal with memory hotplug[1]. I believe this is sufficient.
[1] bf2b59f60ee1 ("arm64/mm: Hold memory hotplug lock while walking for kernel page table dump")
I also didn't quite understand why walk_pte_range() skipped locking the pte in the no_vma case - I don't get why vma would be related to locking here.
The no_vma case is for walking the kernel's page tables and they may have entries that are not backed by struct page, so there isn't (reliably) a PTE lock to take.
I also saw that hmm open coded the pte walk, presumably for performance, so I was thinking of adding some kind of pte_range() callback to avoid the expensive indirect function call per pte, but hmm also can't have the pmd locked...
Yeah the callback per PTE is a bit heavy because of the indirect function call. I'm not sure how to optimise it beyond open coding at the PMD level. One option would be to provide helper functions to make it a bit more generic.
Do you have an idea of what pte_range() would look like?
Steve
On Thu, Mar 12, 2020 at 05:02:18PM +0000, Steven Price wrote:
Having the walker deref the pointer and pass the value it into the ops for use rather than repeatedly de-refing an unlocked value seems like a much safer design to me.
Yeah that sounds like a good idea.
Ok.. let see when I get this hmm & odp stuff cleared off
I also didn't quite understand why walk_pte_range() skipped locking the pte in the no_vma case - I don't get why vma would be related to locking here.
The no_vma case is for walking the kernel's page tables and they may have entries that are not backed by struct page, so there isn't (reliably) a PTE lock to take.
Oh, that is an interesting bit of insight..
I also saw that hmm open coded the pte walk, presumably for performance, so I was thinking of adding some kind of pte_range() callback to avoid the expensive indirect function call per pte, but hmm also can't have the pmd locked...
Yeah the callback per PTE is a bit heavy because of the indirect function call. I'm not sure how to optimise it beyond open coding at the PMD level. One option would be to provide helper functions to make it a bit more generic.
Do you have an idea of what pte_range() would look like?
Basically just pass in the already mapped pte array just like is already done at the tail of the pmd
The reason to do it like this is so that the common code in the walker can correctly prove the pmd is pointing at a pte before trying to map it.
This is complicated, and hmm at least already got it wrong when trying to open code at the PMD level.
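A hypothetical shape for such a pte_range() op - all names here are invented and this is not an existing pagewalk.c interface; the point is that the walker proves the pmd points at a pte table, maps the pte page once, and makes a single indirect call for the whole range:

#include <linux/mm.h>
#include <linux/pagewalk.h>

/* Hypothetical op table; pte_range() does not exist in pagewalk.c today. */
struct pte_range_ops {
        int (*pte_range)(pte_t *ptep, unsigned long addr, unsigned long end,
                         struct mm_walk *walk);
};

static int walk_pte_range_once(pmd_t *pmdp, unsigned long addr,
                               unsigned long end, struct mm_walk *walk,
                               const struct pte_range_ops *ops)
{
        pmd_t pmd = READ_ONCE(*pmdp);
        pte_t *ptep;
        int ret;

        /* Simplified check that a pte table actually exists below this pmd. */
        if (pmd_none(pmd) || !pmd_present(pmd) || pmd_trans_huge(pmd))
                return 0;

        ptep = pte_offset_map(pmdp, addr);
        ret = ops->pte_range(ptep, addr, end, walk);    /* one call per range */
        pte_unmap(ptep);
        return ret;
}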
Jason
On Thu, Mar 12, 2020 at 05:02:18PM +0000, Steven Price wrote:
On 12/03/2020 16:37, Jason Gunthorpe wrote:
On Thu, Mar 12, 2020 at 04:16:33PM +0000, Steven Price wrote:
Actually, while you are looking at this, do you think we should be adding at least READ_ONCE in the pagewalk.c walk_* functions? The multiple references of pmd, pud, etc without locking seems sketchy to me.
I agree it seems worrying. I'm not entirely sure whether the holding of mmap_sem is sufficient,
I looked at this question, and at least for PMD, mmap_sem is not sufficient. I didn't easily figure it out for the other ones.
I'm guessing if PMD is not safe then none of them are.
this isn't something that I changed so I've just been hoping that it's sufficient since it seems to have been working (whether that's by chance because the compiler didn't generate multiple reads I've no idea). For walking the kernel's page tables the lack of READ_ONCE is also not great, but at least for PTDUMP we don't care too much about accuracy and it should be crash proof because there's no RCU grace period. And again the code I was replacing didn't have any special protection.
I can't see any harm in updating the code to include READ_ONCE and I'm happy to review a patch.
The reason I ask is because hmm's walkers often have this pattern where they get the pointer and then de-ref it (again) then immediately have to recheck the 'again' conditions of the walker itself because the re-read may have given a different value.
Having the walker deref the pointer and pass the value it into the ops for use rather than repeatedly de-refing an unlocked value seems like a much safer design to me.
Yeah that sounds like a good idea.
I'm looking at this now.. The PUD is also changing under the read mmap_sem - and I was able to think up some race conditiony bugs related to this. Have some patches now..
However, I haven't been able to understand why walk_page_range() doesn't check pud_present() or pmd_present() before calling pmd_offset_map() or pte_offset_map().
As far as I can see a non-present entry has a swap entry encoded in it, and thus it seems like it is a bad idea to pass a non-present entry to the two map functions. I think those should only be called when the entry points to the next level in the page table (so there is something to map?)
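Put as code, the check being asked about would look roughly like this (simplified sketch, invented helper name):

#include <linux/mm.h>

/*
 * Only a present, non-huge pmd has a pte table underneath it that
 * pte_offset_map() could legitimately map; a non-present pmd encodes a
 * swap/migration entry rather than the pfn of a table.
 */
static bool pmd_points_to_pte_table(pmd_t pmd)
{
        if (pmd_none(pmd) || !pmd_present(pmd))
                return false;
        if (pmd_trans_huge(pmd) || pmd_devmap(pmd))
                return false;
        return true;
}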
I see you added !present tests for the !vma case, but why only there?
Is this a bug? Do you know how it works?
Is it something that was missed when people added non-present PUD and PMD's?
Thanks, Jason
On Fri, Mar 13, 2020 at 04:55:50PM -0300, Jason Gunthorpe wrote:
On Thu, Mar 12, 2020 at 05:02:18PM +0000, Steven Price wrote:
On 12/03/2020 16:37, Jason Gunthorpe wrote:
On Thu, Mar 12, 2020 at 04:16:33PM +0000, Steven Price wrote:
Actually, while you are looking at this, do you think we should be adding at least READ_ONCE in the pagewalk.c walk_* functions? The multiple references of pmd, pud, etc without locking seems sketchy to me.
I agree it seems worrying. I'm not entirely sure whether the holding of mmap_sem is sufficient,
I looked at this question, and at least for PMD, mmap_sem is not sufficient. I didn't easily figure it out for the other ones.
I'm guessing if PMD is not safe then none of them are.
this isn't something that I changed so I've just been hoping that it's sufficient since it seems to have been working (whether that's by chance because the compiler didn't generate multiple reads I've no idea). For walking the kernel's page tables the lack of READ_ONCE is also not great, but at least for PTDUMP we don't care too much about accuracy and it should be crash proof because there's no RCU grace period. And again the code I was replacing didn't have any special protection.
I can't see any harm in updating the code to include READ_ONCE and I'm happy to review a patch.
The reason I ask is because hmm's walkers often have this pattern where they get the pointer and then de-ref it (again) then immediately have to recheck the 'again' conditions of the walker itself because the re-read may have given a different value.
Having the walker deref the pointer and pass the value it into the ops for use rather than repeatedly de-refing an unlocked value seems like a much safer design to me.
Yeah that sounds like a good idea.
I'm looking at this now.. The PUD is also changing under the read mmap_sem - and I was able to think up some race conditiony bugs related to this. Have some patches now..
However, I haven't been able to understand why walk_page_range() doesn't check pud_present() or pmd_present() before calling pmd_offset_map() or pte_offset_map().
As far as I can see a non-present entry has a swap entry encoded in it, and thus it seems like it is a bad idea to pass a non-present entry to the two map functions. I think those should only be called when the entry points to the next level in the page table (so there is something to map?)
I see you added !present tests for the !vma case, but why only there?
Is this a bug? Do you know how it works?
Is it something that was missed when people added non-present PUD and PMD's?
... I'm sorry, I did what now? As far as I can tell, you're talking about mm/pagewalk.c, and the only commit I have in that file is a00cc7d9dd93d66a3fb83fc52aa57a4bec51c517 ("mm, x86: add support for PUD-sized transparent hugepages", which I think I was pretty clear from the commit message is basically copy-and-paste from the PMD code. I have no clue why most of the decisions in the MM were made.
On Fri, Mar 13, 2020 at 02:04:46PM -0700, Matthew Wilcox wrote:
On Fri, Mar 13, 2020 at 04:55:50PM -0300, Jason Gunthorpe wrote:
On Thu, Mar 12, 2020 at 05:02:18PM +0000, Steven Price wrote:
On 12/03/2020 16:37, Jason Gunthorpe wrote:
On Thu, Mar 12, 2020 at 04:16:33PM +0000, Steven Price wrote:
Actually, while you are looking at this, do you think we should be adding at least READ_ONCE in the pagewalk.c walk_* functions? The multiple references of pmd, pud, etc without locking seems sketchy to me.
I agree it seems worrying. I'm not entirely sure whether the holding of mmap_sem is sufficient,
I looked at this question, and at least for PMD, mmap_sem is not sufficient. I didn't easily figure it out for the other ones.
I'm guessing if PMD is not safe then none of them are.
this isn't something that I changed so I've just been hoping that it's sufficient since it seems to have been working (whether that's by chance because the compiler didn't generate multiple reads I've no idea). For walking the kernel's page tables the lack of READ_ONCE is also not great, but at least for PTDUMP we don't care too much about accuracy and it should be crash proof because there's no RCU grace period. And again the code I was replacing didn't have any special protection.
I can't see any harm in updating the code to include READ_ONCE and I'm happy to review a patch.
The reason I ask is because hmm's walkers often have this pattern where they get the pointer and then de-ref it (again) then immediately have to recheck the 'again' conditions of the walker itself because the re-read may have given a different value.
Having the walker deref the pointer and pass the value it into the ops for use rather than repeatedly de-refing an unlocked value seems like a much safer design to me.
Yeah that sounds like a good idea.
I'm looking at this now.. The PUD is also changing under the read mmap_sem - and I was able to think up some race conditiony bugs related to this. Have some patches now..
However, I haven't been able to understand why walk_page_range() doesn't check pud_present() or pmd_present() before calling pmd_offset_map() or pte_offset_map().
As far as I can see a non-present entry has a swap entry encoded in it, and thus it seems like it is a bad idea to pass a non-present entry to the two map functions. I think those should only be called when the entry points to the next level in the page table (so there is something to map?)
I see you added !present tests for the !vma case, but why only there?
Is this a bug? Do you know how it works?
Is it something that was missed when people added non-present PUD and PMD's?
... I'm sorry, I did what now?
No, no, just widening to see if someone knows
As far as I can tell, you're talking about mm/pagewalk.c, and the only commit I have in that file is a00cc7d9dd93d66a3fb83fc52aa57a4bec51c517 ("mm, x86: add support for PUD-sized transparent hugepages", which I think I was pretty clear from the commit message is basically copy-and-paste from the PMD code.
Right, which added the split_huge_pud() which seems maybe related to pud_present, or maybe not, I don't know.
I have no clue why most of the decisions in the MM were made.
Fun!
Jason
From: Jason Gunthorpe jgg@mellanox.com
All success exit paths from the walker functions must set the pfns array.
A migration entry with no required fault is a HMM_PFN_NONE return, just like the pte case.
Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd")
Signed-off-by: Jason Gunthorpe jgg@mellanox.com
---
 mm/hmm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 32dcbfd3908315..5f5ccf13dd1e85 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -394,7 +394,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
             pmd_migration_entry_wait(walk->mm, pmdp);
             return -EBUSY;
         }
-        return 0;
+        return hmm_pfns_fill(start, end, range, HMM_PFN_NONE);
     } else if (!pmd_present(pmd))
         return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
On 3/11/20 11:35 AM, Jason Gunthorpe wrote:
From: Jason Gunthorpe jgg@mellanox.com
All success exit paths from the walker functions must set the pfns array.
A migration entry with no required fault is a HMM_PFN_NONE return, just like the pte case.
Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd") Signed-off-by: Jason Gunthorpe jgg@mellanox.com
Reviewed-by: Ralph Campbell rcampbell@nvidia.com
From: Jason Gunthorpe jgg@mellanox.com
All return paths that do EFAULT must call hmm_range_need_fault() to determine if the user requires this page to be valid.
If the page cannot be made valid if the user later requires it, due to vma flags in this case, then the return should be HMM_PFN_ERROR.
Fixes: a3e0d41c2b1f ("mm/hmm: improve driver API to work and wait over a range")
Signed-off-by: Jason Gunthorpe jgg@mellanox.com
---
 mm/hmm.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 5f5ccf13dd1e85..e10cd0adba7b37 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -582,18 +582,15 @@ static int hmm_vma_walk_test(unsigned long start, unsigned long end,
     struct vm_area_struct *vma = walk->vma;
 
     /*
-     * Skip vma ranges that don't have struct page backing them or
-     * map I/O devices directly.
-     */
-    if (vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP))
-        return -EFAULT;
-
-    /*
+     * Skip vma ranges that don't have struct page backing them or map I/O
+     * devices directly.
+     *
      * If the vma does not allow read access, then assume that it does not
-     * allow write access either. HMM does not support architectures
-     * that allow write without read.
+     * allow write access either. HMM does not support architectures that
+     * allow write without read.
      */
-    if (!(vma->vm_flags & VM_READ)) {
+    if ((vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP)) ||
+        !(vma->vm_flags & VM_READ)) {
         bool fault, write_fault;
 
         /*
@@ -607,7 +604,7 @@ static int hmm_vma_walk_test(unsigned long start, unsigned long end,
         if (fault || write_fault)
             return -EFAULT;
 
-        hmm_pfns_fill(start, end, range, HMM_PFN_NONE);
+        hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
         hmm_vma_walk->last = end;
 
         /* Skip this vma and continue processing the next vma. */
On 3/11/20 11:35 AM, Jason Gunthorpe wrote:
From: Jason Gunthorpe jgg@mellanox.com
All return paths that do EFAULT must call hmm_range_need_fault() to determine if the user requires this page to be valid.
If the page cannot be made valid if the user later requires it, due to vma flags in this case, then the return should be HMM_PFN_ERROR.
Fixes: a3e0d41c2b1f ("mm/hmm: improve driver API to work and wait over a range") Signed-off-by: Jason Gunthorpe jgg@mellanox.com
Reviewed-by: Ralph Campbell rcampbell@nvidia.com
From: Jason Gunthorpe jgg@mellanox.com
The intention with this code is to determine if the caller required the pages to be valid, and if so, then take some action to make them valid. The action varies depending on the page type.
In all cases, if the caller doesn't ask for the page, then hmm_range_fault() should not return an error.
Revise the implementation to be clearer, and fix some bugs:
- hmm_pte_need_fault() must always be called before testing fault or write_fault otherwise the defaults of false apply and the if()'s don't work. This was missed on the is_migration_entry() branch
- -EFAULT should not be returned unless hmm_pte_need_fault() indicates fault is required - ie snapshotting should not fail.
- For !pte_present() the cpu_flags are always 0, except in the special case of is_device_private_entry(), calling pte_to_hmm_pfn_flags() is confusing.
Reorganize the flow so that it always follows the pattern of calling hmm_pte_need_fault() and then checking fault || write_fault.
Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis")
Signed-off-by: Jason Gunthorpe jgg@mellanox.com
---
 mm/hmm.c | 35 +++++++++++++++--------------------
 1 file changed, 15 insertions(+), 20 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index e10cd0adba7b37..bf676cfef3e8ee 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -282,15 +282,6 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
     if (!pte_present(pte)) {
         swp_entry_t entry = pte_to_swp_entry(pte);
 
-        if (!non_swap_entry(entry)) {
-            cpu_flags = pte_to_hmm_pfn_flags(range, pte);
-            hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
-                       &fault, &write_fault);
-            if (fault || write_fault)
-                goto fault;
-            return 0;
-        }
-
         /*
          * This is a special swap entry, ignore migration, use
          * device and report anything else as error.
@@ -310,26 +301,30 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
             return 0;
         }
 
-        if (is_migration_entry(entry)) {
-            if (fault || write_fault) {
-                pte_unmap(ptep);
-                hmm_vma_walk->last = addr;
-                migration_entry_wait(walk->mm, pmdp, addr);
-                return -EBUSY;
-            }
+        hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0, &fault,
+                   &write_fault);
+        if (!fault && !write_fault)
             return 0;
+
+        if (!non_swap_entry(entry))
+            goto fault;
+
+        if (is_migration_entry(entry)) {
+            pte_unmap(ptep);
+            hmm_vma_walk->last = addr;
+            migration_entry_wait(walk->mm, pmdp, addr);
+            return -EBUSY;
         }
 
         /* Report error for everything else */
         pte_unmap(ptep);
         *pfn = range->values[HMM_PFN_ERROR];
         return -EFAULT;
-    } else {
-        cpu_flags = pte_to_hmm_pfn_flags(range, pte);
-        hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
-                   &fault, &write_fault);
     }
 
+    cpu_flags = pte_to_hmm_pfn_flags(range, pte);
+    hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags, &fault,
+               &write_fault);
     if (fault || write_fault)
         goto fault;
On 3/11/20 11:35 AM, Jason Gunthorpe wrote:
From: Jason Gunthorpe jgg@mellanox.com
The intention with this code is to determine if the caller required the pages to be valid, and if so, then take some action to make them valid. The action varies depending on the page type.
In all cases, if the caller doesn't ask for the page, then hmm_range_fault() should not return an error.
Revise the implementation to be clearer, and fix some bugs:
hmm_pte_need_fault() must always be called before testing fault or write_fault otherwise the defaults of false apply and the if()'s don't work. This was missed on the is_migration_entry() branch
-EFAULT should not be returned unless hmm_pte_need_fault() indicates fault is required - ie snapshotting should not fail.
For !pte_present() the cpu_flags are always 0, except in the special case of is_device_private_entry(), calling pte_to_hmm_pfn_flags() is confusing.
Reorganize the flow so that it always follows the pattern of calling hmm_pte_need_fault() and then checking fault || write_fault.
Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Signed-off-by: Jason Gunthorpe jgg@mellanox.com
Reviewed-by: Ralph Campbell rcampbell@nvidia.com
From: Jason Gunthorpe jgg@mellanox.com
hmm_range_fault() should never return 0 if the caller requested a valid page, but the pfns output for that page would be HMM_PFN_ERROR.
hmm_pte_need_fault() must always be called before setting HMM_PFN_ERROR to detect if the page is in faulting mode or not.
Fix two cases in hmm_vma_walk_pmd() and reorganize some of the duplicated code.
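For context, the range-wide helper used by hmm_vma_walk_pmd() is essentially a loop over the per-PTE decision; roughly (again a paraphrased sketch of mm/hmm.c, not the exact source):

static void hmm_range_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
				 const uint64_t *pfns, unsigned long npages,
				 uint64_t cpu_flags, bool *fault,
				 bool *write_fault)
{
	unsigned long i;

	*fault = *write_fault = false;

	/* Pure snapshot mode never asks for faulting */
	if (hmm_vma_walk->flags & HMM_FAULT_SNAPSHOT)
		return;

	/* Fault the whole span if any one page in it needs it */
	for (i = 0; i < npages; ++i) {
		hmm_pte_need_fault(hmm_vma_walk, pfns[i], cpu_flags,
				   fault, write_fault);
		if (*write_fault)
			return;
	}
}

Both of the branches fixed below (!pmd_present() and pmd_bad()) now run this over the span before choosing between filling HMM_PFN_ERROR and returning -EFAULT.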
Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd") Fixes: da4c3c735ea4 ("mm/hmm/mirror: helper to snapshot CPU page table") Signed-off-by: Jason Gunthorpe jgg@mellanox.com --- mm/hmm.c | 38 +++++++++++++++++++++----------------- 1 file changed, 21 insertions(+), 17 deletions(-)
diff --git a/mm/hmm.c b/mm/hmm.c index bf676cfef3e8ee..f61fddf2ef6505 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -363,8 +363,10 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, { struct hmm_vma_walk *hmm_vma_walk = walk->private; struct hmm_range *range = hmm_vma_walk->range; - uint64_t *pfns = range->pfns; - unsigned long addr = start, i; + uint64_t *pfns = &range->pfns[(start - range->start) >> PAGE_SHIFT]; + unsigned long npages = (end - start) >> PAGE_SHIFT; + unsigned long addr = start; + bool fault, write_fault; pte_t *ptep; pmd_t pmd;
@@ -374,14 +376,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return hmm_vma_walk_hole(start, end, -1, walk);
if (thp_migration_supported() && is_pmd_migration_entry(pmd)) { - bool fault, write_fault; - unsigned long npages; - uint64_t *pfns; - - i = (addr - range->start) >> PAGE_SHIFT; - npages = (end - addr) >> PAGE_SHIFT; - pfns = &range->pfns[i]; - hmm_range_need_fault(hmm_vma_walk, pfns, npages, 0, &fault, &write_fault); if (fault || write_fault) { @@ -390,8 +384,15 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return -EBUSY; } return hmm_pfns_fill(start, end, range, HMM_PFN_NONE); - } else if (!pmd_present(pmd)) + } + + if (!pmd_present(pmd)) { + hmm_range_need_fault(hmm_vma_walk, pfns, npages, 0, &fault, + &write_fault); + if (fault || write_fault) + return -EFAULT; return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); + }
if (pmd_devmap(pmd) || pmd_trans_huge(pmd)) { /* @@ -408,8 +409,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd)) goto again;
- i = (addr - range->start) >> PAGE_SHIFT; - return hmm_vma_handle_pmd(walk, addr, end, &pfns[i], pmd); + return hmm_vma_handle_pmd(walk, addr, end, pfns, pmd); }
/* @@ -418,15 +418,19 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, * entry pointing to pte directory or it is a bad pmd that will not * recover. */ - if (pmd_bad(pmd)) + if (pmd_bad(pmd)) { + hmm_range_need_fault(hmm_vma_walk, pfns, npages, 0, &fault, + &write_fault); + if (fault || write_fault) + return -EFAULT; return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR); + }
ptep = pte_offset_map(pmdp, addr); - i = (addr - range->start) >> PAGE_SHIFT; - for (; addr < end; addr += PAGE_SIZE, ptep++, i++) { + for (; addr < end; addr += PAGE_SIZE, ptep++, pfns++) { int r;
- r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, &pfns[i]); + r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, pfns); if (r) { /* hmm_vma_handle_pte() did pte_unmap() */ hmm_vma_walk->last = addr;
On 3/11/20 11:35 AM, Jason Gunthorpe wrote:
From: Jason Gunthorpe jgg@mellanox.com
hmm_range_fault() should never return 0 if the caller requested a valid page, but the pfns output for that page would be HMM_PFN_ERROR.
hmm_pte_need_fault() must always be called before setting HMM_PFN_ERROR to detect if the page is in faulting mode or not.
Fix two cases in hmm_vma_walk_pmd() and reorganize some of the duplicated code.
Fixes: d08faca018c4 ("mm/hmm: properly handle migration pmd") Fixes: da4c3c735ea4 ("mm/hmm/mirror: helper to snapshot CPU page table") Signed-off-by: Jason Gunthorpe jgg@mellanox.com
Reviewed-by: Ralph Campbell rcampbell@nvidia.com
mm/hmm.c | 38 +++++++++++++++++++++----------------- 1 file changed, 21 insertions(+), 17 deletions(-)
diff --git a/mm/hmm.c b/mm/hmm.c index bf676cfef3e8ee..f61fddf2ef6505 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -363,8 +363,10 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, { struct hmm_vma_walk *hmm_vma_walk = walk->private; struct hmm_range *range = hmm_vma_walk->range;
- uint64_t *pfns = range->pfns;
- unsigned long addr = start, i;
- uint64_t *pfns = &range->pfns[(start - range->start) >> PAGE_SHIFT];
- unsigned long npages = (end - start) >> PAGE_SHIFT;
- unsigned long addr = start;
- bool fault, write_fault; pte_t *ptep; pmd_t pmd;
@@ -374,14 +376,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return hmm_vma_walk_hole(start, end, -1, walk);
if (thp_migration_supported() && is_pmd_migration_entry(pmd)) {
bool fault, write_fault;
unsigned long npages;
uint64_t *pfns;
i = (addr - range->start) >> PAGE_SHIFT;
npages = (end - addr) >> PAGE_SHIFT;
pfns = &range->pfns[i];
- hmm_range_need_fault(hmm_vma_walk, pfns, npages, 0, &fault, &write_fault); if (fault || write_fault) {
@@ -390,8 +384,15 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return -EBUSY; } return hmm_pfns_fill(start, end, range, HMM_PFN_NONE);
- } else if (!pmd_present(pmd))
+ }
+
+ if (!pmd_present(pmd)) {
+ hmm_range_need_fault(hmm_vma_walk, pfns, npages, 0, &fault,
+ &write_fault);
+ if (fault || write_fault)
+ return -EFAULT;
 return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
Shouldn't this fill with HMM_PFN_NONE instead of HMM_PFN_ERROR? Otherwise, when a THP is swapped out, you will get a different value than if a PTE is swapped out and you are prefetching/snapshotting.
}
if (pmd_devmap(pmd) || pmd_trans_huge(pmd)) { /*
@@ -408,8 +409,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd)) goto again;
i = (addr - range->start) >> PAGE_SHIFT;
return hmm_vma_handle_pmd(walk, addr, end, &pfns[i], pmd);
return hmm_vma_handle_pmd(walk, addr, end, pfns, pmd);
}
/*
@@ -418,15 +418,19 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, * entry pointing to pte directory or it is a bad pmd that will not * recover. */
- if (pmd_bad(pmd))
if (pmd_bad(pmd)) {
hmm_range_need_fault(hmm_vma_walk, pfns, npages, 0, &fault,
&write_fault);
if (fault || write_fault)
return -EFAULT;
return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
}
ptep = pte_offset_map(pmdp, addr);
- i = (addr - range->start) >> PAGE_SHIFT;
- for (; addr < end; addr += PAGE_SIZE, ptep++, i++) {
- for (; addr < end; addr += PAGE_SIZE, ptep++, pfns++) { int r;
- r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, &pfns[i]);
+ r = hmm_vma_handle_pte(walk, addr, end, pmdp, ptep, pfns);
 if (r) { /* hmm_vma_handle_pte() did pte_unmap() */ hmm_vma_walk->last = addr;
On Wed, Mar 11, 2020 at 06:36:47PM -0700, Ralph Campbell wrote:
@@ -390,8 +384,15 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp, return -EBUSY; } return hmm_pfns_fill(start, end, range, HMM_PFN_NONE);
- } else if (!pmd_present(pmd))
+ }
+
+ if (!pmd_present(pmd)) {
+ hmm_range_need_fault(hmm_vma_walk, pfns, npages, 0, &fault,
+ &write_fault);
+ if (fault || write_fault)
+ return -EFAULT;
 return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
Shouldn't this fill with HMM_PFN_NONE instead of HMM_PFN_ERROR? Otherwise, when a THP is swapped out, you will get a different value than if a PTE is swapped out and you are prefetching/snapshotting.
If this is the case then the problem is that the return -EFAULT path needs to do something else: since the above code can't trigger swap-in, it is correct to return PFN_ERROR.
I'm completely guessing, but do we need to call pmd_to_swp_entry() and handle things similarly to the pte? What swp_entries are valid for a pmd?
Do you understand this better, or know how to trigger a !pmd_present for test?
I suppose another option would be this:
if (!pmd_present(pmd)) {
	hmm_range_need_fault(hmm_vma_walk, pfns, npages, 0, &fault,
			     &write_fault);
	/*
	 * We can't handle this. Cause the PMD to be split and
	 * handle it in the pte handler.
	 */
	if (fault || write_fault)
		return 0;
	return hmm_pfns_fill(start, end, range, HMM_PFN_NONE);
}
Which, I think, must be correct, but inefficient?
Jason
From: Jason Gunthorpe jgg@mellanox.com
Currently if a special PTE is encountered hmm_range_fault() immediately returns EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing uses).
EFAULT should only be returned after testing with hmm_pte_need_fault().
Also pte_devmap() and pte_special() are exclusive, and there is no need to check IS_ENABLED: pte_special() is stubbed out to return false on unsupported architectures.
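(For reference, architectures that do not select CONFIG_ARCH_HAS_PTE_SPECIAL provide a stub along these lines in their pgtable headers; this is a sketch only and the exact definition varies per architecture, which is why the IS_ENABLED() test adds nothing:)

static inline int pte_special(pte_t pte)
{
	return 0;	/* no special bit on this architecture */
}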
Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Signed-off-by: Jason Gunthorpe jgg@mellanox.com --- mm/hmm.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/mm/hmm.c b/mm/hmm.c index f61fddf2ef6505..ca33d086bdc190 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -335,16 +335,21 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, pte_unmap(ptep); return -EBUSY; } - } else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) { - if (!is_zero_pfn(pte_pfn(pte))) { + } + + /* + * Since each architecture defines a struct page for the zero page, just + * fall through and treat it like a normal page. + */ + if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) { + hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0, &fault, + &write_fault); + if (fault || write_fault) { pte_unmap(ptep); - *pfn = range->values[HMM_PFN_SPECIAL]; return -EFAULT; } - /* - * Since each architecture defines a struct page for the zero - * page, just fall through and treat it like a normal page. - */ + *pfn = range->values[HMM_PFN_SPECIAL]; + return 0; }
*pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags;
On 3/11/20 11:35 AM, Jason Gunthorpe wrote:
From: Jason Gunthorpe jgg@mellanox.com
Currently if a special PTE is encountered hmm_range_fault() immediately returns EFAULT and sets the HMM_PFN_SPECIAL error output (which nothing uses).
EFAULT should only be returned after testing with hmm_pte_need_fault().
Also pte_devmap() and pte_special() are exclusive, and there is no need to check IS_ENABLED: pte_special() is stubbed out to return false on unsupported architectures.
Fixes: 992de9a8b751 ("mm/hmm: allow to mirror vma of a file on a DAX backed filesystem") Signed-off-by: Jason Gunthorpe jgg@mellanox.com
Reviewed-by: Ralph Campbell rcampbell@nvidia.com
mm/hmm.c | 19 ++++++++++++------- 1 file changed, 12 insertions(+), 7 deletions(-)
diff --git a/mm/hmm.c b/mm/hmm.c index f61fddf2ef6505..ca33d086bdc190 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -335,16 +335,21 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, pte_unmap(ptep); return -EBUSY; }
- } else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
if (!is_zero_pfn(pte_pfn(pte))) {
- }
- /*
* Since each architecture defines a struct page for the zero page, just
* fall through and treat it like a normal page.
*/
- if (pte_special(pte) && !is_zero_pfn(pte_pfn(pte))) {
hmm_pte_need_fault(hmm_vma_walk, orig_pfn, 0, &fault,
&write_fault);
if (fault || write_fault) { pte_unmap(ptep);
- *pfn = range->values[HMM_PFN_SPECIAL];
 return -EFAULT;
 }
/*
* Since each architecture defines a struct page for the zero
* page, just fall through and treat it like a normal page.
*/
*pfn = range->values[HMM_PFN_SPECIAL];
return 0;
}
*pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags;
pmd_to_hmm_pfn_flags() already checks pmd_protnone() and makes the cpu flags 0. If no fault is requested then the pfns should be returned with the not-valid flags.
It should not unconditionally fault if faulting is not requested.
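For reference, the helper in question looks roughly like this (a paraphrased sketch of mm/hmm.c, not the exact source), which is why hmm_vma_handle_pmd() does not need its own pmd_protnone() test before deciding whether to fault:

static inline uint64_t pmd_to_hmm_pfn_flags(struct hmm_range *range, pmd_t pmd)
{
	/* A protnone pmd contributes cpu_flags == 0 ... */
	if (pmd_protnone(pmd))
		return 0;
	/* ... otherwise it is valid, and writable if the pmd is writable */
	return pmd_write(pmd) ? range->flags[HMM_PFN_VALID] |
				range->flags[HMM_PFN_WRITE] :
				range->flags[HMM_PFN_VALID];
}

With cpu_flags == 0, hmm_range_need_fault() only requests a fault when the caller actually asked for these pages to be valid.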
Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Signed-off-by: Jason Gunthorpe jgg@mellanox.com --- mm/hmm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
Bonus patch; this one got found after I made the series.
diff --git a/mm/hmm.c b/mm/hmm.c index ca33d086bdc190..6d9da4b0f0a9f8 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -226,7 +226,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr, hmm_range_need_fault(hmm_vma_walk, pfns, npages, cpu_flags, &fault, &write_fault);
- if (pmd_protnone(pmd) || fault || write_fault) + if (fault || write_fault) return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
On 3/12/20 12:33 PM, Jason Gunthorpe wrote:
pmd_to_hmm_pfn_flags() already checks pmd_protnone() and makes the cpu flags 0. If no fault is requested then the pfns should be returned with the not-valid flags.
It should not unconditionally fault if faulting is not requested.
Fixes: 2aee09d8c116 ("mm/hmm: change hmm_vma_fault() to allow write fault on page basis") Signed-off-by: Jason Gunthorpe jgg@mellanox.com
Looks good to me. Reviewed-by: Ralph Campbell rcampbell@nvidia.com
mm/hmm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)
Bonus patch; this one got found after I made the series.
diff --git a/mm/hmm.c b/mm/hmm.c index ca33d086bdc190..6d9da4b0f0a9f8 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -226,7 +226,7 @@ static int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr, hmm_range_need_fault(hmm_vma_walk, pfns, npages, cpu_flags, &fault, &write_fault);
- if (pmd_protnone(pmd) || fault || write_fault)
if (fault || write_fault) return hmm_vma_walk_hole_(addr, end, fault, write_fault, walk);
pfn = pmd_pfn(pmd) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
On Wed, Mar 11, 2020 at 03:34:58PM -0300, Jason Gunthorpe wrote:
From: Jason Gunthorpe jgg@mellanox.com
The hmm_range_fault() flow is fairly complicated. The scheme allows the caller to specify if it needs a usable result for each page, or if it only needs the current page table status filled in. This mixture of behavior is useful for a caller that wants to build a 'prefetch around fault' algorithm.
Although we have no in-tree users of this capability, I am working on having RDMA ODP work in this manner, and removing these bugs from hmm_range_fault() is a necessary step.
The basic principles are:
If the caller did not ask for a VA to be valid then hmm_range_fault() should not fail because of that VA
If 0 is returned from hmm_range_fault() then the entire pfns array contains valid data
HMM_PFN_ERROR is set if faulting fails, or if asking for faulting would fail
A 0 return from hmm_range_fault() does not have HMM_PFN_ERROR in any VA's the caller asked to be valid
This series does not get there completely, I have a followup series closing some more complex cases.
I tested this series using Ralph's hmm tester he posted a while back, other testing would be appreciated.
Jason Gunthorpe (8): mm/hmm: add missing unmaps of the ptep during hmm_vma_handle_pte() mm/hmm: do not call hmm_vma_walk_hole() while holding a spinlock mm/hmm: add missing pfns set to hmm_vma_walk_pmd() mm/hmm: add missing call to hmm_range_need_fault() before returning EFAULT mm/hmm: reorganize how !pte_present is handled in hmm_vma_handle_pte() mm/hmm: return -EFAULT when setting HMM_PFN_ERROR on requested valid pages mm/hmm: add missing call to hmm_pte_need_fault in HMM_PFN_SPECIAL handling mm/hmm: do not check pmd_protnone twice in hmm_vma_handle_pmd()
I moved these toward linux-next; if others have remarks or tags, please feel free to continue.
mm/hmm: don't free the cached pgmap while scanning
I will respin this one.
Thank you all for the reviews!
Regards, Jason