The cost of faulting in all memory to be locked can be very high when working with large mappings. If only portions of the mapping will be used this can incur a high penalty for locking.
For the example of a large file, this is the usage pattern for a large statistical language model (and probably applies to other statistical or graphical models as well). For the security example, consider any application transacting in data that cannot be swapped out (credit card data, medical records, etc.).
This patch introduces the ability to request that pages are not pre-faulted, but are placed on the unevictable LRU when they are finally faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT flag for a VMA will cause pages faulted into that VMA to be added to the unevictable LRU when they are faulted or if they are already present, but will not cause any missing pages to be faulted in.
Exposing this new lock state means that we cannot overload the meaning of the FOLL_POPULATE flag any longer. Prior to this patch it was used to mean that the VMA for a fault was locked. This means we need the new FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE will now only control if the VMA should be populated and in the case of VM_LOCKONFAULT, it will not be set.
Signed-off-by: Eric B Munson emunson@akamai.com
Acked-by: Kirill A. Shutemov kirill.shutemov@linux.intel.com
Cc: Michal Hocko mhocko@suse.cz
Cc: Vlastimil Babka vbabka@suse.cz
Cc: Jonathan Corbet corbet@lwn.net
Cc: "Kirill A. Shutemov" kirill@shutemov.name
Cc: linux-kernel@vger.kernel.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-mm@kvack.org
Cc: linux-api@vger.kernel.org
---
 Documentation/filesystems/proc.txt |  1 +
 drivers/gpu/drm/drm_vm.c           |  8 +++++++-
 fs/proc/task_mmu.c                 |  1 +
 include/linux/mm.h                 |  2 ++
 kernel/fork.c                      |  2 +-
 mm/debug.c                         |  1 +
 mm/gup.c                           | 10 ++++++++--
 mm/huge_memory.c                   |  2 +-
 mm/hugetlb.c                       |  4 ++--
 mm/mlock.c                         |  2 +-
 mm/mmap.c                          |  2 +-
 mm/rmap.c                          |  6 ++++--
 12 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 6f7fafd..ed21989 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -463,6 +463,7 @@ manner. The codes are the following:
     rr  - random read advise provided
     dc  - do not copy area on fork
     de  - do not expand area on remapping
+    lf  - mark area to lock pages when faulted in, do not pre-populate
     ac  - area is accountable
     nr  - swap space is not reserved for the area
     ht  - area uses huge tlb pages
diff --git a/drivers/gpu/drm/drm_vm.c b/drivers/gpu/drm/drm_vm.c
index aab49ee..103a5f6 100644
--- a/drivers/gpu/drm/drm_vm.c
+++ b/drivers/gpu/drm/drm_vm.c
@@ -699,9 +699,15 @@ int drm_vma_info(struct seq_file *m, void *data)
 		   (void *)(unsigned long)virt_to_phys(high_memory));

 	list_for_each_entry(pt, &dev->vmalist, head) {
+		char lock_flag = '-';
+
 		vma = pt->vma;
 		if (!vma)
 			continue;
+		if (vma->vm_flags & VM_LOCKONFAULT)
+			lock_flag = 'f';
+		else if (vma->vm_flags & VM_LOCKED)
+			lock_flag = 'l';
 		seq_printf(m,
 			   "\n%5d 0x%pK-0x%pK %c%c%c%c%c%c 0x%08lx000",
 			   pt->pid,
@@ -710,7 +716,7 @@ int drm_vma_info(struct seq_file *m, void *data)
 			   vma->vm_flags & VM_WRITE ? 'w' : '-',
 			   vma->vm_flags & VM_EXEC ? 'x' : '-',
 			   vma->vm_flags & VM_MAYSHARE ? 's' : 'p',
-			   vma->vm_flags & VM_LOCKED ? 'l' : '-',
+			   lock_flag,
 			   vma->vm_flags & VM_IO ? 'i' : '-',
 			   vma->vm_pgoff);

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ca1e091..8dcc297 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -585,6 +585,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 		[ilog2(VM_RAND_READ)]	= "rr",
 		[ilog2(VM_DONTCOPY)]	= "dc",
 		[ilog2(VM_DONTEXPAND)]	= "de",
+		[ilog2(VM_LOCKONFAULT)]	= "lf",
 		[ilog2(VM_ACCOUNT)]	= "ac",
 		[ilog2(VM_NORESERVE)]	= "nr",
 		[ilog2(VM_HUGETLB)]	= "ht",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2e872f9..d6e1637 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -136,6 +136,7 @@ extern unsigned int kobjsize(const void *objp);

 #define VM_DONTCOPY	0x00020000	/* Do not copy this vma on fork */
 #define VM_DONTEXPAND	0x00040000	/* Cannot expand with mremap() */
+#define VM_LOCKONFAULT	0x00080000	/* Lock the pages covered when they are faulted in */
 #define VM_ACCOUNT	0x00100000	/* Is a VM accounted object */
 #define VM_NORESERVE	0x00200000	/* should the VM suppress accounting */
 #define VM_HUGETLB	0x00400000	/* Huge TLB Page VM */
@@ -2043,6 +2044,7 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
 #define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
 #define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
+#define FOLL_MLOCK	0x1000	/* lock present pages */

 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
diff --git a/kernel/fork.c b/kernel/fork.c
index dbd9b8d..a949228 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -454,7 +454,7 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 		tmp->vm_mm = mm;
 		if (anon_vma_fork(tmp, mpnt))
 			goto fail_nomem_anon_vma_fork;
-		tmp->vm_flags &= ~VM_LOCKED;
+		tmp->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
 		tmp->vm_next = tmp->vm_prev = NULL;
 		file = tmp->vm_file;
 		if (file) {
diff --git a/mm/debug.c b/mm/debug.c
index 76089dd..25176bb 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -121,6 +121,7 @@ static const struct trace_print_flags vmaflags_names[] = {
 	{VM_GROWSDOWN,			"growsdown"	},
 	{VM_PFNMAP,			"pfnmap"	},
 	{VM_DENYWRITE,			"denywrite"	},
+	{VM_LOCKONFAULT,		"lockonfault"	},
 	{VM_LOCKED,			"locked"	},
 	{VM_IO,				"io"		},
 	{VM_SEQ_READ,			"seqread"	},
diff --git a/mm/gup.c b/mm/gup.c
index 6297f6b..dce6ccd 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -92,7 +92,7 @@ retry:
 		 */
 		mark_page_accessed(page);
 	}
-	if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) {
+	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		/*
 		 * The preliminary mapping check is mainly to avoid the
 		 * pointless overhead of lock_page on the ZERO_PAGE
@@ -265,6 +265,9 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 	unsigned int fault_flags = 0;
 	int ret;

+	/* mlock all present pages, but do not fault in new pages */
+	if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
+		return -ENOENT;
 	/* For mm_populate(), just skip the stack guard page. */
 	if ((*flags & FOLL_POPULATE) &&
 			(stack_guard_page_start(vma, address) ||
@@ -850,7 +853,10 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 	VM_BUG_ON_VMA(end > vma->vm_end, vma);
 	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);

-	gup_flags = FOLL_TOUCH | FOLL_POPULATE;
+	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
+	if (vma->vm_flags & VM_LOCKONFAULT)
+		gup_flags &= ~FOLL_POPULATE;
+
 	/*
 	 * We want to touch writable mappings with a write fault in order
 	 * to break COW, except for shared mappings because these don't COW
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 097c7a4..cba783e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1238,7 +1238,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 					  pmd, _pmd, 1))
 			update_mmu_cache_pmd(vma, addr, pmd);
 	}
-	if ((flags & FOLL_POPULATE) && (vma->vm_flags & VM_LOCKED)) {
+	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
 		if (page->mapping && trylock_page(page)) {
 			lru_add_drain();
 			if (page->mapping)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a8c3087..4ed9e93 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3764,8 +3764,8 @@ static unsigned long page_table_shareable(struct vm_area_struct *svma,
 	unsigned long s_end = sbase + PUD_SIZE;

 	/* Allow segments to share if only one is marked locked */
-	unsigned long vm_flags = vma->vm_flags & ~VM_LOCKED;
-	unsigned long svm_flags = svma->vm_flags & ~VM_LOCKED;
+	unsigned long vm_flags = vma->vm_flags & ~(VM_LOCKED|VM_LOCKONFAULT);
+	unsigned long svm_flags = svma->vm_flags & ~(VM_LOCKED|VM_LOCKONFAULT);

 	/*
 	 * match the virtual addresses, permission and the alignment of the
diff --git a/mm/mlock.c b/mm/mlock.c
index 3094f27..029a75b 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -422,7 +422,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 void munlock_vma_pages_range(struct vm_area_struct *vma,
 			     unsigned long start, unsigned long end)
 {
-	vma->vm_flags &= ~VM_LOCKED;
+	vma->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);

 	while (start < end) {
 		struct page *page = NULL;
diff --git a/mm/mmap.c b/mm/mmap.c
index aa632ad..bdbefc3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1651,7 +1651,7 @@ out:
 					vma == get_gate_vma(current->mm)))
 			mm->locked_vm += (len >> PAGE_SHIFT);
 		else
-			vma->vm_flags &= ~VM_LOCKED;
+			vma->vm_flags &= ~(VM_LOCKED | VM_LOCKONFAULT);
 	}

 	if (file)
diff --git a/mm/rmap.c b/mm/rmap.c
index 171b687..14ce002 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -744,7 +744,8 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,

 		if (vma->vm_flags & VM_LOCKED) {
 			spin_unlock(ptl);
-			pra->vm_flags |= VM_LOCKED;
+			pra->vm_flags |=
+				(vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT));
 			return SWAP_FAIL; /* To break the loop */
 		}

@@ -765,7 +766,8 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,

 		if (vma->vm_flags & VM_LOCKED) {
 			pte_unmap_unlock(pte, ptl);
-			pra->vm_flags |= VM_LOCKED;
+			pra->vm_flags |=
+				(vma->vm_flags & (VM_LOCKED | VM_LOCKONFAULT));
 			return SWAP_FAIL; /* To break the loop */
 		}
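The flag handling in the mm/gup.c hunks above can be modelled in userspace roughly as follows. This is a minimal sketch, not kernel code: the VM_LOCKONFAULT and FOLL_MLOCK values are the ones the patch defines, while the remaining constants are placeholders for this model and the three helper names (populate_gup_flags, should_fault_in, should_mlock_present_page) are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Values taken from the patch. */
#define VM_LOCKONFAULT 0x00080000UL
#define FOLL_MLOCK     0x1000U

/* Placeholder values for this model only. */
#define VM_LOCKED      0x00002000UL
#define FOLL_TOUCH     0x02U
#define FOLL_POPULATE  0x40U

/* Mirrors the gup_flags setup the patch adds to populate_vma_page_range(). */
unsigned int populate_gup_flags(unsigned long vm_flags)
{
	unsigned int gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;

	if (vm_flags & VM_LOCKONFAULT)
		gup_flags &= ~FOLL_POPULATE;
	return gup_flags;
}

/*
 * Mirrors the early return the patch adds to faultin_page(): with
 * FOLL_MLOCK set but FOLL_POPULATE clear, missing pages are not faulted in.
 */
bool should_fault_in(unsigned int gup_flags)
{
	return (gup_flags & (FOLL_POPULATE | FOLL_MLOCK)) != FOLL_MLOCK;
}

/* Mirrors the changed test in follow_page_pte(): present pages get mlocked. */
bool should_mlock_present_page(unsigned int gup_flags, unsigned long vm_flags)
{
	return (gup_flags & FOLL_MLOCK) && (vm_flags & VM_LOCKED);
}
```

In this model a VMA with VM_LOCKED | VM_LOCKONFAULT still has its present pages mlocked, but nothing new is faulted in, and VM_LOCKONFAULT without VM_LOCKED locks nothing, which matches the behaviour the commit message describes.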
On Sun 09-08-15 01:22:53, Eric B Munson wrote:
The cost of faulting in all memory to be locked can be very high when working with large mappings. If only portions of the mapping will be used this can incur a high penalty for locking.
For the example of a large file, this is the usage pattern for a large statistical language model (and probably applies to other statistical or graphical models as well). For the security example, consider any application transacting in data that cannot be swapped out (credit card data, medical records, etc.).
This patch introduces the ability to request that pages are not pre-faulted, but are placed on the unevictable LRU when they are finally faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED and has no effect when set without VM_LOCKED.
I do not like this very much to be honest. We have only a few bits left there and it seems this is not really necessary. I thought that LOCKONFAULT acts as a modifier to the mlock call to tell whether to populate or not. The only place we have to persist it is mlockall(MCL_FUTURE) AFAICS. And this can be handled by an additional field in the mm_struct. This could be handled at the __mm_populate level. So unless I am missing something this would be much easier in the end, and no new bit in VM flags would be necessary.
This would obviously mean that the LOCKONFAULT couldn't be exported to the userspace but is this really necessary?
On Wed, 12 Aug 2015, Michal Hocko wrote:
On Sun 09-08-15 01:22:53, Eric B Munson wrote:
The cost of faulting in all memory to be locked can be very high when working with large mappings. If only portions of the mapping will be used this can incur a high penalty for locking.
For the example of a large file, this is the usage pattern for a large statistical language model (and probably applies to other statistical or graphical models as well). For the security example, consider any application transacting in data that cannot be swapped out (credit card data, medical records, etc.).
This patch introduces the ability to request that pages are not pre-faulted, but are placed on the unevictable LRU when they are finally faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED and has no effect when set without VM_LOCKED.
I do not like this very much to be honest. We have only a few bits left there and it seems this is not really necessary. I thought that LOCKONFAULT acts as a modifier to the mlock call to tell whether to populate or not. The only place we have to persist it is mlockall(MCL_FUTURE) AFAICS. And this can be handled by an additional field in the mm_struct. This could be handled at the __mm_populate level. So unless I am missing something this would be much easier in the end, and no new bit in VM flags would be necessary.
This would obviously mean that the LOCKONFAULT couldn't be exported to the userspace but is this really necessary?
Sorry for the latency here, I was on vacation and am now at plumbers.
I am not sure that growing the mm_struct by another flags field instead of using available bits in the vm_flags is the right choice. After this patch, we still have 3 free bits on 32 bit architectures (2 after the userfaultfd set IIRC). The group which asked for this feature here wants the ability to distinguish between LOCKED and LOCKONFAULT regions and without the VMA flag there isn't a way to do that.
Do we know that these last two open flags are needed right now or is this speculation that they will be and that none of the other VMA flags can be reclaimed?
On 08/19/2015 11:33 PM, Eric B Munson wrote:
On Wed, 12 Aug 2015, Michal Hocko wrote:
On Sun 09-08-15 01:22:53, Eric B Munson wrote:
I do not like this very much to be honest. We have only a few bits left there and it seems this is not really necessary. I thought that LOCKONFAULT acts as a modifier to the mlock call to tell whether to populate or not. The only place we have to persist it is mlockall(MCL_FUTURE) AFAICS. And this can be handled by an additional field in the mm_struct. This could be handled at the __mm_populate level. So unless I am missing something this would be much easier in the end, and no new bit in VM flags would be necessary.
This would obviously mean that the LOCKONFAULT couldn't be exported to the userspace but is this really necessary?
Sorry for the latency here, I was on vacation and am now at plumbers.
I am not sure that growing the mm_struct by another flags field instead of using available bits in the vm_flags is the right choice.
I was making the same objection on one of the earlier versions and since you stuck with a new vm flag, I thought it didn't matter, as we could change it later if we ran out of bits. But now I realize that since you export this difference to userspace (and below you say that it's by request), we won't be able to change it later. So it's a more difficult choice.
After this patch, we still have 3 free bits on 32 bit architectures (2 after the userfaultfd set IIRC). The group which asked for this feature here wants the ability to distinguish between LOCKED and LOCKONFAULT regions and without the VMA flag there isn't a way to do that.
Do we know that these last two open flags are needed right now or is this speculation that they will be and that none of the other VMA flags can be reclaimed?
I think it's the latter, we can expect that flags will be added rather than removed, as removal is hard or impossible.
On Wed 19-08-15 17:33:45, Eric B Munson wrote: [...]
The group which asked for this feature here wants the ability to distinguish between LOCKED and LOCKONFAULT regions and without the VMA flag there isn't a way to do that.
Could you be more specific on why this is needed?
Do we know that these last two open flags are needed right now or is this speculation that they will be and that none of the other VMA flags can be reclaimed?
I do not think they are needed by anybody right now but that is not a reason why it should be used without a really strong justification. If the discoverability is really needed then fair enough but I haven't seen any justification for that yet.
On Thu, 20 Aug 2015, Michal Hocko wrote:
On Wed 19-08-15 17:33:45, Eric B Munson wrote: [...]
The group which asked for this feature here wants the ability to distinguish between LOCKED and LOCKONFAULT regions and without the VMA flag there isn't a way to do that.
Could you be more specific on why this is needed?
They want to keep metrics on the amount of memory used in a LOCKONFAULT region versus the address space of the region.
Do we know that these last two open flags are needed right now or is this speculation that they will be and that none of the other VMA flags can be reclaimed?
I do not think they are needed by anybody right now but that is not a reason why it should be used without a really strong justification. If the discoverability is really needed then fair enough but I haven't seen any justification for that yet.
To be completely clear you believe that if the metrics collection is not a strong enough justification, it is better to expand the mm_struct by another unsigned long than to use one of these bits right?
On Thu 20-08-15 13:03:09, Eric B Munson wrote:
On Thu, 20 Aug 2015, Michal Hocko wrote:
On Wed 19-08-15 17:33:45, Eric B Munson wrote: [...]
The group which asked for this feature here wants the ability to distinguish between LOCKED and LOCKONFAULT regions and without the VMA flag there isn't a way to do that.
Could you be more specific on why this is needed?
They want to keep metrics on the amount of memory used in a LOCKONFAULT region versus the address space of the region.
/proc/<pid>/smaps already exports that information AFAICS. It exports VMA flags including VM_LOCKED and if rss < size then this is clearly LOCKONFAULT because the standard mlock semantic is to populate. Would that be sufficient?
Now, it is true that LOCKONFAULT wouldn't be distinguishable from MAP_LOCKED which failed to populate but does that really matter? It is LOCKONFAULT in a way as well.
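The heuristic suggested above (VM_LOCKED set and rss < size) could be sketched in userspace like this. It is an illustration under assumptions, not an existing tool: the function names are hypothetical, the input is the text of a single /proc/<pid>/smaps entry, and "lo" is taken to be the VmFlags code for VM_LOCKED. As noted above, it cannot tell lock on fault apart from a MAP_LOCKED mapping that failed to populate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Read the kB value of a field such as "\nSize:"; returns -1 if absent. */
long smaps_kb(const char *entry, const char *key)
{
	const char *p = strstr(entry, key);
	long kb;

	if (!p || sscanf(p + strlen(key), "%ld", &kb) != 1)
		return -1;
	return kb;
}

/* Check whether the VmFlags line of the entry contains the given code. */
bool smaps_has_flag(const char *entry, const char *flag)
{
	const char *p = strstr(entry, "\nVmFlags:");

	if (!p)
		return false;
	p += strlen("\nVmFlags:");
	while (*p && *p != '\n') {
		size_t n;

		while (*p == ' ')
			p++;
		n = strcspn(p, " \n");	/* length of the next token */
		if (n == strlen(flag) && strncmp(p, flag, n) == 0)
			return true;
		p += n;
	}
	return false;
}

/* Heuristic from the discussion: locked VMA that is not fully resident. */
bool looks_lock_on_fault(const char *entry)
{
	long size = smaps_kb(entry, "\nSize:");
	long rss = smaps_kb(entry, "\nRss:");

	return size >= 0 && rss >= 0 && rss < size &&
	       smaps_has_flag(entry, "lo");
}
```

The keys carry a leading newline so that "Size:" does not match inside fields such as "KernelPageSize:".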
Do we know that these last two open flags are needed right now or is this speculation that they will be and that none of the other VMA flags can be reclaimed?
I do not think they are needed by anybody right now but that is not a reason why it should be used without a really strong justification. If the discoverability is really needed then fair enough but I haven't seen any justification for that yet.
To be completely clear you believe that if the metrics collection is not a strong enough justification, it is better to expand the mm_struct by another unsigned long than to use one of these bits right?
A simple bool is sufficient for that. And yes I think we should go with per mm_struct flag rather than the additional vma flag if it has only the global (whole address space) scope - which would be the case if the LOCKONFAULT is always an mlock modifier and the persistence is needed only for MCL_FUTURE. Which is imho a sane semantic.
On Fri, 21 Aug 2015, Michal Hocko wrote:
On Thu 20-08-15 13:03:09, Eric B Munson wrote:
On Thu, 20 Aug 2015, Michal Hocko wrote:
On Wed 19-08-15 17:33:45, Eric B Munson wrote: [...]
The group which asked for this feature here wants the ability to distinguish between LOCKED and LOCKONFAULT regions and without the VMA flag there isn't a way to do that.
Could you be more specific on why this is needed?
They want to keep metrics on the amount of memory used in a LOCKONFAULT region versus the address space of the region.
/proc/<pid>/smaps already exports that information AFAICS. It exports VMA flags including VM_LOCKED and if rss < size then this is clearly LOCKONFAULT because the standard mlock semantic is to populate. Would that be sufficient?
Now, it is true that LOCKONFAULT wouldn't be distinguishable from MAP_LOCKED which failed to populate but does that really matter? It is LOCKONFAULT in a way as well.
Does that matter to my users? No, they do not use MAP_LOCKED at all so any VMA with VM_LOCKED set and rss < size is lock on fault. Will it matter to others? I suspect so, but these are likely to be the same group of users which will be surprised to learn that MAP_LOCKED does not guarantee that the entire range is faulted in on return from mmap.
Do we know that these last two open flags are needed right now or is this speculation that they will be and that none of the other VMA flags can be reclaimed?
I do not think they are needed by anybody right now but that is not a reason why it should be used without a really strong justification. If the discoverability is really needed then fair enough but I haven't seen any justification for that yet.
To be completely clear you believe that if the metrics collection is not a strong enough justification, it is better to expand the mm_struct by another unsigned long than to use one of these bits right?
A simple bool is sufficient for that. And yes I think we should go with per mm_struct flag rather than the additional vma flag if it has only the global (whole address space) scope - which would be the case if the LOCKONFAULT is always an mlock modifier and the persistence is needed only for MCL_FUTURE. Which is imho a sane semantic.
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:

    addr = mmap(len, MAP_ANONYMOUS, ...);
    mlock(addr, len, MLOCK_ONFAULT);
    ...
    mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
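The problem described above can be illustrated with a toy userspace model. Everything here is hypothetical (struct toy_vma, toy_mremap_grow and resident_after_grow are names invented for this sketch, and VM_LOCKED uses a placeholder value); it only shows that a grow operation which sees VM_LOCKED alone has no choice but to populate the extension, while a per-VMA VM_LOCKONFAULT bit lets it skip that step.

```c
#include <assert.h>
#include <stddef.h>

#define VM_LOCKED      0x00002000UL	/* placeholder value for this model */
#define VM_LOCKONFAULT 0x00080000UL	/* value proposed in the patch */

/* Toy stand-in for a VMA: flags, size in pages, pages currently resident. */
struct toy_vma {
	unsigned long vm_flags;
	size_t pages;
	size_t resident;
};

/*
 * Grow the toy VMA to new_pages (assumed >= vma->pages) and return the
 * number of resident pages afterwards. A locked VMA without the
 * lock-on-fault bit gets its extension populated immediately.
 */
size_t toy_mremap_grow(struct toy_vma *vma, size_t new_pages)
{
	size_t added = new_pages - vma->pages;

	vma->pages = new_pages;
	if ((vma->vm_flags & VM_LOCKED) && !(vma->vm_flags & VM_LOCKONFAULT))
		vma->resident += added;
	return vma->resident;
}

/* Convenience wrapper so the two cases can be compared in one expression. */
size_t resident_after_grow(unsigned long vm_flags, size_t pages,
			   size_t resident, size_t new_pages)
{
	struct toy_vma vma = { vm_flags, pages, resident };

	return toy_mremap_grow(&vma, new_pages);
}
```

If the lock-on-fault state lives only in the mm_struct, the VMA carries just VM_LOCKED, so growing a 4-page region with 1 resident page to 8 pages leaves 5 pages resident in this model; with the per-VMA bit set the count stays at 1.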
On Fri, Aug 21, 2015 at 9:31 PM, Eric B Munson emunson@akamai.com wrote:
On Fri, 21 Aug 2015, Michal Hocko wrote:
On Thu 20-08-15 13:03:09, Eric B Munson wrote:
On Thu, 20 Aug 2015, Michal Hocko wrote:
On Wed 19-08-15 17:33:45, Eric B Munson wrote: [...]
The group which asked for this feature here wants the ability to distinguish between LOCKED and LOCKONFAULT regions and without the VMA flag there isn't a way to do that.
Could you be more specific on why this is needed?
They want to keep metrics on the amount of memory used in a LOCKONFAULT region versus the address space of the region.
/proc/<pid>/smaps already exports that information AFAICS. It exports VMA flags including VM_LOCKED and if rss < size then this is clearly LOCKONFAULT because the standard mlock semantic is to populate. Would that be sufficient?
Now, it is true that LOCKONFAULT wouldn't be distinguishable from MAP_LOCKED which failed to populate but does that really matter? It is LOCKONFAULT in a way as well.
Does that matter to my users? No, they do not use MAP_LOCKED at all so any VMA with VM_LOCKED set and rss < size is lock on fault. Will it matter to others? I suspect so, but these are likely to be the same group of users which will be surprised to learn that MAP_LOCKED does not guarantee that the entire range is faulted in on return from mmap.
Do we know that these last two open flags are needed right now or is this speculation that they will be and that none of the other VMA flags can be reclaimed?
I do not think they are needed by anybody right now but that is not a reason why it should be used without a really strong justification. If the discoverability is really needed then fair enough but I haven't seen any justification for that yet.
To be completely clear you believe that if the metrics collection is not a strong enough justification, it is better to expand the mm_struct by another unsigned long than to use one of these bits right?
A simple bool is sufficient for that. And yes I think we should go with per mm_struct flag rather than the additional vma flag if it has only the global (whole address space) scope - which would be the case if the LOCKONFAULT is always an mlock modifier and the persistence is needed only for MCL_FUTURE. Which is imho a sane semantic.
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:

    addr = mmap(len, MAP_ANONYMOUS, ...);
    mlock(addr, len, MLOCK_ONFAULT);
    ...
    mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
remap can count filled ptes and prefault only completely populated areas.
There might be a problem after failed populate: remap will handle them as lock on fault. In this case we can fill ptes with swap-like non-present entries to remember that fact and count them as should-be-locked pages.
On 08/24/2015 12:17 PM, Konstantin Khlebnikov wrote:
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:

    addr = mmap(len, MAP_ANONYMOUS, ...);
    mlock(addr, len, MLOCK_ONFAULT);
    ...
    mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
remap can count filled ptes and prefault only completely populated areas.
Does (and should) mremap really prefault non-present pages? Shouldn't it just prepare the page tables and that's it?
There might be a problem after failed populate: remap will handle them as lock on fault. In this case we can fill ptes with swap-like non-present entries to remember that fact and count them as should-be-locked pages.
I don't think we should strive to have mremap try to fix the inherent unreliability of mmap (MAP_POPULATE)?
On Mon, Aug 24, 2015 at 4:30 PM, Vlastimil Babka vbabka@suse.cz wrote:
On 08/24/2015 12:17 PM, Konstantin Khlebnikov wrote:
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:

    addr = mmap(len, MAP_ANONYMOUS, ...);
    mlock(addr, len, MLOCK_ONFAULT);
    ...
    mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
remap can count filled ptes and prefault only completely populated areas.
Does (and should) mremap really prefault non-present pages? Shouldn't it just prepare the page tables and that's it?
As far as I can see, mremap prefaults pages when it extends an mlocked area.

Also, a quote from the manpage: "If the memory segment specified by old_address and old_size is locked (using mlock(2) or similar), then this lock is maintained when the segment is resized and/or relocated. As a consequence, the amount of memory locked by the process may change."
There might be a problem after failed populate: remap will handle them as lock on fault. In this case we can fill ptes with swap-like non-present entries to remember that fact and count them as should-be-locked pages.
I don't think we should strive to have mremap try to fix the inherent unreliability of mmap (MAP_POPULATE)?
I don't think so. MAP_POPULATE works only when mmap happens. Flag MREMAP_POPULATE might be a good idea. Just for symmetry.
On 08/24/2015 03:50 PM, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 4:30 PM, Vlastimil Babka vbabka@suse.cz wrote:
On 08/24/2015 12:17 PM, Konstantin Khlebnikov wrote:
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:

    addr = mmap(len, MAP_ANONYMOUS, ...);
    mlock(addr, len, MLOCK_ONFAULT);
    ...
    mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
remap can count filled ptes and prefault only completely populated areas.
Does (and should) mremap really prefault non-present pages? Shouldn't it just prepare the page tables and that's it?
As far as I can see, mremap prefaults pages when it extends an mlocked area.

Also, a quote from the manpage: "If the memory segment specified by old_address and old_size is locked (using mlock(2) or similar), then this lock is maintained when the segment is resized and/or relocated. As a consequence, the amount of memory locked by the process may change."
Oh, right... Well that looks like a convincing argument for having a sticky VM_LOCKONFAULT after all. Having mremap guess by scanning existing PTEs would slow it down, and be unreliable (was the area completely populated because MLOCK_ONFAULT was not used or because the process faulted it already? Was it not populated because MLOCK_ONFAULT was used, or because mmap(MAP_LOCKED) failed to populate it all?).
The only sane alternative is to populate always for mremap() of VM_LOCKED areas, and document this loss of MLOCK_ONFAULT information as a limitation of mlock2(MLOCK_ONFAULT). Which might or might not be enough for Eric's usecase, but it's somewhat ugly.
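The ambiguity of guessing from populated PTEs, as described above, can be shown with a small model. This is a sketch under assumptions: the enum, resident_pages and guess_mode are hypothetical names, and "touched" stands for the number of pages the process has faulted in itself (assumed not to exceed the region size).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum lock_mode { LOCK_POPULATE, LOCK_ON_FAULT };

/*
 * Resident page count for a locked region of `pages` pages. A populating
 * lock makes everything resident unless population failed; otherwise only
 * the pages the process touched are resident.
 */
size_t resident_pages(enum lock_mode mode, size_t pages, size_t touched,
		      bool populate_failed)
{
	if (mode == LOCK_POPULATE && !populate_failed)
		return pages;
	return touched;
}

/* The guess a PTE-counting mremap would have to make. */
enum lock_mode guess_mode(size_t pages, size_t resident)
{
	return resident < pages ? LOCK_ON_FAULT : LOCK_POPULATE;
}
```

A lock-on-fault region whose pages have all been touched is guessed as a populating lock, and a populating lock whose population failed is guessed as lock on fault, so the guess is wrong in exactly the two cases raised above.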
There might be a problem after failed populate: remap will handle them as lock on fault. In this case we can fill ptes with swap-like non-present entries to remember that fact and count them as should-be-locked pages.
I don't think we should strive to have mremap try to fix the inherent unreliability of mmap (MAP_POPULATE)?
I don't think so. MAP_POPULATE works only when mmap happens. Flag MREMAP_POPULATE might be a good idea. Just for symmetry.
Maybe, but please do it as a separate series.
On Mon, 24 Aug 2015, Vlastimil Babka wrote:
On 08/24/2015 03:50 PM, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 4:30 PM, Vlastimil Babka vbabka@suse.cz wrote:
On 08/24/2015 12:17 PM, Konstantin Khlebnikov wrote:
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:

    addr = mmap(len, MAP_ANONYMOUS, ...);
    mlock(addr, len, MLOCK_ONFAULT);
    ...
    mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
remap can count filled ptes and prefault only completely populated areas.
Does (and should) mremap really prefault non-present pages? Shouldn't it just prepare the page tables and that's it?
As far as I can see, mremap prefaults pages when it extends an mlocked area.

Also, a quote from the manpage: "If the memory segment specified by old_address and old_size is locked (using mlock(2) or similar), then this lock is maintained when the segment is resized and/or relocated. As a consequence, the amount of memory locked by the process may change."
Oh, right... Well that looks like a convincing argument for having a sticky VM_LOCKONFAULT after all. Having mremap guess by scanning existing PTEs would slow it down, and be unreliable (was the area completely populated because MLOCK_ONFAULT was not used or because the process faulted it already? Was it not populated because MLOCK_ONFAULT was used, or because mmap(MAP_LOCKED) failed to populate it all?).
Given this, I am going to stop working on v8 and leave the vma flag in place.
The only sane alternative is to populate always for mremap() of VM_LOCKED areas, and document this loss of MLOCK_ONFAULT information as a limitation of mlock2(MLOCK_ONFAULT). Which might or might not be enough for Eric's usecase, but it's somewhat ugly.
I don't think that this is the right solution, I would be really surprised as a user if an area I locked with MLOCK_ONFAULT was then fully locked and prepopulated after mremap().
There might be a problem after failed populate: remap will handle them as lock on fault. In this case we can fill ptes with swap-like non-present entries to remember that fact and count them as should-be-locked pages.
I don't think we should strive to have mremap try to fix the inherent unreliability of mmap (MAP_POPULATE)?
I don't think so. MAP_POPULATE works only when mmap happens. Flag MREMAP_POPULATE might be a good idea. Just for symmetry.
Maybe, but please do it as a separate series.
On Mon, Aug 24, 2015 at 6:09 PM, Eric B Munson emunson@akamai.com wrote:
On Mon, 24 Aug 2015, Vlastimil Babka wrote:
On 08/24/2015 03:50 PM, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 4:30 PM, Vlastimil Babka vbabka@suse.cz wrote:
On 08/24/2015 12:17 PM, Konstantin Khlebnikov wrote:
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:
addr = mmap(len, MAP_ANONYMOUS, ...); mlock(addr, len, MLOCK_ONFAULT); ... mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
remap can count filled ptes and prefault only completely populated areas.
Does (and should) mremap really prefault non-present pages? Shouldn't it just prepare the page tables and that's it?
As I see mremap prefaults pages when it extends mlocked area.
Also quote from manpage : If the memory segment specified by old_address and old_size is locked : (using mlock(2) or similar), then this lock is maintained when the segment is : resized and/or relocated. As a consequence, the amount of memory locked : by the process may change.
Oh, right... Well that looks like a convincing argument for having a sticky VM_LOCKONFAULT after all. Having mremap guess by scanning existing pte's would slow it down, and be unreliable (was the area completely populated because MLOCK_ONFAULT was not used or because the process faulted it already? Was it not populated because MLOCK_ONFAULT was used, or because mmap(MAP_LOCKED) failed to populate it all?).
Given this, I am going to stop working in v8 and leave the vma flag in place.
The only sane alternative is to populate always for mremap() of VM_LOCKED areas, and document this loss of MLOCK_ONFAULT information as a limitation of mlock2(MLOCK_ONFAULT). Which might or might not be enough for Eric's usecase, but it's somewhat ugly.
I don't think that this is the right solution, I would be really surprised as a user if an area I locked with MLOCK_ONFAULT was then fully locked and prepopulated after mremap().
If mremap is the only problem then we can add opposite flag for it:
"MREMAP_NOPOPULATE" - do not populate new segment of locked areas - do not copy normal areas if possible (anonymous/special must be copied)
addr = mmap(len, MAP_ANONYMOUS, ...); mlock(addr, len, MLOCK_ONFAULT); ... addr2 = mremap(addr, len, 2 * len, MREMAP_NOPOPULATE); ...
There might be a problem after failed populate: remap will handle them as lock on fault. In this case we can fill ptes with swap-like non-present entries to remember that fact and count them as should-be-locked pages.
I don't think we should strive to have mremap try to fix the inherent unreliability of mmap (MAP_POPULATE)?
I don't think so. MAP_POPULATE works only when mmap happens. Flag MREMAP_POPULATE might be a good idea. Just for symmetry.
Maybe, but please do it as a separate series.
On Mon, 24 Aug 2015, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 6:09 PM, Eric B Munson emunson@akamai.com wrote:
On Mon, 24 Aug 2015, Vlastimil Babka wrote:
On 08/24/2015 03:50 PM, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 4:30 PM, Vlastimil Babka vbabka@suse.cz wrote:
On 08/24/2015 12:17 PM, Konstantin Khlebnikov wrote:
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:
    addr = mmap(len, MAP_ANONYMOUS, ...);
    mlock(addr, len, MLOCK_ONFAULT);
    ...
    mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
remap can count filled ptes and prefault only completely populated areas.
Does (and should) mremap really prefault non-present pages? Shouldn't it just prepare the page tables and that's it?
As I see mremap prefaults pages when it extends mlocked area.
Also quote from manpage : If the memory segment specified by old_address and old_size is locked : (using mlock(2) or similar), then this lock is maintained when the segment is : resized and/or relocated. As a consequence, the amount of memory locked : by the process may change.
Oh, right... Well that looks like a convincing argument for having a sticky VM_LOCKONFAULT after all. Having mremap guess by scanning existing pte's would slow it down, and be unreliable (was the area completely populated because MLOCK_ONFAULT was not used or because the process faulted it already? Was it not populated because MLOCK_ONFAULT was used, or because mmap(MAP_LOCKED) failed to populate it all?).
Given this, I am going to stop working in v8 and leave the vma flag in place.
The only sane alternative is to populate always for mremap() of VM_LOCKED areas, and document this loss of MLOCK_ONFAULT information as a limitation of mlock2(MLOCK_ONFAULT). Which might or might not be enough for Eric's usecase, but it's somewhat ugly.
I don't think that this is the right solution, I would be really surprised as a user if an area I locked with MLOCK_ONFAULT was then fully locked and prepopulated after mremap().
If mremap is the only problem then we can add opposite flag for it:
"MREMAP_NOPOPULATE"
- do not populate new segment of locked areas
- do not copy normal areas if possible (anonymous/special must be copied)
addr = mmap(len, MAP_ANONYMOUS, ...); mlock(addr, len, MLOCK_ONFAULT); ... addr2 = mremap(addr, len, 2 * len, MREMAP_NOPOPULATE); ...
But with this, the user must remember what areas are locked with MLOCK_LOCKONFAULT and which are locked with prepopulate so the correct mremap flags can be used.
On Mon, Aug 24, 2015 at 6:55 PM, Eric B Munson emunson@akamai.com wrote:
On Mon, 24 Aug 2015, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 6:09 PM, Eric B Munson emunson@akamai.com wrote:
On Mon, 24 Aug 2015, Vlastimil Babka wrote:
On 08/24/2015 03:50 PM, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 4:30 PM, Vlastimil Babka vbabka@suse.cz wrote:
On 08/24/2015 12:17 PM, Konstantin Khlebnikov wrote:
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:
    addr = mmap(len, MAP_ANONYMOUS, ...);
    mlock(addr, len, MLOCK_ONFAULT);
    ...
    mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
remap can count filled ptes and prefault only completely populated areas.
Does (and should) mremap really prefault non-present pages? Shouldn't it just prepare the page tables and that's it?
As I see mremap prefaults pages when it extends mlocked area.
Also quote from manpage : If the memory segment specified by old_address and old_size is locked : (using mlock(2) or similar), then this lock is maintained when the segment is : resized and/or relocated. As a consequence, the amount of memory locked : by the process may change.
Oh, right... Well that looks like a convincing argument for having a sticky VM_LOCKONFAULT after all. Having mremap guess by scanning existing pte's would slow it down, and be unreliable (was the area completely populated because MLOCK_ONFAULT was not used or because the process faulted it already? Was it not populated because MLOCK_ONFAULT was used, or because mmap(MAP_LOCKED) failed to populate it all?).
Given this, I am going to stop working in v8 and leave the vma flag in place.
The only sane alternative is to populate always for mremap() of VM_LOCKED areas, and document this loss of MLOCK_ONFAULT information as a limitation of mlock2(MLOCK_ONFAULT). Which might or might not be enough for Eric's usecase, but it's somewhat ugly.
I don't think that this is the right solution, I would be really surprised as a user if an area I locked with MLOCK_ONFAULT was then fully locked and prepopulated after mremap().
If mremap is the only problem then we can add opposite flag for it:
"MREMAP_NOPOPULATE"
- do not populate new segment of locked areas
- do not copy normal areas if possible (anonymous/special must be copied)
addr = mmap(len, MAP_ANONYMOUS, ...); mlock(addr, len, MLOCK_ONFAULT); ... addr2 = mremap(addr, len, 2 * len, MREMAP_NOPOPULATE); ...
But with this, the user must remember what areas are locked with MLOCK_LOCKONFAULT and which are locked with prepopulate so the correct mremap flags can be used.
Yep. Shouldn't be hard. You anyway have to do some changes in user-space.
A much simpler solution for user-space is an mm-wide flag which turns all further mlocks and MAP_LOCKED into lock-on-fault. Something like mlockall(MCL_NOPOPULATE_LOCKED).
On Mon, 24 Aug 2015, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 6:55 PM, Eric B Munson emunson@akamai.com wrote:
On Mon, 24 Aug 2015, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 6:09 PM, Eric B Munson emunson@akamai.com wrote:
On Mon, 24 Aug 2015, Vlastimil Babka wrote:
On 08/24/2015 03:50 PM, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 4:30 PM, Vlastimil Babka vbabka@suse.cz wrote:
On 08/24/2015 12:17 PM, Konstantin Khlebnikov wrote:
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:
    addr = mmap(len, MAP_ANONYMOUS, ...);
    mlock(addr, len, MLOCK_ONFAULT);
    ...
    mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
remap can count filled ptes and prefault only completely populated areas.
Does (and should) mremap really prefault non-present pages? Shouldn't it just prepare the page tables and that's it?
As I see mremap prefaults pages when it extends mlocked area.
Also quote from manpage : If the memory segment specified by old_address and old_size is locked : (using mlock(2) or similar), then this lock is maintained when the segment is : resized and/or relocated. As a consequence, the amount of memory locked : by the process may change.
Oh, right... Well that looks like a convincing argument for having a sticky VM_LOCKONFAULT after all. Having mremap guess by scanning existing pte's would slow it down, and be unreliable (was the area completely populated because MLOCK_ONFAULT was not used or because the process faulted it already? Was it not populated because MLOCK_ONFAULT was used, or because mmap(MAP_LOCKED) failed to populate it all?).
Given this, I am going to stop working in v8 and leave the vma flag in place.
The only sane alternative is to populate always for mremap() of VM_LOCKED areas, and document this loss of MLOCK_ONFAULT information as a limitation of mlock2(MLOCK_ONFAULT). Which might or might not be enough for Eric's usecase, but it's somewhat ugly.
I don't think that this is the right solution, I would be really surprised as a user if an area I locked with MLOCK_ONFAULT was then fully locked and prepopulated after mremap().
If mremap is the only problem then we can add opposite flag for it:
"MREMAP_NOPOPULATE"
- do not populate new segment of locked areas
- do not copy normal areas if possible (anonymous/special must be copied)
addr = mmap(len, MAP_ANONYMOUS, ...); mlock(addr, len, MLOCK_ONFAULT); ... addr2 = mremap(addr, len, 2 * len, MREMAP_NOPOPULATE); ...
But with this, the user must remember what areas are locked with MLOCK_LOCKONFAULT and which are locked with prepopulate so the correct mremap flags can be used.
Yep. Shouldn't be hard. You anyway have to do some changes in user-space.
Sorry if I wasn't clear enough in my last reply, I think forcing userspace to track this is the wrong choice. The VM system is responsible for tracking these attributes and should continue to be.
A much simpler solution for user-space is an mm-wide flag which turns all further mlocks and MAP_LOCKED into lock-on-fault. Something like mlockall(MCL_NOPOPULATE_LOCKED).
This set certainly adds the foundation for such a change if you think it would be useful. That particular behavior was not part of my initial use case though.
On Mon, Aug 24, 2015 at 8:00 PM, Eric B Munson emunson@akamai.com wrote:
On Mon, 24 Aug 2015, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 6:55 PM, Eric B Munson emunson@akamai.com wrote:
On Mon, 24 Aug 2015, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 6:09 PM, Eric B Munson emunson@akamai.com wrote:
On Mon, 24 Aug 2015, Vlastimil Babka wrote:
On 08/24/2015 03:50 PM, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 4:30 PM, Vlastimil Babka vbabka@suse.cz wrote:
On 08/24/2015 12:17 PM, Konstantin Khlebnikov wrote:
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:
    addr = mmap(len, MAP_ANONYMOUS, ...);
    mlock(addr, len, MLOCK_ONFAULT);
    ...
    mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
remap can count filled ptes and prefault only completely populated areas.
Does (and should) mremap really prefault non-present pages? Shouldn't it just prepare the page tables and that's it?
As I see mremap prefaults pages when it extends mlocked area.
Also quote from manpage
: If the memory segment specified by old_address and old_size is locked
: (using mlock(2) or similar), then this lock is maintained when the segment is
: resized and/or relocated. As a consequence, the amount of memory locked
: by the process may change.
Oh, right... Well that looks like a convincing argument for having a sticky VM_LOCKONFAULT after all. Having mremap guess by scanning existing pte's would slow it down, and be unreliable (was the area completely populated because MLOCK_ONFAULT was not used or because the process faulted it already? Was it not populated because MLOCK_ONFAULT was used, or because mmap(MAP_LOCKED) failed to populate it all?).
Given this, I am going to stop working in v8 and leave the vma flag in place.
The only sane alternative is to populate always for mremap() of VM_LOCKED areas, and document this loss of MLOCK_ONFAULT information as a limitation of mlock2(MLOCK_ONFAULT). Which might or might not be enough for Eric's usecase, but it's somewhat ugly.
I don't think that this is the right solution, I would be really surprised as a user if an area I locked with MLOCK_ONFAULT was then fully locked and prepopulated after mremap().
If mremap is the only problem then we can add opposite flag for it:
"MREMAP_NOPOPULATE"
- do not populate new segment of locked areas
- do not copy normal areas if possible (anonymous/special must be copied)
addr = mmap(len, MAP_ANONYMOUS, ...); mlock(addr, len, MLOCK_ONFAULT); ... addr2 = mremap(addr, len, 2 * len, MREMAP_NOPOPULATE); ...
But with this, the user must remember what areas are locked with MLOCK_LOCKONFAULT and which are locked with prepopulate so the correct mremap flags can be used.
Yep. Shouldn't be hard. You anyway have to do some changes in user-space.
Sorry if I wasn't clear enough in my last reply, I think forcing userspace to track this is the wrong choice. The VM system is responsible for tracking these attributes and should continue to be.
Userspace tracks addresses and sizes of these areas. Plus mremap obviously works only with page granularity so a memory allocator in userspace has to know a lot about these structures. So keeping one more bit isn't rocket science.
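As an illustration of what "keeping one more bit" in userspace could look like, here is a minimal sketch. It assumes the MREMAP_NOPOPULATE flag proposed earlier in this thread; that flag is only a proposal here, so the constant below is a hypothetical placeholder with an arbitrary value, and the struct and function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * HYPOTHETICAL: MREMAP_NOPOPULATE is only a proposal in this thread.
 * The value is an arbitrary placeholder, not a kernel ABI constant.
 */
#define HYPOTHETICAL_MREMAP_NOPOPULATE 0x100

/* What a userspace allocator would have to remember per locked region. */
struct tracked_region {
    void  *addr;
    size_t len;
    bool   lock_on_fault;   /* the "one more bit" */
};

/*
 * Choose the flags for a later mremap() call on a tracked region.
 * base_flags is whatever the caller would pass anyway.
 */
int remap_flags_for(const struct tracked_region *r, int base_flags)
{
    assert(r != NULL);
    if (r->lock_on_fault)
        return base_flags | HYPOTHETICAL_MREMAP_NOPOPULATE;
    return base_flags;
}
```

The sketch only computes flags and never calls mremap(), since a kernel without such a flag would be expected to reject the unknown bit. It shows the cost being debated: every caller that resizes a locked region must consult its own bookkeeping first.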
A much simpler solution for user-space is an mm-wide flag which turns all further mlocks and MAP_LOCKED into lock-on-fault. Something like mlockall(MCL_NOPOPULATE_LOCKED).
This set certainly adds the foundation for such a change if you think it would be useful. That particular behavior was not part of my initial use case though.
This looks like a much easier solution: you don't need a new syscall, and after enabling that lock-on-fault mode userspace can still get the old behaviour simply by touching the newly locked area.
On Mon, 24 Aug 2015, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 8:00 PM, Eric B Munson emunson@akamai.com wrote:
On Mon, 24 Aug 2015, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 6:55 PM, Eric B Munson emunson@akamai.com wrote:
On Mon, 24 Aug 2015, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 6:09 PM, Eric B Munson emunson@akamai.com wrote:
On Mon, 24 Aug 2015, Vlastimil Babka wrote:
On 08/24/2015 03:50 PM, Konstantin Khlebnikov wrote:
On Mon, Aug 24, 2015 at 4:30 PM, Vlastimil Babka vbabka@suse.cz wrote:
On 08/24/2015 12:17 PM, Konstantin Khlebnikov wrote:
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:
    addr = mmap(len, MAP_ANONYMOUS, ...);
    mlock(addr, len, MLOCK_ONFAULT);
    ...
    mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
remap can count filled ptes and prefault only completely populated areas.
Does (and should) mremap really prefault non-present pages? Shouldn't it just prepare the page tables and that's it?
As I see mremap prefaults pages when it extends mlocked area.
Also quote from manpage
: If the memory segment specified by old_address and old_size is locked
: (using mlock(2) or similar), then this lock is maintained when the segment is
: resized and/or relocated. As a consequence, the amount of memory locked
: by the process may change.
Oh, right... Well that looks like a convincing argument for having a sticky VM_LOCKONFAULT after all. Having mremap guess by scanning existing pte's would slow it down, and be unreliable (was the area completely populated because MLOCK_ONFAULT was not used or because the process faulted it already? Was it not populated because MLOCK_ONFAULT was used, or because mmap(MAP_LOCKED) failed to populate it all?).
Given this, I am going to stop working in v8 and leave the vma flag in place.
The only sane alternative is to populate always for mremap() of VM_LOCKED areas, and document this loss of MLOCK_ONFAULT information as a limitation of mlock2(MLOCK_ONFAULT). Which might or might not be enough for Eric's usecase, but it's somewhat ugly.
I don't think that this is the right solution, I would be really surprised as a user if an area I locked with MLOCK_ONFAULT was then fully locked and prepopulated after mremap().
If mremap is the only problem then we can add opposite flag for it:
"MREMAP_NOPOPULATE"
- do not populate new segment of locked areas
- do not copy normal areas if possible (anonymous/special must be copied)
addr = mmap(len, MAP_ANONYMOUS, ...); mlock(addr, len, MLOCK_ONFAULT); ... addr2 = mremap(addr, len, 2 * len, MREMAP_NOPOPULATE); ...
But with this, the user must remember what areas are locked with MLOCK_LOCKONFAULT and which are locked with prepopulate so the correct mremap flags can be used.
Yep. Shouldn't be hard. You anyway have to do some changes in user-space.
Sorry if I wasn't clear enough in my last reply, I think forcing userspace to track this is the wrong choice. The VM system is responsible for tracking these attributes and should continue to be.
Userspace tracks addresses and sizes of these areas. Plus mremap obviously works only with page granularity so a memory allocator in userspace has to know a lot about these structures. So keeping one more bit isn't rocket science.
Fair enough, however, my current implementation does not require that userspace keep track of any extra information. With the VM_LOCKONFAULT flag mremap() keeps the properties that were set with mlock() or equivalent across remaps.
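To make the "sticky flag" behaviour concrete, here is a small userspace model, not kernel code. It assumes, as the patch description states, that VM_LOCKONFAULT is only meaningful together with VM_LOCKED. The bit values, the struct, and the function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values only; the real ones live in include/linux/mm.h. */
#define MODEL_VM_LOCKED      0x1u
#define MODEL_VM_LOCKONFAULT 0x2u

/* Toy stand-in for a VMA: just a length and its flags. */
struct vma_model {
    unsigned long len;
    unsigned int  flags;
};

/* Prefault only when the area is locked and NOT marked lock-on-fault. */
bool model_should_populate(unsigned int flags)
{
    return (flags & (MODEL_VM_LOCKED | MODEL_VM_LOCKONFAULT)) == MODEL_VM_LOCKED;
}

/*
 * Model of growing a mapping: the flags travel with the VMA, so the
 * decision needs no help from userspace and no scan of the page tables.
 */
struct vma_model model_mremap_grow(struct vma_model old, unsigned long new_len,
                                   bool *populated)
{
    struct vma_model grown = { new_len, old.flags };

    assert(new_len >= old.len);
    assert(populated != NULL);
    *populated = model_should_populate(grown.flags);
    return grown;
}
```

In this model a region locked with both bits set stays unpopulated after it is grown, while a plainly locked region is populated as before. That is the property being argued for: the attribute is remembered by the VM rather than by the caller.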
A much simpler solution for user-space is an mm-wide flag which turns all further mlocks and MAP_LOCKED into lock-on-fault. Something like mlockall(MCL_NOPOPULATE_LOCKED).
This set certainly adds the foundation for such a change if you think it would be useful. That particular behavior was not part of my initial use case though.
This looks like a much easier solution: you don't need a new syscall, and after enabling that lock-on-fault mode userspace can still get the old behaviour simply by touching the newly locked area.
Again, this suggestion requires that userspace know more about VM than with my implementation and will require it to walk an entire mapping before use to fault it in if required. With the current implementation, mlock continues to function as it has, with the additional flexibility of being able to request that areas not be prepopulated.
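For reference, "walking an entire mapping to fault it in" would look roughly like the sketch below. The function name is invented. It assumes addr is page-aligned (as a mapping returned by mmap would be), that the range is writable, and that no other thread is writing to it while it is being walked.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Touch one byte in every page of [addr, addr + len) so that each page
 * is faulted in. Returns the number of pages touched.
 * Assumes addr is page-aligned and the range is writable.
 */
size_t touch_every_page(unsigned char *addr, size_t len, size_t page_size)
{
    volatile unsigned char *p = addr;
    size_t touched = 0;

    assert(page_size > 0);
    for (size_t off = 0; off < len; off += page_size) {
        /* Write the byte back unchanged: forces a write fault, keeps data. */
        p[off] = p[off];
        touched++;
    }
    return touched;
}
```

A caller would typically pass sysconf(_SC_PAGESIZE) as page_size. The cost is one fault per page of the mapping, paid by the application up front, which is the extra work being objected to here.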
On Fri 21-08-15 14:31:32, Eric B Munson wrote: [...]
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:
addr = mmap(len, MAP_ANONYMOUS, ...); mlock(addr, len, MLOCK_ONFAULT); ... mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
Yes mremap is a problem and it is very much similar to mmap(MAP_LOCKED). It doesn't guarantee the full mlock semantic because it leaves partially populated ranges behind without reporting any error.
Considering the current behavior I do not think it would be a terrible thing to do what Konstantin was suggesting and populate only the full ranges in a best effort mode (it is done so anyway) and document the behavior properly. " If the memory segment specified by old_address and old_size is locked (using mlock(2) or similar), then this lock is maintained when the segment is resized and/or relocated. As a consequence, the amount of memory locked by the process may change.
If the range is already fully populated and the range is enlarged the new range is attempted to be fully populated as well to preserve the full mlock semantic but there is no guarantee this will succeed. Partially populated (e.g. created by mlock(MLOCK_ONFAULT)) ranges do not have the full mlock semantic so they are not populated on resize. "
So what we have as a result is that partially populated ranges are preserved and fully populated ones work in the best effort mode the same way as they are now.
Does that sound at least remotely reasonable?
On 08/25/2015 03:41 PM, Michal Hocko wrote:
On Fri 21-08-15 14:31:32, Eric B Munson wrote: [...]
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:
addr = mmap(len, MAP_ANONYMOUS, ...); mlock(addr, len, MLOCK_ONFAULT); ... mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
Yes mremap is a problem and it is very much similar to mmap(MAP_LOCKED). It doesn't guarantee the full mlock semantic because it leaves partially populated ranges behind without reporting any error.
Hm, that's right.
Considering the current behavior I do not think it would be a terrible thing to do what Konstantin was suggesting and populate only the full ranges in a best effort mode (it is done so anyway) and document the behavior properly. " If the memory segment specified by old_address and old_size is locked (using mlock(2) or similar), then this lock is maintained when the segment is resized and/or relocated. As a consequence, the amount of memory locked by the process may change.
If the range is already fully populated and the range is enlarged the new range is attempted to be fully populated as well to preserve the full mlock semantic but there is no guarantee this will succeed. Partially populated (e.g. created by mlock(MLOCK_ONFAULT)) ranges do not have the full mlock semantic so they are not populated on resize.
"
So what we have as a result is that partially populated ranges are preserved and fully populated ones work in the best effort mode the same way as they are now.
Does that sound at least remotely reasonable?
I'll basically repeat what I said earlier:
- mremap scanning existing pte's to figure out the population would slow it down for no good reason
- it would be unreliable anyway:
  - example: was the area completely populated because MLOCK_ONFAULT was not used or because the process faulted it already
  - example: was the area not completely populated because MLOCK_ONFAULT was used, or because mmap(MAP_LOCKED) failed to populate it fully?
I think the first point is a pointless regression for workloads that use just plain mlock() and don't want the onfault semantics. Unless there's some shortcut? Does vma have a counter of how much is populated? (I don't think so?)
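The two "unreliable" examples above can be written down as a toy model. This is an illustration only, not kernel code: the types and names are invented, and a boolean array stands in for "the pte is populated". The scan sees only the array, never the mode that was originally requested.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NPAGES 4

enum lock_mode { LOCK_POPULATE, LOCK_ONFAULT };

struct region_model {
    enum lock_mode requested;        /* ground truth, invisible to a pte scan */
    bool present[MODEL_NPAGES];      /* stand-in for "pte is populated" */
};

/* What a scan of the existing ptes could find out. */
bool fully_populated(const struct region_model *r)
{
    for (size_t i = 0; i < MODEL_NPAGES; i++)
        if (!r->present[i])
            return false;
    return true;
}

/* The heuristic under discussion: infer the lock mode from population. */
enum lock_mode guess_from_scan(const struct region_model *r)
{
    return fully_populated(r) ? LOCK_POPULATE : LOCK_ONFAULT;
}
```

Feeding the model an on-fault region whose pages have all been touched, or a plainly locked region where population stopped halfway, makes guess_from_scan() disagree with the requested mode in both cases.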
On Tue 25-08-15 15:55:46, Vlastimil Babka wrote:
On 08/25/2015 03:41 PM, Michal Hocko wrote:
[...]
So what we have as a result is that partially populated ranges are preserved and fully populated ones work in the best effort mode the same way as they are now.
Does that sound at least remotely reasonable?
I'll basically repeat what I said earlier:
- mremap scanning existing pte's to figure out the population would slow it down for no good reason
So do we really need to populate the enlarged range? All the man page is saying is that the lock is maintained. Which will be still the case. It is true that the failure is unlikely (unless you are running in the memcg) but you cannot rely on the full mlock semantic so what would be a problem?
- it would be unreliable anyway:
- example: was the area completely populated because MLOCK_ONFAULT was not used or because the process faulted it already
OK, I see this as being a problem. Especially if the buffer is increased to 2*original_len
- example: was the area not completely populated because MLOCK_ONFAULT was used, or because mmap(MAP_LOCKED) failed to populate it fully?
What would be the difference? Both are ONFAULT now.
I think the first point is a pointless regression for workloads that use just plain mlock() and don't want the onfault semantics. Unless there's some shortcut? Does vma have a counter of how much is populated? (I don't think so?)
On Tue, Aug 25, 2015 at 4:41 PM, Michal Hocko mhocko@kernel.org wrote:
On Fri 21-08-15 14:31:32, Eric B Munson wrote: [...]
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:
addr = mmap(len, MAP_ANONYMOUS, ...); mlock(addr, len, MLOCK_ONFAULT); ... mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
Yes mremap is a problem and it is very much similar to mmap(MAP_LOCKED). It doesn't guarantee the full mlock semantic because it leaves partially populated ranges behind without reporting any error.
Considering the current behavior I do not think it would be a terrible thing to do what Konstantin was suggesting and populate only the full ranges in a best effort mode (it is done so anyway) and document the behavior properly. " If the memory segment specified by old_address and old_size is locked (using mlock(2) or similar), then this lock is maintained when the segment is resized and/or relocated. As a consequence, the amount of memory locked by the process may change.
If the range is already fully populated and the range is enlarged the new range is attempted to be fully populated as well to preserve the full mlock semantic but there is no guarantee this will succeed. Partially populated (e.g. created by mlock(MLOCK_ONFAULT)) ranges do not have the full mlock semantic so they are not populated on resize.
"
So what we have as a result is that partially populated ranges are preserved and fully populated ones work in the best effort mode the same way as they are now.
Does that sound at least remotely reasonable?
The problem is that mremap has to scan ptes to detect that, and the old behaviour becomes very fragile: one failure and mremap will never populate that vma again. For now I think the new flag "MREMAP_NOPOPULATE" is a better option.
-- Michal Hocko SUSE Labs
On Tue, 25 Aug 2015, Michal Hocko wrote:
On Fri 21-08-15 14:31:32, Eric B Munson wrote: [...]
I am in the middle of implementing lock on fault this way, but I cannot see how we will handle mremap of a lock on fault region. Say we have the following:
addr = mmap(len, MAP_ANONYMOUS, ...); mlock(addr, len, MLOCK_ONFAULT); ... mremap(addr, len, 2 * len, ...)
There is no way for mremap to know that the area being remapped was lock on fault so it will be locked and prefaulted by remap. How can we avoid this without tracking per vma if it was locked with lock or lock on fault?
Yes mremap is a problem and it is very much similar to mmap(MAP_LOCKED). It doesn't guarantee the full mlock semantic because it leaves partially populated ranges behind without reporting any error.
This was not my concern. Instead, I was wondering how to keep lock on fault semantics with mremap if we do not have a VMA flag. As a user, it would surprise me if a region I mlocked with lock on fault and then remapped to a larger size was fully populated and locked by the mremap call.
Considering the current behavior I do not think it would be a terrible thing to do what Konstantin was suggesting and populate only the full ranges in a best effort mode (it is done so anyway) and document the behavior properly. " If the memory segment specified by old_address and old_size is locked (using mlock(2) or similar), then this lock is maintained when the segment is resized and/or relocated. As a consequence, the amount of memory locked by the process may change.
If the range is already fully populated and the range is enlarged the new range is attempted to be fully populated as well to preserve the full mlock semantic but there is no guarantee this will succeed. Partially populated (e.g. created by mlock(MLOCK_ONFAULT)) ranges do not have the full mlock semantic so they are not populated on resize.
"
You are proposing that mremap would scan the PTEs as Vlastimil has suggested?
So what we have as a result is that partially populated ranges are preserved and fully populated ones work in the best effort mode the same way as they are now.
Does that sound at least remotely reasonable?
-- Michal Hocko SUSE Labs
On Tue 25-08-15 10:29:02, Eric B Munson wrote:
On Tue, 25 Aug 2015, Michal Hocko wrote:
[...]
Considering the current behavior I do not think it would be a terrible thing to do what Konstantin was suggesting and populate only the full ranges in a best effort mode (it is done so anyway) and document the behavior properly. " If the memory segment specified by old_address and old_size is locked (using mlock(2) or similar), then this lock is maintained when the segment is resized and/or relocated. As a consequence, the amount of memory locked by the process may change.
If the range is already fully populated and the range is enlarged the new range is attempted to be fully populated as well to preserve the full mlock semantic but there is no guarantee this will succeed. Partially populated (e.g. created by mlock(MLOCK_ONFAULT)) ranges do not have the full mlock semantic so they are not populated on resize.
"
You are proposing that mremap would scan the PTEs as Vlastimil has suggested?
As Vlastimil pointed out this would be unnecessarily costly. But I am wondering whether we should populate at all during mremap considering the full mlock semantic is not guaranteed anyway. The man page mentions only that the lock is maintained, which will be true without population as well.
If somebody really depends on the current (and broken) implementation we can offer MREMAP_POPULATE which would do a best effort population. This would be independent of the locked state and would be usable for other mappings as well (the use case would be to save page fault overhead by batching them).
If this would be seen as an unacceptable user visible change of behavior then we can go with the VMA flag, but I would still prefer not to export it to userspace so that we have a way to change this in the future.
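The two options weighed in this message can be written down as a small decision function. This is only a userspace sketch for comparing the proposals, not kernel code: the enum, its names, and `populate_on_grow()` are hypothetical, and MREMAP_POPULATE is a suggestion from this thread rather than an existing mremap flag.

```c
#include <stdbool.h>

/* Hypothetical policy names used only for this comparison. */
enum grow_policy {
    POLICY_VMA_FLAG,      /* populate locked ranges unless they are lock-on-fault */
    POLICY_EXPLICIT_ONLY  /* populate only on request (the MREMAP_POPULATE idea) */
};

/*
 * Decide whether the newly added part of a grown mapping should be
 * populated (best effort) by the resize.
 *
 * locked:           the range is mlocked
 * lock_on_fault:    the range was locked without prefaulting
 * caller_requested: the caller explicitly asked for population
 */
static bool populate_on_grow(enum grow_policy policy, bool locked,
                             bool lock_on_fault, bool caller_requested)
{
    switch (policy) {
    case POLICY_VMA_FLAG:
        /* Fully locked ranges are populated, lock-on-fault ranges are not. */
        return locked && !lock_on_fault;
    case POLICY_EXPLICIT_ONLY:
        /* The lock state no longer matters, only the explicit request. */
        return caller_requested;
    }
    return false;
}
```

Under the first policy the lock-on-fault state has to be remembered per range, which is what the VMA flag provides. Under the second, no per-range state is needed, but existing callers that rely on population during resize would see a behavior change.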
On Tue, 25 Aug 2015, Michal Hocko wrote:
On Tue 25-08-15 10:29:02, Eric B Munson wrote:
On Tue, 25 Aug 2015, Michal Hocko wrote:
[...]
Considering the current behavior I do not think it would be a terrible thing to do what Konstantin was suggesting and populate only the full ranges in a best effort mode (it is done so anyway) and document the behavior properly. " If the memory segment specified by old_address and old_size is locked (using mlock(2) or similar), then this lock is maintained when the segment is resized and/or relocated. As a consequence, the amount of memory locked by the process may change.
If the range is already fully populated and the range is enlarged, an attempt is made to fully populate the new range as well to preserve the full mlock semantic, but there is no guarantee this will succeed. Partially populated ranges (e.g. created by mlock(MLOCK_ONFAULT)) do not have the full mlock semantic, so they are not populated on resize.
"
You are proposing that mremap would scan the PTEs as Vlastimil has suggested?
As Vlastimil pointed out this would be unnecessarily costly. But I am wondering whether we should populate at all during mremap considering the full mlock semantic is not guaranteed anyway. The man page mentions only that the lock is maintained, which will be true without population as well.
If somebody really depends on the current (and broken) implementation we can offer MREMAP_POPULATE which would do a best effort population. This would be independent of the locked state and would be usable for other mappings as well (the use case would be to save page fault overhead by batching them).
If this would be seen as an unacceptable user visible change of behavior then we can go with the VMA flag, but I would still prefer not to export it to userspace so that we have a way to change this in the future.
Would you drop your objections to the VMA flag if I drop the portions of the patch that expose it to userspace?
The rework to not use the VMA flag is pretty sizeable and is much more ugly IMO. I know that you are not wild about using bit 30 of 32 for this, but perhaps we can settle on not exporting it to userspace so we can reclaim it if we really need it in the future? I can teach the folks here to check for size vs RSS of the locked mappings for stats on lock on fault usage so from my point of view, the proc changes are not necessary.
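The "size vs RSS of the locked mappings" check can be done from userspace without any new proc output. The following is a rough sketch, assuming the usual /proc/<pid>/smaps layout in which each mapping lists `Size:` and `Rss:` in kB and ends with a `VmFlags:` line where `lo` marks a locked mapping. The struct and function names are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

struct locked_stats {
    unsigned long size_kb; /* total Size of locked mappings */
    unsigned long rss_kb;  /* total Rss of locked mappings */
};

/* Return 1 if the space-separated flag list contains the given flag. */
static int has_vm_flag(const char *flags, const char *flag)
{
    size_t flen = strlen(flag);
    const char *p = flags;

    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == flags) || (p[-1] == ' ');
        int ends = (p[flen] == ' ' || p[flen] == '\0');
        if (starts && ends)
            return 1;
        p++;
    }
    return 0;
}

/* Sum Size and Rss over every mapping in smaps text that carries "lo". */
static struct locked_stats locked_stats_from_smaps(const char *text)
{
    struct locked_stats st = { 0, 0 };
    unsigned long size = 0, rss = 0, v;
    const char *line = text;

    while (*line) {
        char buf[256];
        size_t len = strcspn(line, "\n");
        size_t n = len < sizeof(buf) - 1 ? len : sizeof(buf) - 1;

        memcpy(buf, line, n);
        buf[n] = '\0';

        if (sscanf(buf, "Size: %lu kB", &v) == 1) {
            size = v;
        } else if (sscanf(buf, "Rss: %lu kB", &v) == 1) {
            rss = v;
        } else if (strncmp(buf, "VmFlags:", 8) == 0) {
            /* VmFlags closes the entry, so account for it and reset. */
            if (has_vm_flag(buf + 8, "lo")) {
                st.size_kb += size;
                st.rss_kb += rss;
            }
            size = rss = 0;
        }

        line += len;
        if (*line == '\n')
            line++;
    }
    return st;
}
```

The difference `size_kb - rss_kb` is the part of the locked mappings that has not been faulted in. For a fully populated mlock it should be close to zero, so a large gap points at lock on fault usage or at a partially populated range.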
-- Michal Hocko SUSE Labs
On Tue 25-08-15 15:03:00, Eric B Munson wrote: [...]
Would you drop your objections to the VMA flag if I drop the portions of the patch that expose it to userspace?
The rework to not use the VMA flag is pretty sizeable and is much more ugly IMO. I know that you are not wild about using bit 30 of 32 for this, but perhaps we can settle on not exporting it to userspace so we can reclaim it if we really need it in the future?
Yes, that would definitely be more acceptable to me. I do understand that you are not wild about changing mremap behavior.
Anyway, I would really prefer it if the vma flag was used in only a few places: when we are clearing it along with VM_LOCKED (which could be hidden in VM_LOCKED_CLEAR_MASK or something like that) and when we decide whether to populate or not (this should be __mm_populate). But maybe I am missing some call paths where gup is called unconditionally, I haven't checked that.
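The two places named above could look roughly like the following userspace mock. It is a sketch of the idea rather than the patch itself: the bit chosen for VM_LOCKONFAULT is an assumption taken from the "bit 30" remark earlier in the thread, and both helper functions are hypothetical names.

```c
#include <stdbool.h>

/* Illustrative values; the authoritative ones are in include/linux/mm.h. */
#define VM_LOCKED       0x00002000UL
#define VM_LOCKONFAULT  0x40000000UL /* assumed: "bit 30" from the thread */

/* Clearing the lock state always drops both bits together. */
#define VM_LOCKED_CLEAR_MASK (~(VM_LOCKED | VM_LOCKONFAULT))

/* Hypothetical helper: what munlock-style paths would do to vm_flags. */
static unsigned long vma_clear_lock(unsigned long vm_flags)
{
    return vm_flags & VM_LOCKED_CLEAR_MASK;
}

/*
 * Hypothetical helper: the single populate decision. Only a VMA that is
 * locked and NOT lock-on-fault gets prefaulted.
 */
static bool vma_wants_populate(unsigned long vm_flags)
{
    return (vm_flags & (VM_LOCKED | VM_LOCKONFAULT)) == VM_LOCKED;
}
```

Hiding the second bit behind one mask and one predicate keeps the rest of the mm code testing VM_LOCKED as before, which is the point of limiting where the new flag is visible.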
On 08/26/2015 09:20 AM, Michal Hocko wrote:
On Tue 25-08-15 15:03:00, Eric B Munson wrote: [...]
Would you drop your objections to the VMA flag if I drop the portions of the patch that expose it to userspace?
The rework to not use the VMA flag is pretty sizeable and is much more ugly IMO. I know that you are not wild about using bit 30 of 32 for this, but perhaps we can settle on not exporting it to userspace so we can reclaim it if we really need it in the future?
Yes, that would definitely be more acceptable to me. I do understand that you are not wild about changing mremap behavior.
+1