On Tue, Feb 15, 2022 at 07:32:09PM +0100, Christoph Hellwig wrote:
On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My assumption was that they would be part of a special mapping.
We need to stop using the special PTEs and VMAs for things that have a struct page. This is a mistake DAX created that must be undone.
Yes, we'll get to it. Maybe we can do it for the non-DAX devmap ptes first given that DAX is more complicated.
Probably, I think we can check the page->pgmap type to tell the difference.
I'm not sure how the DEVICE_GENERIC can work without this, as DAX was made safe by using the unmap_mapping_range(), which won't work here. Is there some other trick being used to keep track of references inside the AMD driver?
Jason
On 2022-02-15 14:41, Jason Gunthorpe wrote:
I'm not sure how the DEVICE_GENERIC can work without this, as DAX was made safe by using the unmap_mapping_range(), which won't work here. Is there some other trick being used to keep track of references inside the AMD driver?
Not sure I'm following all the discussion about VMAs and DAX. So I may be answering the wrong question: We treat each ZONE_DEVICE page as a reference to the BO (buffer object) that backs the page. We increment the BO refcount for each page we migrate into it. In the dev_pagemap_ops.page_free callback we drop that reference. Once all pages backed by a BO are freed, the BO refcount reaches 0 [*] and we can free the BO allocation.
Regards, Felix
[*] That's a somewhat simplified view. There may be other references to the BO, which allows us to reuse the same BO for the same virtual address range.
On Tue, Feb 15, 2022 at 04:35:56PM -0500, Felix Kuehling wrote:
On 2022-02-15 14:41, Jason Gunthorpe wrote:
On Tue, Feb 15, 2022 at 07:32:09PM +0100, Christoph Hellwig wrote:
On Tue, Feb 15, 2022 at 10:45:24AM -0400, Jason Gunthorpe wrote:
Do you know if DEVICE_GENERIC pages would end up as PageAnon()? My assumption was that they would be part of a special mapping.
We need to stop using the special PTEs and VMAs for things that have a struct page. This is a mistake DAX created that must be undone.
Yes, we'll get to it. Maybe we can do it for the non-DAX devmap ptes first given that DAX is more complicated.
Probably, I think we can check the page->pgmap type to tell the difference.
I'm not sure how the DEVICE_GENERIC can work without this, as DAX was made safe by using the unmap_mapping_range(), which won't work here. Is there some other trick being used to keep track of references inside the AMD driver?
Not sure I'm following all the discussion about VMAs and DAX. So I may be answering the wrong question: We treat each ZONE_DEVICE page as a reference to the BO (buffer object) that backs the page. We increment the BO refcount for each page we migrate into it. In the dev_pagemap_ops.page_free callback we drop that reference. Once all pages backed by a BO are freed, the BO refcount reaches 0 [*] and we can free the BO allocation.
Userspace does
- mmap(MAP_PRIVATE) to allocate anon memory
- something to trigger migration to install a ZONE_DEVICE page
- munmap()
Who decrements the refcount on the munmap?
When a ZONE_DEVICE page is installed in the PTE, it is supposed to be marked as pte_devmap, and that disables all the normal page refcounting during munmap().
fsdax makes this work by working the refcounts backwards: the page is refcounted while it exists in the driver. When the driver decides to remove it, unmap_mapping_range() is called to purge it from all PTEs, and only then is the refcount decremented. munmap()/fork()/etc. don't change the refcount.
Jason
On 2022-02-15 16:47, Jason Gunthorpe wrote:
When a ZONE_DEVICE page is installed in the PTE, it is supposed to be marked as pte_devmap, and that disables all the normal page refcounting during munmap().
fsdax makes this work by working the refcounts backwards: the page is refcounted while it exists in the driver. When the driver decides to remove it, unmap_mapping_range() is called to purge it from all PTEs, and only then is the refcount decremented. munmap()/fork()/etc. don't change the refcount.
Hmm, that just means whether or not there are PTEs doesn't really matter. It should work the same as it does for DEVICE_PRIVATE pages. I'm not sure where DEVICE_PRIVATE pages' refcounts are decremented on unmap, TBH. But I can't find it in our driver, or in the test_hmm driver for that matter.
Regards, Felix
On Tue, Feb 15, 2022 at 05:49:07PM -0500, Felix Kuehling wrote:
Hmm, that just means whether or not there are PTEs doesn't really matter.
Yes, that is the FSDAX model
It should work the same as it does for DEVICE_PRIVATE pages. I'm not sure where DEVICE_PRIVATE page's refcounts are decremented on unmap, TBH. But I can't find it in our driver, or in the test_hmm driver for that matter.
It is not the same as DEVICE_PRIVATE because DEVICE_PRIVATE uses swap entries. The put_page for that case is here:
static unsigned long zap_pte_range(struct mmu_gather *tlb,
				struct vm_area_struct *vma, pmd_t *pmd,
				unsigned long addr, unsigned long end,
				struct zap_details *details)
{
	[..]
		if (is_device_private_entry(entry) ||
		    is_device_exclusive_entry(entry)) {
			struct page *page = pfn_swap_entry_to_page(entry);

			if (unlikely(zap_skip_check_mapping(details, page)))
				continue;
			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
			rss[mm_counter(page)]--;

			if (is_device_private_entry(entry))
				page_remove_rmap(page, false);

			put_page(page);
However the devmap case will return NULL from vm_normal_page() and won't do the put_page() embedded inside the __tlb_remove_page() in the pte_present() block in the same function.
After reflecting for a while, I think Christoph's idea is quite good. Just make it so you don't set pte_devmap() on your pages, and then let's avoid pte_devmap for all refcount-correct ZONE_DEVICE pages.
Jason
On 2022-02-15 at 21:01, Jason Gunthorpe wrote:
After reflecting for a while, I think Christoph's idea is quite good. Just make it so you don't set pte_devmap() on your pages, and then let's avoid pte_devmap for all refcount-correct ZONE_DEVICE pages.
I'm not sure if pte_devmap is actually set for our DEVICE_COHERENT pages. As far as I can tell, this comes from a bit in the pfn:
#define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
#define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))
...
static inline bool pfn_t_devmap(pfn_t pfn)
{
	const u64 flags = PFN_DEV|PFN_MAP;

	return (pfn.val & flags) == flags;
}
In the case of DEVICE_COHERENT memory, the pfns correspond to real physical memory addresses. I don't think they have those PFN_DEV|PFN_MAP bits set.
Regards, Felix
On Wed, Feb 16, 2022 at 11:56:51AM -0500, Felix Kuehling wrote:
In the case of DEVICE_COHERENT memory, the pfns correspond to real physical memory addresses. I don't think they have those PFN_DEV|PFN_MAP bits set.
So do DAX pages. The PTE flag does several things. As this would be the first time ZONE_DEVICE pages do not set devmap it needs a full audit.
eg the gup_fast bug Alistair pointed at needs fixing at least.
Jason
Jason Gunthorpe jgg@nvidia.com writes:
Who decrements the refcount on the munmap?
When a ZONE_DEVICE page is installed in the PTE, it is supposed to be marked as pte_devmap, and that disables all the normal page refcounting during munmap().
Device private and device coherent pages are not marked with pte_devmap and they are backed by a struct page. The only way of inserting them is via migrate_vma. The refcount is decremented in zap_pte_range() on munmap() with special handling for device private pages. Looking at it again though I wonder if there is any special treatment required in zap_pte_range() for device coherent pages given they count as present pages.
fsdax makes this work by working the refcounts backwards: the page is refcounted while it exists in the driver. When the driver decides to remove it, unmap_mapping_range() is called to purge it from all PTEs, and only then is the refcount decremented. munmap()/fork()/etc. don't change the refcount.
The equivalent here is for drivers to use migrate_vma to migrate the pages back from device memory to CPU memory. In this case the refcounting is (mostly) handled by migration code which decrements the refcount on the original source device page during the migration.
- Alistair
On Wed, Feb 16, 2022 at 12:23:44PM +1100, Alistair Popple wrote:
Device private and device coherent pages are not marked with pte_devmap and they are backed by a struct page. The only way of inserting them is via migrate_vma. The refcount is decremented in zap_pte_range() on munmap() with special handling for device private pages. Looking at it again though I wonder if there is any special treatment required in zap_pte_range() for device coherent pages given they count as present pages.
This is what I guessed, but we shouldn't be able to just drop pte_devmap on these pages without any other work?? Granted it does very little already..
I thought at least gup_fast needed to be touched or did this get handled by scanning the page list after the fact?
Jason
On Wednesday, 16 February 2022 1:03:57 PM AEDT Jason Gunthorpe wrote:
This is what I guessed, but we shouldn't be able to just drop pte_devmap on these pages without any other work?? Granted it does very little already..
Yes, I agree we need to check this more closely. For device private pages not having pte_devmap is fine, because they are non-present swap entries so they always get special handling in the swap entry paths but the same isn't true for coherent device pages.
I thought at least gup_fast needed to be touched or did this get handled by scanning the page list after the fact?
Right, for gup I think the only special handling required is to prevent pinning. I had assumed that check_and_migrate_movable_pages() would still get called for gup_fast but unless I've missed something I don't think it does. That means gup_fast could still pin movable and coherent pages. Technically that is ok for coherent pages, but it's undesirable.
- Alistair
On 16.02.22 03:36, Alistair Popple wrote:
Yes, I agree we need to check this more closely. For device private pages not having pte_devmap is fine, because they are non-present swap entries so they always get special handling in the swap entry paths but the same isn't true for coherent device pages.
I'm curious, how does the refcount of a PageAnon() DEVICE_COHERENT page look like when mapped? I'd assume it's also (currently) still offset by one, meaning, if it's mapped into a single page table it's always at least 2.
Just a note that if my assumption is correct and if we'd have such a page mapped R/O, do_wp_page() would always have to copy it unconditionally and would not be able to reuse it on write faults. (while I'm working on improving the reuse logic, I think there is also work in progress to avoid this additional reference on some ZONE_DEVICE stuff -- I'd assume that would include DEVICE_COHERENT ?)
Right, for gup I think the only special handling required is to prevent pinning. I had assumed that check_and_migrate_movable_pages() would still get called for gup_fast but unless I've missed something I don't think it does. That means gup_fast could still pin movable and coherent pages. Technically that is ok for coherent pages, but it's undesirable.
We really should have the same pinning rules for GUP vs. GUP-fast. is_pinnable_page() should be the right place for such checks (similarly as indicated in my reply to the migration series).
On Wed, Feb 16, 2022 at 09:31:03AM +0100, David Hildenbrand wrote:
I'm curious, how does the refcount of a PageAnon() DEVICE_COHERENT page look like when mapped? I'd assume it's also (currently) still offset by one, meaning, if it's mapped into a single page table it's always at least 2.
Christoph fixed this offset by one and updated the DEVICE_COHERENT patchset, I hope we will see that version merged.
We really should have the same pinning rules for GUP vs. GUP-fast. is_pinnable_page() should be the right place for such checks (similarly as indicated in my reply to the migration series).
Yes, I think this is a bug too.
The other place that needs careful audit is all the callers using vm_normal_page() - they must all be able to accept a ZONE_DEVICE page if we don't set pte_devmap.
Jason
Jason Gunthorpe jgg@nvidia.com writes:
We really should have the same pinning rules for GUP vs. GUP-fast. is_pinnable_page() should be the right place for such checks (similarly as indicated in my reply to the migration series).
Yes, I think this is a bug too.
Agreed, I will add a fix for it to my series as I was surprised the rules for PUP-fast were different. I can see how this happened though - check_and_migrate_cma_pages() (the precursor to check_and_migrate_movable_pages()) was added before PUP-fast and FOLL_LONGTERM so I guess we just never added this check there.
- Alistair
On 2022-02-16 at 07:26, Jason Gunthorpe wrote:
The other place that needs careful audit is all the callers using vm_normal_page() - they must all be able to accept a ZONE_DEVICE page if we don't set pte_devmap.
How much code are we talking about here? A quick search finds 26 call-sites in 12 files in current master:
fs/proc/task_mmu.c
mm/hmm.c
mm/gup.c
mm/huge_memory.c (vm_normal_page_pmd)
mm/khugepaged.c
mm/madvise.c
mm/mempolicy.c
mm/memory.c
mm/mlock.c
mm/migrate.c
mm/mprotect.c
mm/memcontrol.c
I'm thinking of a more theoretical approach: Instead of auditing all users, I'd ask, what are the invariants that a vm_normal_page should have. Then check, whether our DEVICE_COHERENT pages satisfy them. But maybe the concept of a vm_normal_page isn't defined clearly enough for that.
That said, I think we (Alex and myself) made an implicit assumption from the start, that a DEVICE_COHERENT page should behave a lot like a normal page in terms of VMA mappings, even if we didn't know what that means in detail.
I can now at least name some differences between DEVICE_COHERENT and normal pages: how the memory is allocated, how data is migrated into DEVICE_COHERENT pages and that it can't be on any LRU list (because the lru list_head in struct page is aliased by pgmap and zone_device_data). Maybe I'll find more differences if I keep digging.
Regards, Felix
On Thu, Feb 17, 2022 at 04:12:20PM -0500, Felix Kuehling wrote:
I'm thinking of a more theoretical approach: Instead of auditing all users, I'd ask, what are the invariants that a vm_normal_page should have. Then check, whether our DEVICE_COHERENT pages satisfy them. But maybe the concept of a vm_normal_page isn't defined clearly enough for that.
I would say the expectation is that only 'page cache and anon' pages are returned - ie the first union in struct page
This is because the first file in your list I looked at:
static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
				unsigned long addr, unsigned long end,
				struct mm_walk *walk)
{
	[..]
		page = vm_normal_page(vma, addr, ptent);
	[..]
		if (pageout) {
			if (!isolate_lru_page(page)) {
Uses the LRU field, so this is incompatible with all the other page types.
One mitigation of this might be to formally make vm_normal_page() == 'pte to page cache and anon page' and add a new function that is 'pte to any struct page'
Then go through and sort callers as appropriate.
The 'pte to page cache and anon page' can detect ZONE_DEVICE by calling is_zone_device_page() instead of pte_devmap() and then continue to return NULL. This same trick will fix GUP_fast.
Jason
On 2022-02-17 at 19:19, Jason Gunthorpe wrote:
One mitigation of this might be to formally make vm_normal_page() == 'pte to page cache and anon page' and add a new function that is 'pte to any struct page'
Then go through and sort callers as appropriate.
The 'pte to page cache and anon page' can detect ZONE_DEVICE by calling is_zone_device_page() instead of pte_devmap() and then continue to return NULL. This same trick will fix GUP_fast.
Sounds good to me. What about vm_normal_page_pmd? Should we remove the pmd_devmap check from that function as well? I'm not even sure what a huge zone_device page would look like, but maybe that's a worthwhile future optimization for our driver.
I'd propose the function names vm_normal_page and vm_normal_or_device_page for the two functions you described. The latter would basically be the current vm_normal_page with the pte_devmap check removed. vm_normal_page could be implemented as a wrapper around vm_normal_or_device_page, which just adds the !is_zone_device_page() check.
Regards, Felix
On Fri, Feb 18, 2022 at 02:20:45PM -0500, Felix Kuehling wrote:
Sounds good to me. What about vm_normal_page_pmd? Should we remove the pmd_devmap check from that function as well? I'm not even sure what a huge zone_device page would look like, but maybe that's a worthwhile future optimization for our driver.
IIRC there are other problems here as PMDs are currently wired to THPs and not general at all..
We have huge zone_device pages, it is just any compound page of contiguous pfns - you should be aggregating any contiguous string of logical PFNs together into a folio for performance. If the folio is stuffed into a PMD or not is a different question..
I'd propose the function names vm_normal_page and vm_normal_or_device_page for the two functions you described.
I wouldn't say device_page, it should be any type of page - though device_page is the only other option ATM, AFAIK.
current vm_normal_page with the pte_devmap check removed. vm_normal_page could be implemented as a wrapper around vm_normal_or_device_page, which just adds the !is_zone_device_page() check.
Yes
Jason
On 2022-02-18 at 14:26, Jason Gunthorpe wrote:
On Fri, Feb 18, 2022 at 02:20:45PM -0500, Felix Kuehling wrote:
Am 2022-02-17 um 19:19 schrieb Jason Gunthorpe:
On Thu, Feb 17, 2022 at 04:12:20PM -0500, Felix Kuehling wrote:
I'm thinking of a more theoretical approach: Instead of auditing all users, I'd ask, what are the invariants that a vm_normal_page should have. Then check, whether our DEVICE_COHERENT pages satisfy them. But maybe the concept of a vm_normal_page isn't defined clearly enough for that.
I would say the expectation is that only 'page cache and anon' pages are returned - ie the first union in struct page
This is because the first file in your list I looked at:
static int madvise_cold_or_pageout_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end, struct mm_walk *walk)
{ page = vm_normal_page(vma, addr, ptent); [..] if (pageout) { if (!isolate_lru_page(page)) {
Uses the LRU field, so this is incompatible with all the other page types.
One mitigation of this might be to formally make vm_normal_page() == 'pte to page cache and anon page' and add a new function that is 'pte to any struct page'
Then go through and sort callers as appropriate.
The 'pte to page cache and anon page' can detect ZONE_DEVICE by calling is_zone_device_page() insted of pte_devmap() and then continue to return NULL. This same trick will fix GUP_fast.
Sounds good to me. What about vm_normal_page_pmd? Should we remove the pmd_devmap check from that function as well. I'm not even sure what a huge zone_device page would look like, but maybe that's a worthwhile future optimization for our driver.
IIRC there are other problems here, as PMDs are currently wired to THPs and not general at all.
We have huge zone_device pages; it is just any compound page of contiguous pfns. You should be aggregating any contiguous string of logical PFNs together into a folio for performance. Whether the folio is stuffed into a PMD is a different question.
I'd propose the function names vm_normal_page and vm_normal_or_device_page for the two functions you described.
I wouldn't say device_page, it should be any type of page - though device_page is the only other option ATM, AFAIK.
Ok, then how about vm_normal_lru_page and vm_normal_any_page?
Regards, Felix
current vm_normal_page with the pte_devmap check removed. vm_normal_page could be implemented as a wrapper around vm_normal_or_device_page, which just adds the !is_zone_device_page() check.
Yes
Jason
DEVICE_COHERENT pages introduce a subtle distinction in the way "normal" pages can be used by various callers throughout the kernel. They behave like normal pages for purposes of mapping in CPU page tables, and for COW. But they do not support LRU lists, NUMA migration or THP. Therefore we split vm_normal_page into two functions vm_normal_any_page and vm_normal_lru_page. The latter will only return pages that can be put on an LRU list and that support NUMA migration and THP.
We also introduce a FOLL_LRU flag that adds the same behaviour to follow_page and related APIs, to allow callers to specify that they expect to put pages on an LRU list.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
---
 fs/proc/task_mmu.c  | 12 +++++-----
 include/linux/mm.h  | 53 ++++++++++++++++++++++++---------------------
 mm/gup.c            | 10 +++++----
 mm/hmm.c            |  2 +-
 mm/huge_memory.c    |  2 +-
 mm/khugepaged.c     |  8 +++----
 mm/ksm.c            |  4 ++--
 mm/madvise.c        |  4 ++--
 mm/memcontrol.c     |  2 +-
 mm/memory.c         | 38 ++++++++++++++++++++++----------
 mm/mempolicy.c      |  4 ++--
 mm/migrate.c        |  2 +-
 mm/migrate_device.c |  2 +-
 mm/mlock.c          |  6 ++---
 mm/mprotect.c       |  2 +-
 15 files changed, 85 insertions(+), 66 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 18f8c3acbb85..4274128fbb4c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -519,7 +519,7 @@ static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 	struct page *page = NULL;

 	if (pte_present(*pte)) {
-		page = vm_normal_page(vma, addr, *pte);
+		page = vm_normal_any_page(vma, addr, *pte);
 	} else if (is_swap_pte(*pte)) {
 		swp_entry_t swpent = pte_to_swp_entry(*pte);
@@ -705,7 +705,7 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 	struct page *page = NULL;

 	if (pte_present(*pte)) {
-		page = vm_normal_page(vma, addr, *pte);
+		page = vm_normal_any_page(vma, addr, *pte);
 	} else if (is_swap_pte(*pte)) {
 		swp_entry_t swpent = pte_to_swp_entry(*pte);
@@ -1059,7 +1059,7 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
 		return false;
 	if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)))
 		return false;
-	page = vm_normal_page(vma, addr, pte);
+	page = vm_normal_any_page(vma, addr, pte);
 	if (!page)
 		return false;
 	return page_maybe_dma_pinned(page);
@@ -1172,7 +1172,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
 			if (!pte_present(ptent))
 				continue;

-			page = vm_normal_page(vma, addr, ptent);
+			page = vm_normal_any_page(vma, addr, ptent);
 			if (!page)
 				continue;

@@ -1383,7 +1383,7 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		if (pm->show_pfn)
 			frame = pte_pfn(pte);
 		flags |= PM_PRESENT;
-		page = vm_normal_page(vma, addr, pte);
+		page = vm_normal_any_page(vma, addr, pte);
 		if (pte_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
 		if (pte_uffd_wp(pte))
@@ -1761,7 +1761,7 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
 	if (!pte_present(pte))
 		return NULL;

-	page = vm_normal_page(vma, addr, pte);
+	page = vm_normal_lru_page(vma, addr, pte);
 	if (!page)
 		return NULL;

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ff9f149ca201..8c9f87151d93 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -593,8 +593,8 @@ struct vm_operations_struct {
 					unsigned long addr);
 #endif
 	/*
-	 * Called by vm_normal_page() for special PTEs to find the
-	 * page for @addr. This is useful if the default behavior
+	 * Called by vm_normal_x_page() for special PTEs to find the
+	 * page for @addr. This is useful if the default behavior
 	 * (using pte_page()) would not find the correct page.
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
@@ -1781,7 +1781,9 @@ static inline bool can_do_mlock(void) { return false; }
 extern int user_shm_lock(size_t, struct ucounts *);
 extern void user_shm_unlock(size_t, struct ucounts *);

-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+struct page *vm_normal_any_page(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte);
+struct page *vm_normal_lru_page(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t pmd);
@@ -2880,27 +2882,28 @@ static inline vm_fault_t vmf_error(int err)
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 			 unsigned int foll_flags);

-#define FOLL_WRITE	0x01	/* check pte is writable */
-#define FOLL_TOUCH	0x02	/* mark page accessed */
-#define FOLL_GET	0x04	/* do get_page on page */
-#define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
-#define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
-#define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
-				 * and return without waiting upon it */
-#define FOLL_POPULATE	0x40	/* fault in pages (with FOLL_MLOCK) */
-#define FOLL_NOFAULT	0x80	/* do not fault in pages */
-#define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
-#define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
-#define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
-#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
-#define FOLL_MLOCK	0x1000	/* lock present pages */
-#define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
-#define FOLL_COW	0x4000	/* internal GUP flag */
-#define FOLL_ANON	0x8000	/* don't do file mappings */
-#define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
-#define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
-#define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
-#define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_WRITE	0x01	/* check pte is writable */
+#define FOLL_TOUCH	0x02	/* mark page accessed */
+#define FOLL_GET	0x04	/* do get_page on page */
+#define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
+#define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
+#define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
+				 * and return without waiting upon it */
+#define FOLL_POPULATE	0x40	/* fault in pages (with FOLL_MLOCK) */
+#define FOLL_NOFAULT	0x80	/* do not fault in pages */
+#define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
+#define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
+#define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
+#define FOLL_MLOCK	0x1000	/* lock present pages */
+#define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
+#define FOLL_COW	0x4000	/* internal GUP flag */
+#define FOLL_ANON	0x8000	/* don't do file mappings */
+#define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
+#define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
+#define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
+#define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_LRU	0x100000 /* return only LRU (anon or page cache) */

 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
@@ -3227,7 +3230,7 @@ extern long copy_huge_page_from_user(struct page *dst_page,
  * @vma: Pointer to the struct vm_area_struct to consider
  *
  * Whether transhuge page-table entries are considered "special" following
- * the definition in vm_normal_page().
+ * the definition in vm_normal_x_page().
  *
  * Return: true if transhuge page-table entries should be considered special,
  * false otherwise.
diff --git a/mm/gup.c b/mm/gup.c
index 41349b685eaf..9e172c906ded 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -539,8 +539,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		pte_unmap_unlock(ptep, ptl);
 		return NULL;
 	}
-
-	page = vm_normal_page(vma, address, pte);
+	if (flags & (FOLL_MLOCK | FOLL_LRU))
+		page = vm_normal_lru_page(vma, address, pte);
+	else
+		page = vm_normal_any_page(vma, address, pte);
 	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
 		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
@@ -824,7 +826,7 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
 *
 * Return: the mapped (struct page *), %NULL if no mapping exists, or
 * an error pointer if there is a mapping to something not represented
- * by a page descriptor (see also vm_normal_page()).
+ * by a page descriptor (see also vm_normal_x_page()).
 */
static struct page *follow_page_mask(struct vm_area_struct *vma,
			      unsigned long address, unsigned int flags,
@@ -917,7 +919,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
 		*vma = get_gate_vma(mm);
 	if (!page)
 		goto out;
-	*page = vm_normal_page(*vma, address, *pte);
+	*page = vm_normal_any_page(*vma, address, *pte);
 	if (!*page) {
 		if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(*pte)))
 			goto unmap;
diff --git a/mm/hmm.c b/mm/hmm.c
index bd56641c79d4..90c949d66712 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -300,7 +300,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	 * Since each architecture defines a struct page for the zero page, just
 	 * fall through and treat it like a normal page.
 	 */
-	if (!vm_normal_page(walk->vma, addr, pte) &&
+	if (!vm_normal_any_page(walk->vma, addr, pte) &&
 	    !pte_devmap(pte) &&
 	    !is_zero_pfn(pte_pfn(pte))) {
 		if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..ea1efc825774 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2966,7 +2966,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		}

 		/* FOLL_DUMP to ignore special (like zero) pages */
-		follflags = FOLL_GET | FOLL_DUMP;
+		follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
 		page = follow_page(vma, addr, follflags);

 		if (IS_ERR(page))
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 131492fd1148..a7153db09afa 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -627,7 +627,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			result = SCAN_PTE_NON_PRESENT;
 			goto out;
 		}
-		page = vm_normal_page(vma, address, pteval);
+		page = vm_normal_lru_page(vma, address, pteval);
 		if (unlikely(!page)) {
 			result = SCAN_PAGE_NULL;
 			goto out;
@@ -1286,7 +1286,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		if (pte_write(pteval))
 			writable = true;

-		page = vm_normal_page(vma, _address, pteval);
+		page = vm_normal_lru_page(vma, _address, pteval);
 		if (unlikely(!page)) {
 			result = SCAN_PAGE_NULL;
 			goto out_unmap;
@@ -1494,7 +1494,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 		if (!pte_present(*pte))
 			goto abort;

-		page = vm_normal_page(vma, addr, *pte);
+		page = vm_normal_lru_page(vma, addr, *pte);

 		/*
 		 * Note that uprobe, debugger, or MAP_PRIVATE may change the
@@ -1512,7 +1512,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 		if (pte_none(*pte))
 			continue;
-		page = vm_normal_page(vma, addr, *pte);
+		page = vm_normal_lru_page(vma, addr, *pte);
 		page_remove_rmap(page, false);
 	}

diff --git a/mm/ksm.c b/mm/ksm.c
index c20bd4d9a0d9..352d37e44694 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -474,7 +474,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 	do {
 		cond_resched();
 		page = follow_page(vma, addr,
-				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
+				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | FOLL_LRU);
 		if (IS_ERR_OR_NULL(page))
 			break;
 		if (PageKsm(page))
@@ -559,7 +559,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 	if (!vma)
 		goto out;

-	page = follow_page(vma, addr, FOLL_GET);
+	page = follow_page(vma, addr, FOLL_GET | FOLL_LRU);
 	if (IS_ERR_OR_NULL(page))
 		goto out;
 	if (PageAnon(page)) {
diff --git a/mm/madvise.c b/mm/madvise.c
index 5604064df464..1a553aad9aa3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -439,7 +439,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		if (!pte_present(ptent))
 			continue;

-		page = vm_normal_page(vma, addr, ptent);
+		page = vm_normal_lru_page(vma, addr, ptent);
 		if (!page)
 			continue;

@@ -649,7 +649,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		}

-		page = vm_normal_page(vma, addr, ptent);
+		page = vm_normal_lru_page(vma, addr, ptent);
 		if (!page)
 			continue;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 10259c35fde2..9677eb27dea8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5476,7 +5476,7 @@ enum mc_target_type {
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
 						unsigned long addr, pte_t ptent)
 {
-	struct page *page = vm_normal_page(vma, addr, ptent);
+	struct page *page = vm_normal_any_page(vma, addr, ptent);

 	if (!page || !page_mapped(page))
 		return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..cff84e6a6c4b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -565,7 +565,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 }

 /*
- * vm_normal_page -- This function gets the "struct page" associated with a pte.
+ * vm_normal_any_page -- This function gets the "struct page" associated with a pte.
  *
  * "Special" mappings do not wish to be associated with a "struct page" (either
  * it doesn't exist, or it exists but they don't want to touch it). In this
@@ -606,7 +606,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
  * PFNMAP mappings in order to support COWable mappings.
  *
  */
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+struct page *vm_normal_any_page(struct vm_area_struct *vma, unsigned long addr,
 			    pte_t pte)
 {
 	unsigned long pfn = pte_pfn(pte);
@@ -620,8 +620,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			return NULL;
 		if (is_zero_pfn(pfn))
 			return NULL;
-		if (pte_devmap(pte))
-			return NULL;

 		print_bad_pte(vma, addr, pte, NULL);
 		return NULL;
@@ -661,6 +659,22 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 	return pfn_to_page(pfn);
 }

+/*
+ * vm_normal_lru_page -- This function gets the "struct page" associated
+ * with a pte only for page cache and anon page. These pages are LRU handled.
+ */
+struct page *vm_normal_lru_page(struct vm_area_struct *vma, unsigned long addr,
+			    pte_t pte)
+{
+	struct page *page;
+
+	page = vm_normal_any_page(vma, addr, pte);
+	if (is_zone_device_page(page))
+		return NULL;
+
+	return page;
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t pmd)
@@ -670,7 +684,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 	/*
 	 * There is no pmd_special() but there may be special pmds, e.g.
 	 * in a direct-access (dax) mapping, so let's just replicate the
-	 * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
+	 * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_any_page() here.
 	 */
 	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
 		if (vma->vm_flags & VM_MIXEDMAP) {
@@ -946,7 +960,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	pte_t pte = *src_pte;
 	struct page *page;

-	page = vm_normal_page(src_vma, addr, pte);
+	page = vm_normal_any_page(src_vma, addr, pte);
 	if (page) {
 		int retval;

@@ -1358,7 +1372,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		if (pte_present(ptent)) {
 			struct page *page;

-			page = vm_normal_page(vma, addr, ptent);
+			page = vm_normal_any_page(vma, addr, ptent);
 			if (unlikely(zap_skip_check_mapping(details, page)))
 				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
@@ -2168,7 +2182,7 @@ EXPORT_SYMBOL(vmf_insert_pfn);

 static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
 {
-	/* these checks mirror the abort conditions in vm_normal_page */
+	/* these checks mirror the abort conditions in vm_normal_lru_page */
 	if (vma->vm_flags & VM_MIXEDMAP)
 		return true;
 	if (pfn_t_devmap(pfn))
@@ -2198,7 +2212,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,

 	/*
 	 * If we don't have pte special, then we have to use the pfn_valid()
-	 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
+	 * based VM_MIXEDMAP scheme (see vm_normal_any_page), and thus we *must*
 	 * refcount the page if pfn_valid is true (hence insert_page rather
 	 * than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP
 	 * without pte special, it would there be refcounted as a normal page.
@@ -2408,7 +2422,7 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
 	 * There's a horrible special case to handle copy-on-write
 	 * behaviour that some programs depend on. We mark the "original"
 	 * un-COW'ed pages by matching them up with "vma->vm_pgoff".
-	 * See vm_normal_page() for details.
+	 * See vm_normal_any_page() for details.
 	 */
 	if (is_cow_mapping(vma->vm_flags)) {
 		if (addr != vma->vm_start || end != vma->vm_end)
@@ -3267,7 +3281,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
 		     mm_tlb_flush_pending(vmf->vma->vm_mm)))
 		flush_tlb_page(vmf->vma, vmf->address);

-	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
+	vmf->page = vm_normal_any_page(vma, vmf->address, vmf->orig_pte);
 	if (!vmf->page) {
 		/*
 		 * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
@@ -4364,7 +4378,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	old_pte = ptep_get(vmf->pte);
 	pte = pte_modify(old_pte, vma->vm_page_prot);

-	page = vm_normal_page(vma, vmf->address, pte);
+	page = vm_normal_lru_page(vma, vmf->address, pte);
 	if (!page)
 		goto out_map;

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 028e8dd82b44..9962de4981d6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -527,11 +527,11 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		if (!pte_present(*pte))
 			continue;
-		page = vm_normal_page(vma, addr, *pte);
+		page = vm_normal_lru_page(vma, addr, *pte);
 		if (!page)
 			continue;
 		/*
-		 * vm_normal_page() filters out zero pages, but there might
+		 * vm_normal_lru_page() filters out zero pages, but there might
 		 * still be PageReserved pages to skip, perhaps in a VDSO.
 		 */
 		if (PageReserved(page))
diff --git a/mm/migrate.c b/mm/migrate.c
index c31d04b46a5e..17d049311b78 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 		goto out;

 	/* FOLL_DUMP to ignore special (like zero) pages */
-	follflags = FOLL_GET | FOLL_DUMP;
+	follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
 	page = follow_page(vma, addr, follflags);

 	err = PTR_ERR(page);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 3373b535d5c9..fac1b6978361 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -154,7 +154,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			migrate->cpages++;
 			goto next;
 		}
-		page = vm_normal_page(migrate->vma, addr, pte);
+		page = vm_normal_any_page(migrate->vma, addr, pte);
 		if (page && !is_zone_device_page(page) &&
 		    !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
 			goto next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 8f584eddd305..52613e2f2a70 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -342,7 +342,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 * a non-TPH page already pinned and in the @pvec, and that it belongs to @zone.
 *
 * The rest of @pvec is filled by subsequent pages within the same pmd and same
- * zone, as long as the pte's are present and vm_normal_page() succeeds. These
+ * zone, as long as the pte's are present and vm_normal_any_page() succeeds. These
 * pages also get pinned.
 *
 * Returns the address of the next page that should be scanned. This equals
@@ -373,7 +373,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 		struct page *page = NULL;
 		pte++;
 		if (pte_present(*pte))
-			page = vm_normal_page(vma, start, *pte);
+			page = vm_normal_lru_page(vma, start, *pte);
 		/*
 		 * Break if page could not be obtained or the page's node+zone does not
 		 * match
@@ -439,7 +439,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 		 * suits munlock very well (and if somehow an abnormal page
 		 * has sneaked into the range, we won't oops here: great).
 		 */
-		page = follow_page(vma, start, FOLL_GET | FOLL_DUMP);
+		page = follow_page(vma, start, FOLL_GET | FOLL_DUMP | FOLL_LRU);

 		if (page && !IS_ERR(page)) {
 			if (PageTransTail(page)) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0138dfcdb1d8..d236394d41d5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -88,7 +88,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 			if (pte_protnone(oldpte))
 				continue;

-			page = vm_normal_page(vma, addr, oldpte);
+			page = vm_normal_lru_page(vma, addr, oldpte);
 			if (!page || PageKsm(page))
 				continue;
On 2022-02-28 15:34, Alex Sierra wrote:
DEVICE_COHERENT pages introduce a subtle distinction in the way "normal" pages can be used by various callers throughout the kernel. They behave like normal pages for purposes of mapping in CPU page tables, and for COW. But they do not support LRU lists, NUMA migration or THP.
Should have mentioned KSM here as well for completeness.
Therefore we split vm_normal_page into two functions vm_normal_any_page and vm_normal_lru_page. The latter will only return pages that can be put on an LRU list and that support NUMA migration and THP.
We also introduce a FOLL_LRU flag that adds the same behaviour to follow_page and related APIs, to allow callers to specify that they expect to put pages on an LRU list.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
FWIW. Full disclosure, Alex and I worked on this together, but it's a bit like the blind leading the blind. ;) It's mostly untested at this point. Alex is working on adding tests for get_user_pages of DEVICE_COHERENT pages without FOLL_LONGTERM to test_hmm and also a test for COW of DEVICE_COHERENT pages.
A few more nit-picks inline.
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ff9f149ca201..8c9f87151d93 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -593,8 +593,8 @@ struct vm_operations_struct {
 					unsigned long addr);
 #endif
 	/*
-	 * Called by vm_normal_page() for special PTEs to find the
-	 * page for @addr. This is useful if the default behavior
+	 * Called by vm_normal_x_page() for special PTEs to find the
+	 * page for @addr. This is useful if the default behavior
 	 * (using pte_page()) would not find the correct page.
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,

I'd use vm_normal_*_page in these comments to avoid confusion, because vm_normal_x_page looks like a valid symbol name, but no such symbol exists.
@@ -3227,7 +3230,7 @@ extern long copy_huge_page_from_user(struct page *dst_page,
 * @vma: Pointer to the struct vm_area_struct to consider
 *
 * Whether transhuge page-table entries are considered "special" following
- * the definition in vm_normal_page().
+ * the definition in vm_normal_x_page().
 *
 * Return: true if transhuge page-table entries should be considered special,
 * false otherwise.

Same here: vm_normal_*_page.
diff --git a/mm/gup.c b/mm/gup.c
index 41349b685eaf..9e172c906ded 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -539,8 +539,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		pte_unmap_unlock(ptep, ptl);
 		return NULL;
 	}
-
-	page = vm_normal_page(vma, address, pte);
+	if (flags & (FOLL_MLOCK | FOLL_LRU))
+		page = vm_normal_lru_page(vma, address, pte);
+	else
+		page = vm_normal_any_page(vma, address, pte);
 	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
 		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
@@ -824,7 +826,7 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
 *
 * Return: the mapped (struct page *), %NULL if no mapping exists, or
 * an error pointer if there is a mapping to something not represented
- * by a page descriptor (see also vm_normal_page()).
+ * by a page descriptor (see also vm_normal_x_page()).

Same here: vm_normal_*_page.
*/ static struct page *follow_page_mask(struct vm_area_struct *vma, unsigned long address, unsigned int flags, @@ -917,7 +919,7 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address, *vma = get_gate_vma(mm); if (!page) goto out;
- *page = vm_normal_page(*vma, address, *pte);
- *page = vm_normal_any_page(*vma, address, *pte); if (!*page) { if ((gup_flags & FOLL_DUMP) || !is_zero_pfn(pte_pfn(*pte))) goto unmap;
diff --git a/mm/hmm.c b/mm/hmm.c index bd56641c79d4..90c949d66712 100644 --- a/mm/hmm.c +++ b/mm/hmm.c @@ -300,7 +300,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, * Since each architecture defines a struct page for the zero page, just * fall through and treat it like a normal page. */
- if (!vm_normal_page(walk->vma, addr, pte) &&
- if (!vm_normal_any_page(walk->vma, addr, pte) && !pte_devmap(pte) && !is_zero_pfn(pte_pfn(pte))) { if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 406a3c28c026..ea1efc825774 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2966,7 +2966,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start, }
/* FOLL_DUMP to ignore special (like zero) pages */
follflags = FOLL_GET | FOLL_DUMP;
follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
page = follow_page(vma, addr, follflags);
if (IS_ERR(page))
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 131492fd1148..a7153db09afa 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -627,7 +627,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
			result = SCAN_PTE_NON_PRESENT;
			goto out;
		}
-		page = vm_normal_page(vma, address, pteval);
+		page = vm_normal_lru_page(vma, address, pteval);
		if (unlikely(!page)) {
			result = SCAN_PAGE_NULL;
			goto out;
@@ -1286,7 +1286,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
		if (pte_write(pteval))
			writable = true;

-		page = vm_normal_page(vma, _address, pteval);
+		page = vm_normal_lru_page(vma, _address, pteval);
		if (unlikely(!page)) {
			result = SCAN_PAGE_NULL;
			goto out_unmap;
@@ -1494,7 +1494,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
		if (!pte_present(*pte))
			goto abort;
-		page = vm_normal_page(vma, addr, *pte);
+		page = vm_normal_lru_page(vma, addr, *pte);

		/*
		 * Note that uprobe, debugger, or MAP_PRIVATE may change the
@@ -1512,7 +1512,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
		if (pte_none(*pte))
			continue;
-		page = vm_normal_page(vma, addr, *pte);
+		page = vm_normal_lru_page(vma, addr, *pte);
		page_remove_rmap(page, false);
	}
diff --git a/mm/ksm.c b/mm/ksm.c
index c20bd4d9a0d9..352d37e44694 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -474,7 +474,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
	do {
		cond_resched();
		page = follow_page(vma, addr,
-				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
+				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | FOLL_LRU);
		if (IS_ERR_OR_NULL(page))
			break;
		if (PageKsm(page))
@@ -559,7 +559,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
	if (!vma)
		goto out;

-	page = follow_page(vma, addr, FOLL_GET);
+	page = follow_page(vma, addr, FOLL_GET | FOLL_LRU);
	if (IS_ERR_OR_NULL(page))
		goto out;
	if (PageAnon(page)) {
diff --git a/mm/madvise.c b/mm/madvise.c
index 5604064df464..1a553aad9aa3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -439,7 +439,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
		if (!pte_present(ptent))
			continue;

-		page = vm_normal_page(vma, addr, ptent);
+		page = vm_normal_lru_page(vma, addr, ptent);
		if (!page)
			continue;
@@ -649,7 +649,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
			continue;
		}

-		page = vm_normal_page(vma, addr, ptent);
+		page = vm_normal_lru_page(vma, addr, ptent);
		if (!page)
			continue;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 10259c35fde2..9677eb27dea8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5476,7 +5476,7 @@ enum mc_target_type {
static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
		unsigned long addr, pte_t ptent)
{
-	struct page *page = vm_normal_page(vma, addr, ptent);
+	struct page *page = vm_normal_any_page(vma, addr, ptent);

	if (!page || !page_mapped(page))
		return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..cff84e6a6c4b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -565,7 +565,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
}

/*
- * vm_normal_page -- This function gets the "struct page" associated with a pte.
+ * vm_normal_any_page -- This function gets the "struct page" associated with a pte.
 *
 * "Special" mappings do not wish to be associated with a "struct page" (either
 * it doesn't exist, or it exists but they don't want to touch it). In this
@@ -606,7 +606,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 * PFNMAP mappings in order to support COWable mappings.
 *
 */
-struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
+struct page *vm_normal_any_page(struct vm_area_struct *vma, unsigned long addr,
			    pte_t pte)
{
	unsigned long pfn = pte_pfn(pte);
@@ -620,8 +620,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
			return NULL;
		if (is_zero_pfn(pfn))
			return NULL;
-		if (pte_devmap(pte))
-			return NULL;

		print_bad_pte(vma, addr, pte, NULL);
		return NULL;
@@ -661,6 +659,22 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
	return pfn_to_page(pfn);
}

+/*
+ * vm_normal_lru_page -- This function gets the "struct page" associated
+ * with a pte only for page cache and anon page. These pages are LRU handled.
+ */
+struct page *vm_normal_lru_page(struct vm_area_struct *vma, unsigned long addr,
+			    pte_t pte)
+{
+	struct page *page;
+
+	page = vm_normal_any_page(vma, addr, pte);
+	if (!page || is_zone_device_page(page))
+		return NULL;
+
+	return page;
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
				pmd_t pmd)
@@ -670,7 +684,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
	/*
	 * There is no pmd_special() but there may be special pmds, e.g.
	 * in a direct-access (dax) mapping, so let's just replicate the
-	 * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_page() here.
+	 * !CONFIG_ARCH_HAS_PTE_SPECIAL case from vm_normal_any_page() here.
	 */
	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
		if (vma->vm_flags & VM_MIXEDMAP) {
@@ -946,7 +960,7 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
	pte_t pte = *src_pte;
	struct page *page;

-	page = vm_normal_page(src_vma, addr, pte);
+	page = vm_normal_any_page(src_vma, addr, pte);
	if (page) {
		int retval;
@@ -1358,7 +1372,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
		if (pte_present(ptent)) {
			struct page *page;

			page = vm_normal_any_page(vma, addr, ptent);
			if (unlikely(zap_skip_check_mapping(details, page)))
				continue;
			ptent = ptep_get_and_clear_full(mm, addr, pte,
@@ -2168,7 +2182,7 @@ EXPORT_SYMBOL(vmf_insert_pfn);

static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
{
-	/* these checks mirror the abort conditions in vm_normal_page */
+	/* these checks mirror the abort conditions in vm_normal_lru_page */
	if (vma->vm_flags & VM_MIXEDMAP)
		return true;
	if (pfn_t_devmap(pfn))
If this is to match the new vm_normal_lru_page, it should replace "if (pfn_t_devmap(pfn))" with a check that the page is not a device page. But for that it would have to actually look up the struct page.
I'm not sure what to do about this. __vm_insert_mixed still does something special with devmap pages, which no longer matches vm_normal_*_page.
@@ -2198,7 +2212,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,

	/*
	 * If we don't have pte special, then we have to use the pfn_valid()
-	 * based VM_MIXEDMAP scheme (see vm_normal_page), and thus we *must*
+	 * based VM_MIXEDMAP scheme (see vm_normal_any_page), and thus we *must*
	 * refcount the page if pfn_valid is true (hence insert_page rather
	 * than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP
	 * without pte special, it would there be refcounted as a normal page.
@@ -2408,7 +2422,7 @@ int remap_pfn_range_notrack(struct vm_area_struct *vma, unsigned long addr,
	 * There's a horrible special case to handle copy-on-write
	 * behaviour that some programs depend on. We mark the "original"
	 * un-COW'ed pages by matching them up with "vma->vm_pgoff".
-	 * See vm_normal_page() for details.
+	 * See vm_normal_any_page() for details.
	 */
	if (is_cow_mapping(vma->vm_flags)) {
		if (addr != vma->vm_start || end != vma->vm_end)
@@ -3267,7 +3281,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
	    mm_tlb_flush_pending(vmf->vma->vm_mm)))
		flush_tlb_page(vmf->vma, vmf->address);

-	vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
+	vmf->page = vm_normal_any_page(vma, vmf->address, vmf->orig_pte);
	if (!vmf->page) {
		/*
		 * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
@@ -4364,7 +4378,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
	old_pte = ptep_get(vmf->pte);
	pte = pte_modify(old_pte, vma->vm_page_prot);

-	page = vm_normal_page(vma, vmf->address, pte);
+	page = vm_normal_lru_page(vma, vmf->address, pte);
	if (!page)
		goto out_map;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 028e8dd82b44..9962de4981d6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -527,11 +527,11 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
	for (; addr != end; pte++, addr += PAGE_SIZE) {
		if (!pte_present(*pte))
			continue;
-		page = vm_normal_page(vma, addr, *pte);
+		page = vm_normal_lru_page(vma, addr, *pte);
		if (!page)
			continue;
		/*
-		 * vm_normal_page() filters out zero pages, but there might
+		 * vm_normal_lru_page() filters out zero pages, but there might
		 * still be PageReserved pages to skip, perhaps in a VDSO.
		 */
		if (PageReserved(page))
diff --git a/mm/migrate.c b/mm/migrate.c
index c31d04b46a5e..17d049311b78 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
		goto out;

	/* FOLL_DUMP to ignore special (like zero) pages */
-	follflags = FOLL_GET | FOLL_DUMP;
+	follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
	page = follow_page(vma, addr, follflags);

	err = PTR_ERR(page);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 3373b535d5c9..fac1b6978361 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -154,7 +154,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
			migrate->cpages++;
			goto next;
		}
-		page = vm_normal_page(migrate->vma, addr, pte);
+		page = vm_normal_any_page(migrate->vma, addr, pte);
		if (page && !is_zone_device_page(page) &&
		    !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
			goto next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 8f584eddd305..52613e2f2a70 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -342,7 +342,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
 * a non-THP page already pinned and in the @pvec, and that it belongs to @zone.
 *
 * The rest of @pvec is filled by subsequent pages within the same pmd and same
- * zone, as long as the pte's are present and vm_normal_page() succeeds. These
+ * zone, as long as the pte's are present and vm_normal_any_page() succeeds. These

The comment says vm_normal_any_page. But the function uses vm_normal_lru_page.

Regards, Felix

 * pages also get pinned.
 *
 * Returns the address of the next page that should be scanned. This equals
@@ -373,7 +373,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
		struct page *page = NULL;
		pte++;
		if (pte_present(*pte))
-			page = vm_normal_page(vma, start, *pte);
+			page = vm_normal_lru_page(vma, start, *pte);
		/*
		 * Break if page could not be obtained or the page's node+zone does not
		 * match
@@ -439,7 +439,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
		 * suits munlock very well (and if somehow an abnormal page
		 * has sneaked into the range, we won't oops here: great).
		 */
-		page = follow_page(vma, start, FOLL_GET | FOLL_DUMP);
+		page = follow_page(vma, start, FOLL_GET | FOLL_DUMP | FOLL_LRU);
		if (page && !IS_ERR(page)) {
			if (PageTransTail(page)) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0138dfcdb1d8..d236394d41d5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -88,7 +88,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
			if (pte_protnone(oldpte))
				continue;

-			page = vm_normal_page(vma, addr, oldpte);
+			page = vm_normal_lru_page(vma, addr, oldpte);
			if (!page || PageKsm(page))
				continue;
On 28.02.22 21:34, Alex Sierra wrote:
DEVICE_COHERENT pages introduce a subtle distinction in the way "normal" pages can be used by various callers throughout the kernel. They behave like normal pages for purposes of mapping in CPU page tables, and for COW. But they do not support LRU lists, NUMA migration or THP. Therefore we split vm_normal_page into two functions vm_normal_any_page and vm_normal_lru_page. The latter will only return pages that can be put on an LRU list and that support NUMA migration and THP.
Why not s/vm_normal_any_page/vm_normal_page/ and avoid code churn?
We also introduced a FOLL_LRU flag that adds the same behaviour to follow_page and related APIs, to allow callers to specify that they expect to put pages on an LRU list.
[...]
-#define FOLL_WRITE	0x01	/* check pte is writable */
-#define FOLL_TOUCH	0x02	/* mark page accessed */
-#define FOLL_GET	0x04	/* do get_page on page */
-#define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
-#define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
-#define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
-				 * and return without waiting upon it */
-#define FOLL_POPULATE	0x40	/* fault in pages (with FOLL_MLOCK) */
-#define FOLL_NOFAULT	0x80	/* do not fault in pages */
-#define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
-#define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
-#define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
-#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
-#define FOLL_MLOCK	0x1000	/* lock present pages */
-#define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
-#define FOLL_COW	0x4000	/* internal GUP flag */
-#define FOLL_ANON	0x8000	/* don't do file mappings */
-#define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
-#define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
-#define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
-#define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_WRITE	0x01	/* check pte is writable */
+#define FOLL_TOUCH	0x02	/* mark page accessed */
+#define FOLL_GET	0x04	/* do get_page on page */
+#define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
+#define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
+#define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
+				 * and return without waiting upon it */
+#define FOLL_POPULATE	0x40	/* fault in pages (with FOLL_MLOCK) */
+#define FOLL_NOFAULT	0x80	/* do not fault in pages */
+#define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
+#define FOLL_NUMA	0x200	/* force NUMA hinting page fault */
+#define FOLL_MIGRATION	0x400	/* wait for page to replace migration entry */
+#define FOLL_TRIED	0x800	/* a retry, previous pass started an IO */
+#define FOLL_MLOCK	0x1000	/* lock present pages */
+#define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
+#define FOLL_COW	0x4000	/* internal GUP flag */
+#define FOLL_ANON	0x8000	/* don't do file mappings */
+#define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
+#define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
+#define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
+#define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_LRU	0x100000 /* return only LRU (anon or page cache) */
Can we minimize code churn, please?
		if (PageReserved(page))

diff --git a/mm/migrate.c b/mm/migrate.c
@@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
		goto out;

	/* FOLL_DUMP to ignore special (like zero) pages */
-	follflags = FOLL_GET | FOLL_DUMP;
+	follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
	page = follow_page(vma, addr, follflags);
Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.
Am 2022-03-01 um 03:03 schrieb David Hildenbrand:
On 28.02.22 21:34, Alex Sierra wrote:
DEVICE_COHERENT pages introduce a subtle distinction in the way "normal" pages can be used by various callers throughout the kernel. They behave like normal pages for purposes of mapping in CPU page tables, and for COW. But they do not support LRU lists, NUMA migration or THP. Therefore we split vm_normal_page into two functions vm_normal_any_page and vm_normal_lru_page. The latter will only return pages that can be put on an LRU list and that support NUMA migration and THP.
Why not s/vm_normal_any_page/vm_normal_page/ and avoid code churn?
I don't care much, personally, what names we end up with. I'll wait for a consensus here.
We also introduced a FOLL_LRU flag that adds the same behaviour to follow_page and related APIs, to allow callers to specify that they expect to put pages on an LRU list.
[...]
[...]
-#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
+#define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_LRU	0x100000 /* return only LRU (anon or page cache) */
Can we minimize code churn, please?
OK. I guess we could unindent the FOLL_LRU number to avoid changing all the comments.
		if (PageReserved(page))

diff --git a/mm/migrate.c b/mm/migrate.c
@@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
		goto out;

	/* FOLL_DUMP to ignore special (like zero) pages */
-	follflags = FOLL_GET | FOLL_DUMP;
+	follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
	page = follow_page(vma, addr, follflags);
Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.
This function later calls isolate_lru_page, which is something you can't do with a device page.
Regards, Felix
		if (PageReserved(page))

diff --git a/mm/migrate.c b/mm/migrate.c
@@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
		goto out;

	/* FOLL_DUMP to ignore special (like zero) pages */
-	follflags = FOLL_GET | FOLL_DUMP;
+	follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
	page = follow_page(vma, addr, follflags);
Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.
This function later calls isolate_lru_page, which is something you can't do with a device page.
Then, that code might require care instead. We most certainly don't want to have random memory holes in a dump just because some anonymous memory was migrated to DEVICE_COHERENT.
Am 2022-03-01 um 11:22 schrieb David Hildenbrand:
		if (PageReserved(page))

diff --git a/mm/migrate.c b/mm/migrate.c
@@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
		goto out;

	/* FOLL_DUMP to ignore special (like zero) pages */
-	follflags = FOLL_GET | FOLL_DUMP;
+	follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
	page = follow_page(vma, addr, follflags);
Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.
This function later calls isolate_lru_page, which is something you can't do with a device page.
Then, that code might require care instead. We most certainly don't want to have random memory holes in a dump just because some anonymous memory was migrated to DEVICE_COHERENT.
I don't think this code is for core dumps. The call chain I see is
SYSCALL_DEFINE6(move_pages, ...) -> kernel_move_pages -> do_pages_move -> add_page_for_migration
As far as I can tell this is for NUMA migrations.
Regards, Felix
On 01.03.22 17:30, Felix Kuehling wrote:
Am 2022-03-01 um 11:22 schrieb David Hildenbrand:
		if (PageReserved(page))

diff --git a/mm/migrate.c b/mm/migrate.c
@@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
		goto out;

	/* FOLL_DUMP to ignore special (like zero) pages */
-	follflags = FOLL_GET | FOLL_DUMP;
+	follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
	page = follow_page(vma, addr, follflags);
Why wouldn't we want to dump DEVICE_COHERENT pages? This looks wrong.
This function later calls isolate_lru_page, which is something you can't do with a device page.
Then, that code might require care instead. We most certainly don't want to have random memory holes in a dump just because some anonymous memory was migrated to DEVICE_COHERENT.
I don't think this code is for core dumps. The call chain I see is
SYSCALL_DEFINE6(move_pages, ...) -> kernel_move_pages -> do_pages_move -> add_page_for_migration
Ah, sorry, I got misled by FOLL_DUMP and thought we'd be in get_dump_page(). As you said, nothing to do.
Felix Kuehling felix.kuehling@amd.com writes:
Am 2022-02-16 um 07:26 schrieb Jason Gunthorpe:
The other place that needs careful audit is all the callers using vm_normal_page() - they must all be able to accept a ZONE_DEVICE page if we don't set pte_devmap.
How much code are we talking about here? A quick search finds 26 call-sites in 12 files in current master:
fs/proc/task_mmu.c
mm/hmm.c
mm/gup.c
mm/huge_memory.c (vm_normal_page_pmd)
mm/khugepaged.c
mm/madvise.c
mm/mempolicy.c
mm/memory.c
mm/mlock.c
mm/migrate.c
mm/mprotect.c
mm/memcontrol.c
I'm thinking of a more theoretical approach: Instead of auditing all users, I'd ask, what are the invariants that a vm_normal_page should have. Then check, whether our DEVICE_COHERENT pages satisfy them. But maybe the concept of a vm_normal_page isn't defined clearly enough for that.
That said, I think we (Alex and myself) made an implicit assumption from the start, that a DEVICE_COHERENT page should behave a lot like a normal page in terms of VMA mappings, even if we didn't know what that means in detail.
Yes I'm afraid I made a similar mistake when reviewing this, forgetting that DEVICE_COHERENT pages are not LRU pages and therefore need special treatment in some places. So for now I will have to withdraw my reviewed-by until this has been looked at more closely, because as you note below accidentally treating them as LRU pages leads to a bad time.
I can now at least name some differences between DEVICE_COHERENT and normal pages: how the memory is allocated, how data is migrated into DEVICE_COHERENT pages and that it can't be on any LRU list (because the lru list_head in struct page is aliased by pgmap and zone_device_data). Maybe I'll find more differences if I keep digging.
Regards, Felix
Jason
dri-devel@lists.freedesktop.org