v1: https://lore.kernel.org/linux-mm/20210529064022.GB15834@lst.de/T/
v2: This version of the patch series merges Ralph Campbell's "[RFC PATCH v3 0/2] mm: remove extra ZONE_DEVICE struct page refcount" patch series. On top of that, it applies our changes to support the device generic type in the migrate_vma helpers. This has been tested on systems with device memory that the CPU can access coherently.
It also addresses the following feedback from v1:
- Isolate the kernel/resource.c modification in its own patch, based on Christoph's feedback.
- Add helpers to check for the generic and private types, to avoid duplicating long lines.
I'd like to provide an overview of what each patch in this series does:
Patches 1-2: Rebased Ralph Campbell's ZONE_DEVICE page refcounting patches
Patch 3: Export lookup_resource
Patches 4-5: AMDGPU driver changes to register and use DEVICE_GENERIC memory
Patches 6-8: Handle DEVICE_GENERIC memory in migration helpers
Alex Sierra (6):
  kernel: resource: lookup_resource as exported symbol
  drm/amdkfd: add SPM support for SVM
  drm/amdkfd: generic type as sys mem on migration to ram
  include/linux/mm.h: helpers to check zone device generic type
  mm: add generic type support to migrate_vma helpers
  mm: call pgmap->ops->page_free for DEVICE_GENERIC pages
Ralph Campbell (2):
  ext4/xfs: add page refcount helper
  mm: remove extra ZONE_DEVICE struct page refcount
 arch/powerpc/kvm/book3s_hv_uvmem.c       |  2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 15 ++++--
 drivers/gpu/drm/nouveau/nouveau_dmem.c   |  2 +-
 fs/dax.c                                 |  8 +--
 fs/ext4/inode.c                          |  5 +-
 fs/xfs/xfs_file.c                        |  4 +-
 include/linux/dax.h                      | 10 ++++
 include/linux/memremap.h                 |  7 +--
 include/linux/mm.h                       | 52 +++---------------
 kernel/resource.c                        |  2 +-
 lib/test_hmm.c                           |  2 +-
 mm/internal.h                            |  8 +++
 mm/memremap.c                            | 69 +++++++-----------------
 mm/migrate.c                             | 13 ++---
 mm/page_alloc.c                          |  3 ++
 mm/swap.c                                | 45 ++--------------
 16 files changed, 83 insertions(+), 164 deletions(-)
From: Ralph Campbell rcampbell@nvidia.com
There are several places where ZONE_DEVICE struct pages assume a reference count == 1 means the page is idle and free. Instead of open coding this, add a helper function to hide this detail.
Signed-off-by: Ralph Campbell rcampbell@nvidia.com
---
 fs/dax.c            |  4 ++--
 fs/ext4/inode.c     |  5 +----
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++++++++++
 4 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c index 26d5dcd2d69e..321f4ddc6643 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping, for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn);
- WARN_ON_ONCE(trunc && page_ref_count(page) > 1); + WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page)); WARN_ON_ONCE(page->mapping && page->mapping != mapping); page->mapping = NULL; page->index = 0; @@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry) for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn);
- if (page_ref_count(page) > 1) + if (!dax_layout_is_idle_page(page)) return page; } return NULL; diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index c173c8405856..9ee00186412f 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3972,10 +3972,7 @@ int ext4_break_layouts(struct inode *inode) if (!page) return 0;
- error = ___wait_var_event(&page->_refcount, - atomic_read(&page->_refcount) == 1, - TASK_INTERRUPTIBLE, 0, 0, - ext4_wait_dax_page(ei)); + error = dax_wait_page(ei, page, ext4_wait_dax_page); } while (error == 0);
return error; diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 5b0f93f73837..39565fe5f817 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -782,9 +782,7 @@ xfs_break_dax_layouts( return 0;
*retry = true; - return ___wait_var_event(&page->_refcount, - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE, - 0, 0, xfs_wait_dax_page(inode)); + return dax_wait_page(inode, page, xfs_wait_dax_page); }
int diff --git a/include/linux/dax.h b/include/linux/dax.h index b52f084aa643..8909a91cd381 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping) return mapping->host && IS_DAX(mapping->host); }
+static inline bool dax_layout_is_idle_page(struct page *page) +{ + return page_ref_count(page) == 1; +} + +#define dax_wait_page(_inode, _page, _wait_cb) \ + ___wait_var_event(&(_page)->_refcount, \ + dax_layout_is_idle_page(_page), \ + TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode)) + #ifdef CONFIG_DEV_DAX_HMEM_DEVICES void hmem_register_device(int target_nid, struct resource *r); #else
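For readability, the include/linux/dax.h additions from the hunk above, reflowed:

static inline bool dax_layout_is_idle_page(struct page *page)
{
        return page_ref_count(page) == 1;
}

#define dax_wait_page(_inode, _page, _wait_cb)                          \
        ___wait_var_event(&(_page)->_refcount,                          \
                dax_layout_is_idle_page(_page),                         \
                TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))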
* Alex Sierra alex.sierra@amd.com [210607 16:43]:
From: Ralph Campbell rcampbell@nvidia.com
There are several places where ZONE_DEVICE struct pages assume a reference count == 1 means the page is idle and free. Instead of open coding this, add a helper function to hide this detail.
Signed-off-by: Ralph Campbell rcampbell@nvidia.com
 fs/dax.c            |  4 ++--
 fs/ext4/inode.c     |  5 +----
 fs/xfs/xfs_file.c   |  4 +---
 include/linux/dax.h | 10 ++++++++++
 4 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c index 26d5dcd2d69e..321f4ddc6643 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -358,7 +358,7 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping, for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn);
- WARN_ON_ONCE(trunc && page_ref_count(page) > 1); + WARN_ON_ONCE(trunc && !dax_layout_is_idle_page(page)); WARN_ON_ONCE(page->mapping && page->mapping != mapping); page->mapping = NULL; page->index = 0;
@@ -372,7 +372,7 @@ static struct page *dax_busy_page(void *entry) for_each_mapped_pfn(entry, pfn) { struct page *page = pfn_to_page(pfn);
- if (page_ref_count(page) > 1) + if (!dax_layout_is_idle_page(page)) return page; } return NULL;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index c173c8405856..9ee00186412f 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3972,10 +3972,7 @@ int ext4_break_layouts(struct inode *inode) if (!page) return 0;
- error = ___wait_var_event(&page->_refcount, - atomic_read(&page->_refcount) == 1, - TASK_INTERRUPTIBLE, 0, 0, - ext4_wait_dax_page(ei)); + error = dax_wait_page(ei, page, ext4_wait_dax_page);
} while (error == 0);
return error;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 5b0f93f73837..39565fe5f817 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -782,9 +782,7 @@ xfs_break_dax_layouts( return 0;
*retry = true; - return ___wait_var_event(&page->_refcount, - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE, - 0, 0, xfs_wait_dax_page(inode)); + return dax_wait_page(inode, page, xfs_wait_dax_page);
}
int diff --git a/include/linux/dax.h b/include/linux/dax.h index b52f084aa643..8909a91cd381 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping) return mapping->host && IS_DAX(mapping->host); }
+static inline bool dax_layout_is_idle_page(struct page *page) +{ + return page_ref_count(page) == 1; +}
If this races with page_ref_count(page) == 0, then it will return false, i.e. report that a page is not idle, when the page is actually being freed. I don't know the code well enough to say if this is an issue or not so please let me know.
For example: !dax_layout_is_idle_page() will return true in dax_busy_page() above when the count is 0 and return the page.
Maybe you are guaranteed to hold at least one reference when calling this? It might be worth adding a comment.
+#define dax_wait_page(_inode, _page, _wait_cb) \ + ___wait_var_event(&(_page)->_refcount, \ + dax_layout_is_idle_page(_page), \ + TASK_INTERRUPTIBLE, 0, 0, _wait_cb(_inode))
#ifdef CONFIG_DEV_DAX_HMEM_DEVICES void hmem_register_device(int target_nid, struct resource *r);
#else
2.17.1
On Tue, Jun 08, 2021 at 12:29:04AM +0000, Liam Howlett wrote:
* Alex Sierra alex.sierra@amd.com [210607 16:43]:
From: Ralph Campbell rcampbell@nvidia.com
There are several places where ZONE_DEVICE struct pages assume a reference count == 1 means the page is idle and free. Instead of open coding this, add a helper function to hide this detail.
diff --git a/include/linux/dax.h b/include/linux/dax.h index b52f084aa643..8909a91cd381 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping) return mapping->host && IS_DAX(mapping->host); }
+static inline bool dax_layout_is_idle_page(struct page *page) +{ + return page_ref_count(page) == 1; +}
If this races with page_ref_count(page) == 0, then it will return false, i.e. report that a page is not idle, when the page is actually being freed. I don't know the code well enough to say if this is an issue or not so please let me know.
For example: !dax_layout_is_idle_page() will return true in dax_busy_page() above when the count is 0 and return the page.
Maybe you are guaranteed to hold at least one reference when calling this? It might be worth adding a comment.
You're getting confused by the problem that the next patch fixes, which is that devmap pages were stupidly given an elevated refcount. devmap pages are considered "free" when their refcount is 1. See put_page(), put_devmap_managed_page() and so on.
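For reference, the 1-based convention described here is exactly the code that patch 2 below removes from mm/swap.c; condensed:

/*
 * devmap page refcounts are 1-based rather than 0-based: if the
 * refcount is 1, the page is free, and the count is stable because
 * nobody holds a reference on it.
 */
void put_devmap_managed_page(struct page *page)
{
        int count = page_ref_dec_return(page);

        if (count == 1)
                free_devmap_managed_page(page);
        else if (!count)
                __put_page(page);
}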
On Mon, Jun 07, 2021 at 03:42:19PM -0500, Alex Sierra wrote:
+++ b/include/linux/dax.h @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping) return mapping->host && IS_DAX(mapping->host); }
+static inline bool dax_layout_is_idle_page(struct page *page) +{ + return page_ref_count(page) == 1; +}
We already have something called an idle page, and that's quite a different thing from this. How about dax_page_unused() (it's a use count, so once it's got down to its minimum value, it's unused)?
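A sketch of the suggested rename, purely illustrative (the body of the helper is unchanged):

static inline bool dax_page_unused(struct page *page)
{
        return page_ref_count(page) == 1;
}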
Am 2021-06-09 um 3:23 p.m. schrieb Matthew Wilcox:
On Mon, Jun 07, 2021 at 03:42:19PM -0500, Alex Sierra wrote:
+++ b/include/linux/dax.h @@ -243,6 +243,16 @@ static inline bool dax_mapping(struct address_space *mapping) return mapping->host && IS_DAX(mapping->host); }
+static inline bool dax_layout_is_idle_page(struct page *page) +{ + return page_ref_count(page) == 1; +}
We already have something called an idle page, and that's quite a different thing from this. How about dax_page_unused() (it's a use count, so once it's got down to its minimum value, it's unused)?
Hi Matthew,
Thank you very much for your feedback. This patch looks straightforward enough, but do we need the filesystem maintainers to review it as well? I guess we should CC the linux-ext4 and linux-xfs mailing lists in the next revision.
Hi Ralph,
Are you OK if we update your patch with this suggestion?
Thanks, Felix
From: Ralph Campbell rcampbell@nvidia.com
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE.
v2: AS: merged this patch into Linux 5.11.
Signed-off-by: Ralph Campbell rcampbell@nvidia.com
Signed-off-by: Alex Sierra alex.sierra@amd.com
---
 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  4 +-
 include/linux/dax.h                    |  2 +-
 include/linux/memremap.h               |  7 +--
 include/linux/mm.h                     | 44 -----------------
 lib/test_hmm.c                         |  2 +-
 mm/internal.h                          |  8 +++
 mm/memremap.c                          | 68 +++++++-------------------
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 ++
 mm/swap.c                              | 45 ++---------------
 12 files changed, 45 insertions(+), 147 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index 84e5a2dc8be5..acee67710620 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -711,7 +711,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
dpage = pfn_to_page(uvmem_pfn); dpage->zone_device_data = pvt; - get_page(dpage); + init_page_count(dpage); lock_page(dpage); return dpage; out_clear: diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 92987daa5e17..8bc7120e1216 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -324,7 +324,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm) return NULL; }
- get_page(page); + init_page_count(page); lock_page(page); return page; } diff --git a/fs/dax.c b/fs/dax.c index 321f4ddc6643..7b4c6b35b098 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -560,14 +560,14 @@ static void *grab_mapping_entry(struct xa_state *xas,
/** * dax_layout_busy_page_range - find first pinned page in @mapping - * @mapping: address space to scan for a page with ref count > 1 + * @mapping: address space to scan for a page with ref count > 0 * @start: Starting offset. Page containing 'start' is included. * @end: End offset. Page containing 'end' is included. If 'end' is LLONG_MAX, * pages from 'start' till the end of file are included. * * DAX requires ZONE_DEVICE mapped pages. These pages are never * 'onlined' to the page allocator so they are considered idle when - * page->count == 1. A filesystem uses this interface to determine if + * page->count == 0. A filesystem uses this interface to determine if * any page in the mapping is busy, i.e. for DMA, or other * get_user_pages() usages. * diff --git a/include/linux/dax.h b/include/linux/dax.h index 8909a91cd381..9487dc06b220 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -245,7 +245,7 @@ static inline bool dax_mapping(struct address_space *mapping)
static inline bool dax_layout_is_idle_page(struct page *page) { - return page_ref_count(page) == 1; + return page_ref_count(page) == 0; }
#define dax_wait_page(_inode, _page, _wait_cb) \ diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 79c49e7f5c30..327f32427d21 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -66,9 +66,10 @@ enum memory_type {
struct dev_pagemap_ops { /* - * Called once the page refcount reaches 1. (ZONE_DEVICE pages never - * reach 0 refcount unless there is a refcount bug. This allows the - * device driver to implement its own memory management.) + * Called once the page refcount reaches 0. The reference count + * should be reset to one with init_page_count(page) before reusing + * the page. This allows the device driver to implement its own + * memory management. */ void (*page_free)(struct page *page);
diff --git a/include/linux/mm.h b/include/linux/mm.h index c9900aedc195..d8d79bb94be8 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1117,39 +1117,6 @@ static inline bool is_zone_device_page(const struct page *page) } #endif
-#ifdef CONFIG_DEV_PAGEMAP_OPS -void free_devmap_managed_page(struct page *page); -DECLARE_STATIC_KEY_FALSE(devmap_managed_key); - -static inline bool page_is_devmap_managed(struct page *page) -{ - if (!static_branch_unlikely(&devmap_managed_key)) - return false; - if (!is_zone_device_page(page)) - return false; - switch (page->pgmap->type) { - case MEMORY_DEVICE_PRIVATE: - case MEMORY_DEVICE_FS_DAX: - return true; - default: - break; - } - return false; -} - -void put_devmap_managed_page(struct page *page); - -#else /* CONFIG_DEV_PAGEMAP_OPS */ -static inline bool page_is_devmap_managed(struct page *page) -{ - return false; -} - -static inline void put_devmap_managed_page(struct page *page) -{ -} -#endif /* CONFIG_DEV_PAGEMAP_OPS */ - static inline bool is_device_private_page(const struct page *page) { return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && @@ -1196,17 +1163,6 @@ static inline void put_page(struct page *page) { page = compound_head(page);
- /* - * For devmap managed pages we need to catch refcount transition from - * 2 to 1, when refcount reach one it means the page is free and we - * need to inform the device driver through callback. See - * include/linux/memremap.h and HMM for details. - */ - if (page_is_devmap_managed(page)) { - put_devmap_managed_page(page); - return; - } - if (put_page_testzero(page)) __put_page(page); } diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 80a78877bd93..6998f10350ea 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -561,7 +561,7 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice) }
dpage->zone_device_data = rpage; - get_page(dpage); + init_page_count(dpage); lock_page(dpage); return dpage;
diff --git a/mm/internal.h b/mm/internal.h index 25d2b2439f19..d3e58746f2d2 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -623,4 +623,12 @@ struct migration_target_control { gfp_t gfp_mask; };
+#ifdef CONFIG_DEV_PAGEMAP_OPS +void free_zone_device_page(struct page *page); +#else +static inline void free_zone_device_page(struct page *page) +{ +} +#endif + #endif /* __MM_INTERNAL_H */ diff --git a/mm/memremap.c b/mm/memremap.c index 16b2fb482da1..614b3d600e95 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -12,6 +12,7 @@ #include <linux/types.h> #include <linux/wait_bit.h> #include <linux/xarray.h> +#include "internal.h"
static DEFINE_XARRAY(pgmap_array);
@@ -37,32 +38,6 @@ unsigned long memremap_compat_align(void) EXPORT_SYMBOL_GPL(memremap_compat_align); #endif
-#ifdef CONFIG_DEV_PAGEMAP_OPS -DEFINE_STATIC_KEY_FALSE(devmap_managed_key); -EXPORT_SYMBOL(devmap_managed_key); - -static void devmap_managed_enable_put(struct dev_pagemap *pgmap) -{ - if (pgmap->type == MEMORY_DEVICE_PRIVATE || - pgmap->type == MEMORY_DEVICE_FS_DAX) - static_branch_dec(&devmap_managed_key); -} - -static void devmap_managed_enable_get(struct dev_pagemap *pgmap) -{ - if (pgmap->type == MEMORY_DEVICE_PRIVATE || - pgmap->type == MEMORY_DEVICE_FS_DAX) - static_branch_inc(&devmap_managed_key); -} -#else -static void devmap_managed_enable_get(struct dev_pagemap *pgmap) -{ -} -static void devmap_managed_enable_put(struct dev_pagemap *pgmap) -{ -} -#endif /* CONFIG_DEV_PAGEMAP_OPS */ - static void pgmap_array_delete(struct range *range) { xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end), @@ -87,16 +62,6 @@ static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id) return (range->start + range_len(range)) >> PAGE_SHIFT; }
-static unsigned long pfn_next(unsigned long pfn) -{ - if (pfn % 1024 == 0) - cond_resched(); - return pfn + 1; -} - -#define for_each_device_pfn(pfn, map, i) \ - for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn)) - static void dev_pagemap_kill(struct dev_pagemap *pgmap) { if (pgmap->ops && pgmap->ops->kill) @@ -152,20 +117,18 @@ static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
void memunmap_pages(struct dev_pagemap *pgmap) { - unsigned long pfn; int i;
dev_pagemap_kill(pgmap); for (i = 0; i < pgmap->nr_range; i++) - for_each_device_pfn(pfn, pgmap, i) - put_page(pfn_to_page(pfn)); + percpu_ref_put_many(pgmap->ref, pfn_end(pgmap, i) - + pfn_first(pgmap, i)); dev_pagemap_cleanup(pgmap);
for (i = 0; i < pgmap->nr_range; i++) pageunmap_range(pgmap, i);
WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n"); - devmap_managed_enable_put(pgmap); } EXPORT_SYMBOL_GPL(memunmap_pages);
@@ -361,8 +324,6 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) } }
- devmap_managed_enable_get(pgmap); - /* * Clear the pgmap nr_range as it will be incremented for each * successfully processed range. This communicates how many @@ -477,16 +438,10 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn, EXPORT_SYMBOL_GPL(get_dev_pagemap);
#ifdef CONFIG_DEV_PAGEMAP_OPS -void free_devmap_managed_page(struct page *page) +static void free_device_private_page(struct page *page) { - /* notify page idle for dax */ - if (!is_device_private_page(page)) { - wake_up_var(&page->_refcount); - return; - }
__ClearPageWaiters(page); - mem_cgroup_uncharge(page);
/* @@ -513,4 +468,19 @@ void free_devmap_managed_page(struct page *page) page->mapping = NULL; page->pgmap->ops->page_free(page); } + +void free_zone_device_page(struct page *page) +{ + switch (page->pgmap->type) { + case MEMORY_DEVICE_FS_DAX: + /* notify page idle */ + wake_up_var(&page->_refcount); + return; + case MEMORY_DEVICE_PRIVATE: + free_device_private_page(page); + return; + default: + return; + } +} #endif /* CONFIG_DEV_PAGEMAP_OPS */ diff --git a/mm/migrate.c b/mm/migrate.c index 20ca887ea769..8c2430d3e77b 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -376,11 +376,6 @@ static int expected_page_refs(struct address_space *mapping, struct page *page) { int expected_count = 1;
- /* - * Device private pages have an extra refcount as they are - * ZONE_DEVICE pages. - */ - expected_count += is_device_private_page(page); if (mapping) expected_count += thp_nr_pages(page) + page_has_private(page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 519a60d5b6f7..4612c457d0b0 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6210,6 +6210,9 @@ void __ref memmap_init_zone_device(struct zone *zone,
__init_single_page(page, pfn, zone_idx, nid);
+ /* ZONE_DEVICE pages start with a zero reference count. */ + set_page_count(page, 0); + /* * Mark page reserved as it will need to wait for onlining * phase for it to be fully associated with a zone. diff --git a/mm/swap.c b/mm/swap.c index 2cca7141470c..0a12af814065 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -114,12 +114,11 @@ static void __put_compound_page(struct page *page) void __put_page(struct page *page) { if (is_zone_device_page(page)) { - put_dev_pagemap(page->pgmap); - /* * The page belongs to the device that created pgmap. Do * not return it to page allocator. */ + free_zone_device_page(page); return; }
@@ -878,29 +877,18 @@ void release_pages(struct page **pages, int nr) if (is_huge_zero_page(page)) continue;
+ if (!put_page_testzero(page)) + continue; + if (is_zone_device_page(page)) { if (lruvec) { unlock_page_lruvec_irqrestore(lruvec, flags); lruvec = NULL; } - /* - * ZONE_DEVICE pages that return 'false' from - * page_is_devmap_managed() do not require special - * processing, and instead, expect a call to - * put_page_testzero(). - */ - if (page_is_devmap_managed(page)) { - put_devmap_managed_page(page); - continue; - } - if (put_page_testzero(page)) - put_dev_pagemap(page->pgmap); + free_zone_device_page(page); continue; }
- if (!put_page_testzero(page)) - continue; - if (PageCompound(page)) { if (lruvec) { unlock_page_lruvec_irqrestore(lruvec, flags); @@ -1142,26 +1130,3 @@ void __init swap_setup(void) * _really_ don't want to cluster much more */ } - -#ifdef CONFIG_DEV_PAGEMAP_OPS -void put_devmap_managed_page(struct page *page) -{ - int count; - - if (WARN_ON_ONCE(!page_is_devmap_managed(page))) - return; - - count = page_ref_dec_return(page); - - /* - * devmap page refcounts are 1-based, rather than 0-based: if - * refcount is 1, then the page is free and the refcount is - * stable because nobody holds a reference on the page. - */ - if (count == 1) - free_devmap_managed_page(page); - else if (!count) - __put_page(page); -} -EXPORT_SYMBOL(put_devmap_managed_page); -#endif
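To make the new lifecycle concrete, a minimal sketch of a driver's allocation path and page_free callback under the zero-based scheme, modeled on the lib/test_hmm.c and nouveau hunks above; the my_* helpers are hypothetical driver internals:

static struct page *my_devmem_alloc_page(struct my_device *mdev)
{
        /* hypothetical free-list lookup */
        struct page *dpage = my_take_free_device_page(mdev);

        /*
         * ZONE_DEVICE pages now start with a zero refcount, so the
         * driver takes the first reference with init_page_count()
         * instead of get_page().
         */
        init_page_count(dpage);
        lock_page(dpage);
        return dpage;
}

static void my_devmem_page_free(struct page *page)
{
        /*
         * Called via pgmap->ops->page_free once the refcount drops
         * to 0. The count must be reset to one with init_page_count()
         * before the page is handed out again.
         */
        my_return_device_page(page);    /* hypothetical */
}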
The AMD architecture for the Frontier supercomputer will have device memory which can be coherently accessed by the CPU. The system BIOS advertises this memory as SPM (special purpose memory) in the UEFI system address map.
The AMDGPU driver needs to be able to look up this resource in order to claim it as MEMORY_DEVICE_GENERIC using devm_memremap_pages.
Signed-off-by: Alex Sierra alex.sierra@amd.com
---
 kernel/resource.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/resource.c b/kernel/resource.c index 627e61b0c124..269489bb7097 100644 --- a/kernel/resource.c +++ b/kernel/resource.c @@ -783,7 +783,7 @@ struct resource *lookup_resource(struct resource *root, resource_size_t start)
return res; } - +EXPORT_SYMBOL_GPL(lookup_resource); /* * Insert a resource into the resource tree. If successful, return NULL, * otherwise return the conflicting resource (compare to __request_resource())
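For context, a condensed sketch of how the exported symbol is used by the amdgpu change later in the series: the driver looks up the BIOS-advertised SPM range and registers it with devm_memremap_pages(). Here dev, pgmap and aper_base stand in for the driver's actual values:

struct resource *res;
void *r;

/* Find the existing iomem entry starting at the device aperture. */
res = lookup_resource(&iomem_resource, aper_base);
if (!res)
        return -ENOMEM;

pgmap->type = MEMORY_DEVICE_GENERIC;
pgmap->nr_range = 1;
pgmap->range.start = res->start;
pgmap->range.end = res->end;
r = devm_memremap_pages(dev, pgmap);
if (IS_ERR(r))
        return PTR_ERR(r);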
When the CPU is connected through XGMI, it has coherent access to the VRAM resource. In this case, that resource is taken from the device's GMC aperture base. This resource is used, along with the device type (which could be DEVICE_PRIVATE or DEVICE_GENERIC), to create the device page map region.
Signed-off-by: Alex Sierra alex.sierra@amd.com
Reviewed-by: Felix Kuehling Felix.Kuehling@amd.com
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index c8ca3252cbc2..f5939449a99f 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -895,6 +895,7 @@ int svm_migrate_init(struct amdgpu_device *adev) struct resource *res; unsigned long size; void *r; + bool xgmi_connected_to_cpu = adev->gmc.xgmi.connected_to_cpu;
/* Page migration works on Vega10 or newer */ if (kfddev->device_info->asic_family < CHIP_VEGA10) @@ -907,17 +908,22 @@ int svm_migrate_init(struct amdgpu_device *adev) * should remove reserved size */ size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20); - res = devm_request_free_mem_region(adev->dev, &iomem_resource, size); + if (xgmi_connected_to_cpu) + res = lookup_resource(&iomem_resource, adev->gmc.aper_base); + else + res = devm_request_free_mem_region(adev->dev, &iomem_resource, size); + if (IS_ERR(res)) return -ENOMEM;
- pgmap->type = MEMORY_DEVICE_PRIVATE; pgmap->nr_range = 1; pgmap->range.start = res->start; pgmap->range.end = res->end; + pgmap->type = xgmi_connected_to_cpu ? + MEMORY_DEVICE_GENERIC : MEMORY_DEVICE_PRIVATE; pgmap->ops = &svm_migrate_pgmap_ops; pgmap->owner = SVM_ADEV_PGMAP_OWNER(adev); - pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE; + pgmap->flags = 0; r = devm_memremap_pages(adev->dev, pgmap); if (IS_ERR(r)) { pr_err("failed to register HMM device memory\n");
Generic device type memory, when migrating from VRAM to RAM, is accessed by the CPU in much the same way as system RAM. The migrate flags select the migration source pages; in the generic type case, this should be set to SYSTEM.
Signed-off-by: Alex Sierra alex.sierra@amd.com
Reviewed-by: Felix Kuehling Felix.Kuehling@amd.com
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index f5939449a99f..7b41006c1164 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -653,8 +653,9 @@ svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange, migrate.vma = vma; migrate.start = start; migrate.end = end; - migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE; migrate.pgmap_owner = SVM_ADEV_PGMAP_OWNER(adev); + migrate.flags = adev->gmc.xgmi.connected_to_cpu ? + MIGRATE_VMA_SELECT_SYSTEM : MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
size = 2 * sizeof(*migrate.src) + sizeof(uint64_t) + sizeof(dma_addr_t); size *= npages;
Two helpers are added. One checks whether a zone device page is of the generic type. The other checks whether a page is of either the private or the generic type.
Signed-off-by: Alex Sierra alex.sierra@amd.com
---
 include/linux/mm.h | 8 ++++++++
 1 file changed, 8 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h index d8d79bb94be8..f5b247a63044 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1125,6 +1125,14 @@ static inline bool is_device_private_page(const struct page *page) page->pgmap->type == MEMORY_DEVICE_PRIVATE; }
+static inline bool is_device_page(const struct page *page) +{ + return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) && + is_zone_device_page(page) && + (page->pgmap->type == MEMORY_DEVICE_PRIVATE || + page->pgmap->type == MEMORY_DEVICE_GENERIC); +} + static inline bool is_pci_p2pdma_page(const struct page *page) { return IS_ENABLED(CONFIG_DEV_PAGEMAP_OPS) &&
The device generic type case is added to the migrate_vma_pages and migrate_vma_check_page helpers. Both the generic and private device types use the same conditions when deciding whether to migrate pages from/to device memory.
Signed-off-by: Alex Sierra alex.sierra@amd.com
---
 mm/migrate.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c index 8c2430d3e77b..3b6aaba96fe6 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2602,7 +2602,7 @@ static bool migrate_vma_check_page(struct page *page) * FIXME proper solution is to rework migration_entry_wait() so * it does not need to take a reference on page. */ - return is_device_private_page(page); + return is_device_page(page); }
/* For file back page */ @@ -3064,10 +3064,10 @@ void migrate_vma_pages(struct migrate_vma *migrate) mapping = page_mapping(page);
if (is_zone_device_page(newpage)) { - if (is_device_private_page(newpage)) { + if (is_device_page(newpage)) { /* - * For now only support private anonymous when - * migrating to un-addressable device memory. + * For now only support private and generic + * anonymous when migrating to device memory. */ if (mapping) { migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
Add a MEMORY_DEVICE_GENERIC case to the free_zone_device_page callback, so that device generic type memory is now able to free its pages properly.
Signed-off-by: Alex Sierra alex.sierra@amd.com
---
 mm/memremap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/mm/memremap.c b/mm/memremap.c index 614b3d600e95..6c884e2542a9 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -438,7 +438,7 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn, EXPORT_SYMBOL_GPL(get_dev_pagemap);
#ifdef CONFIG_DEV_PAGEMAP_OPS -static void free_device_private_page(struct page *page) +static void free_device_page(struct page *page) {
__ClearPageWaiters(page); @@ -477,7 +477,8 @@ void free_zone_device_page(struct page *page) wake_up_var(&page->_refcount); return; case MEMORY_DEVICE_PRIVATE: - free_device_private_page(page); + case MEMORY_DEVICE_GENERIC: + free_device_page(page); return; default: return;
On Mon, Jun 07, 2021 at 03:42:18PM -0500, Alex Sierra wrote:
v1: https://lore.kernel.org/linux-mm/20210529064022.GB15834@lst.de/T/
Please copy and paste the rationale into the followup patch series instead of sending a link:
AMD is building a system architecture for the Frontier supercomputer with a coherent interconnect between CPUs and GPUs. This hardware architecture allows the CPUs to coherently access GPU device memory. We have hardware in our labs and we are working with our partner HPE on the BIOS, firmware and software for delivery to the DOE.
The system BIOS advertises the GPU device memory (aka VRAM) as SPM (special purpose memory) in the UEFI system address map. The amdgpu driver looks it up with lookup_resource and registers it with devmap as MEMORY_DEVICE_GENERIC using devm_memremap_pages.
Now we're trying to migrate data to and from that memory using the migrate_vma_* helpers so we can support page-based migration in our unified memory allocations, while also supporting CPU access to those pages.
This patch series makes a few changes to make MEMORY_DEVICE_GENERIC pages behave correctly in the migrate_vma_* helpers. We are looking for feedback about this approach. If we're close, what's needed to make our patches acceptable upstream? If we're not close, any suggestions how else to achieve what we are trying to do (i.e. page migration and coherent CPU access to VRAM)?
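For readers less familiar with these helpers, a condensed sketch of the migrate_vma flow as the amdkfd patches use it, simplified from kfd_migrate.c with allocation, copying and error handling omitted:

struct migrate_vma migrate = {
        .vma            = vma,
        .start          = start,
        .end            = end,
        .src            = src_pfns,
        .dst            = dst_pfns,
        .pgmap_owner    = SVM_ADEV_PGMAP_OWNER(adev),
        /*
         * Coherent (generic) device memory is selected like ordinary
         * system memory; unaddressable memory keeps the private
         * selector.
         */
        .flags          = adev->gmc.xgmi.connected_to_cpu ?
                                MIGRATE_VMA_SELECT_SYSTEM :
                                MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
};

if (migrate_vma_setup(&migrate))
        return -EFAULT;
/* ... allocate destination pages and copy src to dst ... */
migrate_vma_pages(&migrate);
migrate_vma_finalize(&migrate);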
This work is based on HMM and our SVM memory manager that was recently upstreamed to Dave Airlie's drm-next branch [https://cgit.freedesktop.org/drm/drm/log/?h=drm-next]. On top of that we did some rework of our VRAM management for migrations to remove some incorrect assumptions, allow partially successful migrations and GPU memory mappings that mix pages in VRAM and system memory. [https://patchwork.kernel.org/project/dri-devel/list/?series=489811]