On Thu, Oct 14, 2021 at 11:45 AM Matthew Wilcox <willy@infradead.org> wrote:
It would probably help if you cc'd Dan on this.
Thanks.
[..]
On Thu, Oct 14, 2021 at 02:06:34PM -0300, Jason Gunthorpe wrote:
On Thu, Oct 14, 2021 at 10:39:28AM -0500, Alex Sierra wrote:
From: Ralph Campbell <rcampbell@nvidia.com>
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE.
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
v2: AS: merged this patch into the Linux 5.11 version.
v5: AS: add a condition in try_grab_page to check for the device zone type when the page refcount is less than or equal to zero. For device zone pages, the refcount is initialized to zero.
v7: AS: the condition added to try_grab_page in v5 was invalid and has been removed. It was supposed to fix xfstests/generic/413; however, there is a known issue with that test where DIO from a DAX-mapped area to non-DAX is expected to fail. https://patchwork.kernel.org/project/fstests/patch/1489463960-3579-1-git-sen... The condition was removed after rebasing on top of the patch series https://lore.kernel.org/r/20210813044133.1536842-4-jhubbard@nvidia.com
 arch/powerpc/kvm/book3s_hv_uvmem.c     |  2 +-
 drivers/gpu/drm/nouveau/nouveau_dmem.c |  2 +-
 fs/dax.c                               |  4 +-
 include/linux/dax.h                    |  2 +-
 include/linux/memremap.h               |  7 +--
 include/linux/mm.h                     | 11 ----
 lib/test_hmm.c                         |  2 +-
 mm/internal.h                          |  8 +++
 mm/memcontrol.c                        |  6 +--
 mm/memremap.c                          | 69 +++++++-------------------
 mm/migrate.c                           |  5 --
 mm/page_alloc.c                        |  3 ++
 mm/swap.c                              | 45 ++---------------
 13 files changed, 46 insertions(+), 120 deletions(-)
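For context, the "extra reference count" complication the patch description refers to is the devmap special case in put_page(). From memory of include/linux/mm.h around v5.11 (trimmed, details vary by kernel version), it looks roughly like this:

	static inline void put_page(struct page *page)
	{
		page = compound_head(page);

		/*
		 * Devmap-managed pages are considered free when the refcount
		 * drops to 1 (the extra ZONE_DEVICE reference), not 0, and
		 * the pgmap owner is notified via a callback rather than the
		 * page being returned to the page allocator.
		 */
		if (page_is_devmap_managed(page)) {
			put_devmap_managed_page(page);
			return;
		}

		if (put_page_testzero(page))
			__put_page(page);
	}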
Has anyone tested this with FSDAX? Does get_user_pages() on fsdax backed memory still work?
What refcount value does the struct pages have when they are installed in the PTEs? Remember a 0 refcount will make all the get_user_pages() fail.
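For reference, the pinning helpers refuse pages whose count is already zero, e.g. try_get_page(); roughly, from memory of include/linux/mm.h of that period:

	static inline bool try_get_page(struct page *page)
	{
		page = compound_head(page);
		/* A zero (or negative) refcount means the page must not be pinned */
		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
			return false;
		page_ref_inc(page);
		return true;
	}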
I'm looking at the call path starting in ext4_punch_hole() and I would expect to see something manipulating the page ref count before the ext4_break_layouts() call path gets to the dax_page_unused() test.
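(dax_page_unused() itself is just a refcount-idle check; roughly, from memory of include/linux/dax.h at the time, where "unused" means only the extra ZONE_DEVICE reference remains:)

	static inline bool dax_page_unused(struct page *page)
	{
		return page_ref_count(page) == 1;
	}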
All I see is that we go into unmap_mapping_pages() - that would normally put back the page references held by the PTEs, but insert_pfn() has this:
	if (pfn_t_devmap(pfn))
		entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
And:
static inline pte_t pte_mkdevmap(pte_t pte)
{
	return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
}
Which interacts with vm_normal_page():
	if (pte_devmap(pte))
		return NULL;
To disable that refcounting?
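For reference, once vm_normal_page() returns NULL the unmap side drops no reference at all; the relevant part of zap_pte_range() in mm/memory.c is roughly (heavily trimmed, from memory, details vary by version):

	page = vm_normal_page(vma, addr, ptent);
	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
	tlb_remove_tlb_entry(tlb, pte, addr);
	if (unlikely(!page))
		continue;	/* special/devmap PTE: no struct page refcount to put */
	...
	page_remove_rmap(page, false);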
So... I have a feeling this will leave PTEs pointing to 0-refcount pages? Unless FSDAX is !pte_devmap, which is not the case, right?
This seems further confirmed by this comment:
	/*
	 * If we race get_user_pages_fast() here either we'll see the
	 * elevated page count in the iteration and wait, or
	 * get_user_pages_fast() will see that the page it took a reference
	 * against is no longer mapped in the page tables and bail to the
	 * get_user_pages() slow path.  The slow path is protected by
	 * pte_lock() and pmd_lock(). New references are not taken without
	 * holding those locks, and unmap_mapping_pages() will not zero the
	 * pte or pmd without holding the respective lock, so we are
	 * guaranteed to either see new references or prevent new
	 * references from being established.
	 */
Which seems to confirm that this scheme relies on unmap_mapping_pages() to fence GUP-fast, not on GUP-fast observing a 0 refcount when it should stop.
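That matches what GUP-fast does on its side: it takes the reference speculatively and then re-checks the PTE, dropping the reference and falling back to the slow path if unmap_mapping_pages() already cleared it. gup_pte_range() in mm/gup.c is roughly (trimmed, from memory of that period):

	head = try_grab_compound_head(page, 1, flags);
	if (!head)
		goto pte_unmap;

	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
		/* The PTE changed under us (e.g. it was zapped): undo and bail */
		put_compound_head(head, 1, flags);
		goto pte_unmap;
	}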
This seems like it would be properly fixed by using normal page refcounting for PTEs - ie stop using special for these pages?
Does anyone know why devmap is pte_special anyhow?
It does not need to be special as mentioned here:
https://lore.kernel.org/all/CAPcyv4iFeVDVPn6uc=aKsyUvkiu3-fK-N16iJVZQ3N8oT00...
The refcount dependencies also go away after this...
https://lore.kernel.org/all/161604050866.1463742.7759521510383551055.stgit@d...
...but you can see that patches 1 and 2 in that series depend on being able to guarantee that all mappings are invalidated when the underlying device that owns the pgmap goes away.
For that to happen there needs to be communication back to the FS for device-gone / failure events. That work is in progress via this series:
https://lore.kernel.org/all/20210924130959.2695749-1-ruansy.fnst@fujitsu.com...
So there's a path to unwind this awkwardness, but it needs some dominoes to fall first as far as I can see. My current focus is getting Shiyang's series unblocked.