From: John Hubbard <jhubbard@nvidia.com>
Hi,
These are best characterized as miscellaneous conversions: many (but not all) of the call sites that involve neither biovec, iov_iter, nor mm/. The series also leaves out a few call sites that require more work; the ones converted here are mostly pretty simple.
It's probably best to send all of these via Andrew's -mm tree, assuming that there are no significant merge conflicts with ongoing work in other trees (which I doubt, given that these are small changes).
These patches apply to the latest linux.git. Patch #1 is already in Andrew's tree, but given the broad non-linux-mm Cc list, I thought it would be more convenient to include that patch here, so that people can use linux.git as the base, even though these patches are probably destined for linux-mm.
This is part of a tree-wide conversion, as described in commit fc1d8e7cca2d ("mm: introduce put_user_page*(), placeholder versions"). That commit has an extensive description of the problem and the planned steps to solve it, but the highlights are:
1) Provide put_user_page*() routines, intended to be used for releasing pages that were pinned via get_user_pages*().
2) Convert all of the call sites for get_user_pages*(), to invoke put_user_page*(), instead of put_page(). This involves dozens of call sites, and will take some time.
3) After (2) is complete, use get_user_pages*() and put_user_page*() to implement tracking of these pages. This tracking will be separate from the existing struct page refcounting.
4) Use the tracking and identification of these pages, to implement special handling (especially in writeback paths) when the pages are backed by a filesystem.
And a few references, also from that commit:
[1] https://lwn.net/Articles/774411/ : "DMA and get_user_pages()"
[2] https://lwn.net/Articles/753027/ : "The Trouble with get_user_pages()"
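As an aside, the mechanical shape of step (2) is easy to show. Here is a minimal before/after sketch, using a hypothetical call site (pages[], npages, and i are illustrative names; the real call sites are in the diffstat below):

	/* Before: pages[] was filled via get_user_pages*(), and released
	 * one page at a time:
	 */
	for (i = 0; i < npages; i++)
		put_page(pages[i]);

	/* After: released through the gup-aware wrapper instead: */
	put_user_pages(pages, npages);

The change is trivial at each call site; the value is that afterwards, every release of a gup-pinned page goes through one identifiable API, which is what makes steps (3) and (4) possible.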
Ira Weiny (1):
  fs/binfmt_elf: convert put_page() to put_user_page*()

John Hubbard (33):
  mm/gup: add make_dirty arg to put_user_pages_dirty_lock()
  net/rds: convert put_page() to put_user_page*()
  net/ceph: convert put_page() to put_user_page*()
  x86/kvm: convert put_page() to put_user_page*()
  drm/etnaviv: convert release_pages() to put_user_pages()
  drm/i915: convert put_page() to put_user_page*()
  drm/radeon: convert put_page() to put_user_page*()
  media/ivtv: convert put_page() to put_user_page*()
  media/v4l2-core/mm: convert put_page() to put_user_page*()
  genwqe: convert put_page() to put_user_page*()
  scif: convert put_page() to put_user_page*()
  vmci: convert put_page() to put_user_page*()
  rapidio: convert put_page() to put_user_page*()
  oradax: convert put_page() to put_user_page*()
  staging/vc04_services: convert put_page() to put_user_page*()
  drivers/tee: convert put_page() to put_user_page*()
  vfio: convert put_page() to put_user_page*()
  fbdev/pvr2fb: convert put_page() to put_user_page*()
  fsl_hypervisor: convert put_page() to put_user_page*()
  xen: convert put_page() to put_user_page*()
  fs/exec.c: convert put_page() to put_user_page*()
  orangefs: convert put_page() to put_user_page*()
  uprobes: convert put_page() to put_user_page*()
  futex: convert put_page() to put_user_page*()
  mm/frame_vector.c: convert put_page() to put_user_page*()
  mm/gup_benchmark.c: convert put_page() to put_user_page*()
  mm/memory.c: convert put_page() to put_user_page*()
  mm/madvise.c: convert put_page() to put_user_page*()
  mm/process_vm_access.c: convert put_page() to put_user_page*()
  crypt: convert put_page() to put_user_page*()
  nfs: convert put_page() to put_user_page*()
  goldfish_pipe: convert put_page() to put_user_page*()
  kernel/events/core.c: convert put_page() to put_user_page*()
 arch/x86/kvm/svm.c                          |   4 +-
 crypto/af_alg.c                             |   7 +-
 drivers/gpu/drm/etnaviv/etnaviv_gem.c       |   4 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c |   9 +-
 drivers/gpu/drm/radeon/radeon_ttm.c         |   2 +-
 drivers/infiniband/core/umem.c              |   5 +-
 drivers/infiniband/hw/hfi1/user_pages.c     |   5 +-
 drivers/infiniband/hw/qib/qib_user_pages.c  |   5 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c    |   5 +-
 drivers/infiniband/sw/siw/siw_mem.c         |  10 +-
 drivers/media/pci/ivtv/ivtv-udma.c          |  14 +--
 drivers/media/pci/ivtv/ivtv-yuv.c           |  10 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c   |   3 +-
 drivers/misc/genwqe/card_utils.c            |  17 +--
 drivers/misc/mic/scif/scif_rma.c            |  17 ++-
 drivers/misc/vmw_vmci/vmci_context.c        |   2 +-
 drivers/misc/vmw_vmci/vmci_queue_pair.c     |  11 +-
 drivers/platform/goldfish/goldfish_pipe.c   |   9 +-
 drivers/rapidio/devices/rio_mport_cdev.c    |   9 +-
 drivers/sbus/char/oradax.c                  |   2 +-
 .../interface/vchiq_arm/vchiq_2835_arm.c    |  10 +-
 drivers/tee/tee_shm.c                       |  10 +-
 drivers/vfio/vfio_iommu_type1.c             |   8 +-
 drivers/video/fbdev/pvr2fb.c                |   3 +-
 drivers/virt/fsl_hypervisor.c               |   7 +-
 drivers/xen/gntdev.c                        |   5 +-
 drivers/xen/privcmd.c                       |   7 +-
 fs/binfmt_elf.c                             |   2 +-
 fs/binfmt_elf_fdpic.c                       |   2 +-
 fs/exec.c                                   |   2 +-
 fs/nfs/direct.c                             |   4 +-
 fs/orangefs/orangefs-bufmap.c               |   7 +-
 include/linux/mm.h                          |   5 +-
 kernel/events/core.c                        |   2 +-
 kernel/events/uprobes.c                     |   6 +-
 kernel/futex.c                              |  10 +-
 mm/frame_vector.c                           |   4 +-
 mm/gup.c                                    | 115 ++++++++----------
 mm/gup_benchmark.c                          |   2 +-
 mm/madvise.c                                |   2 +-
 mm/memory.c                                 |   2 +-
 mm/process_vm_access.c                      |  18 +--
 net/ceph/pagevec.c                          |   8 +-
 net/rds/info.c                              |   5 +-
 net/rds/message.c                           |   2 +-
 net/rds/rdma.c                              |  15 ++-
 virt/kvm/kvm_main.c                         |   4 +-
 47 files changed, 151 insertions(+), 266 deletions(-)
From: John Hubbard <jhubbard@nvidia.com>
Provide a more capable variation of put_user_pages_dirty_lock(), and delete put_user_pages_dirty(). This is based on the following:
1. Lots of call sites become simpler if a bool is passed into put_user_page*(), instead of making the call site choose which put_user_page*() variant to call.
2. Christoph Hellwig's observation that set_page_dirty_lock() is usually correct, and set_page_dirty() is usually a bug, or at least questionable, within a put_user_page*() calling chain.
This leads to the following API choices:
* put_user_pages_dirty_lock(page, npages, make_dirty)
* There is no put_user_pages_dirty(). You have to hand code that, in the rare case that it's required.
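For completeness, here is what that hand-coded version would look like (just a sketch; it assumes the caller has verified, per point (2) above, that the unlocked set_page_dirty() really is safe in its context):

	unsigned long i;

	for (i = 0; i < npages; i++) {
		/* Unlocked variant: only valid if the caller has checked
		 * that this is safe here (see point (2) above).
		 */
		set_page_dirty(pages[i]);
		put_user_page(pages[i]);
	}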
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 drivers/infiniband/core/umem.c             |   5 +-
 drivers/infiniband/hw/hfi1/user_pages.c    |   5 +-
 drivers/infiniband/hw/qib/qib_user_pages.c |   5 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c   |   5 +-
 drivers/infiniband/sw/siw/siw_mem.c        |  10 +-
 include/linux/mm.h                         |   5 +-
 mm/gup.c                                   | 115 +++++++++------------
 7 files changed, 58 insertions(+), 92 deletions(-)
diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 08da840ed7ee..965cf9dea71a 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -54,10 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
 	for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
 		page = sg_page_iter_page(&sg_iter);
-		if (umem->writable && dirty)
-			put_user_pages_dirty_lock(&page, 1);
-		else
-			put_user_page(page);
+		put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
 	}
 	sg_free_table(&umem->sg_head);
diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
index b89a9b9aef7a..469acb961fbd 100644
--- a/drivers/infiniband/hw/hfi1/user_pages.c
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -118,10 +118,7 @@ int hfi1_acquire_user_pages(struct mm_struct *mm, unsigned long vaddr, size_t np
 void hfi1_release_user_pages(struct mm_struct *mm, struct page **p,
 			     size_t npages, bool dirty)
 {
-	if (dirty)
-		put_user_pages_dirty_lock(p, npages);
-	else
-		put_user_pages(p, npages);
+	put_user_pages_dirty_lock(p, npages, dirty);
 	if (mm) { /* during close after signal, mm can be NULL */
 		atomic64_sub(npages, &mm->pinned_vm);
diff --git a/drivers/infiniband/hw/qib/qib_user_pages.c b/drivers/infiniband/hw/qib/qib_user_pages.c
index bfbfbb7e0ff4..6bf764e41891 100644
--- a/drivers/infiniband/hw/qib/qib_user_pages.c
+++ b/drivers/infiniband/hw/qib/qib_user_pages.c
@@ -40,10 +40,7 @@
 static void __qib_release_user_pages(struct page **p, size_t num_pages,
 				     int dirty)
 {
-	if (dirty)
-		put_user_pages_dirty_lock(p, num_pages);
-	else
-		put_user_pages(p, num_pages);
+	put_user_pages_dirty_lock(p, num_pages, dirty);
 }
 /**
diff --git a/drivers/infiniband/hw/usnic/usnic_uiom.c b/drivers/infiniband/hw/usnic/usnic_uiom.c
index 0b0237d41613..62e6ffa9ad78 100644
--- a/drivers/infiniband/hw/usnic/usnic_uiom.c
+++ b/drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -75,10 +75,7 @@ static void usnic_uiom_put_pages(struct list_head *chunk_list, int dirty)
 		for_each_sg(chunk->page_list, sg, chunk->nents, i) {
 			page = sg_page(sg);
 			pa = sg_phys(sg);
-			if (dirty)
-				put_user_pages_dirty_lock(&page, 1);
-			else
-				put_user_page(page);
+			put_user_pages_dirty_lock(&page, 1, dirty);
 			usnic_dbg("pa: %pa\n", &pa);
 		}
 		kfree(chunk);
diff --git a/drivers/infiniband/sw/siw/siw_mem.c b/drivers/infiniband/sw/siw/siw_mem.c
index 67171c82b0c4..ab83a9cec562 100644
--- a/drivers/infiniband/sw/siw/siw_mem.c
+++ b/drivers/infiniband/sw/siw/siw_mem.c
@@ -63,15 +63,7 @@ struct siw_mem *siw_mem_id2obj(struct siw_device *sdev, int stag_index)
 static void siw_free_plist(struct siw_page_chunk *chunk, int num_pages,
 			   bool dirty)
 {
-	struct page **p = chunk->plist;
-
-	while (num_pages--) {
-		if (!PageDirty(*p) && dirty)
-			put_user_pages_dirty_lock(p, 1);
-		else
-			put_user_page(*p);
-		p++;
-	}
+	put_user_pages_dirty_lock(chunk->plist, num_pages, dirty);
 }
 void siw_umem_release(struct siw_umem *umem, bool dirty)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..9759b6a24420 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1057,8 +1057,9 @@ static inline void put_user_page(struct page *page)
 	put_page(page);
 }
-void put_user_pages_dirty(struct page **pages, unsigned long npages);
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages);
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+			       bool make_dirty);
+
 void put_user_pages(struct page **pages, unsigned long npages);
 #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..7fefd7ab02c4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -29,85 +29,70 @@ struct follow_page_context {
 	unsigned int page_mask;
 };
-typedef int (*set_dirty_func_t)(struct page *page);
-
-static void __put_user_pages_dirty(struct page **pages,
-				   unsigned long npages,
-				   set_dirty_func_t sdf)
-{
-	unsigned long index;
-
-	for (index = 0; index < npages; index++) {
-		struct page *page = compound_head(pages[index]);
-
-		/*
-		 * Checking PageDirty at this point may race with
-		 * clear_page_dirty_for_io(), but that's OK. Two key cases:
-		 *
-		 * 1) This code sees the page as already dirty, so it skips
-		 * the call to sdf(). That could happen because
-		 * clear_page_dirty_for_io() called page_mkclean(),
-		 * followed by set_page_dirty(). However, now the page is
-		 * going to get written back, which meets the original
-		 * intention of setting it dirty, so all is well:
-		 * clear_page_dirty_for_io() goes on to call
-		 * TestClearPageDirty(), and write the page back.
-		 *
-		 * 2) This code sees the page as clean, so it calls sdf().
-		 * The page stays dirty, despite being written back, so it
-		 * gets written back again in the next writeback cycle.
-		 * This is harmless.
-		 */
-		if (!PageDirty(page))
-			sdf(page);
-
-		put_user_page(page);
-	}
-}
-
 /**
- * put_user_pages_dirty() - release and dirty an array of gup-pinned pages
- * @pages:  array of pages to be marked dirty and released.
+ * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned pages
+ * @pages:  array of pages to be maybe marked dirty, and definitely released.
  * @npages: number of pages in the @pages array.
+ * @make_dirty: whether to mark the pages dirty
  *
  * "gup-pinned page" refers to a page that has had one of the get_user_pages()
  * variants called on that page.
  *
  * For each page in the @pages array, make that page (or its head page, if a
- * compound page) dirty, if it was previously listed as clean. Then, release
- * the page using put_user_page().
+ * compound page) dirty, if @make_dirty is true, and if the page was previously
+ * listed as clean. In any case, releases all pages using put_user_page(),
+ * possibly via put_user_pages(), for the non-dirty case.
  *
  * Please see the put_user_page() documentation for details.
  *
- * set_page_dirty(), which does not lock the page, is used here.
- * Therefore, it is the caller's responsibility to ensure that this is
- * safe. If not, then put_user_pages_dirty_lock() should be called instead.
+ * set_page_dirty_lock() is used internally. If instead, set_page_dirty() is
+ * required, then the caller should a) verify that this is really correct,
+ * because _lock() is usually required, and b) hand code it:
+ * set_page_dirty_lock(), put_user_page().
  *
  */
-void put_user_pages_dirty(struct page **pages, unsigned long npages)
+void put_user_pages_dirty_lock(struct page **pages, unsigned long npages,
+			       bool make_dirty)
 {
-	__put_user_pages_dirty(pages, npages, set_page_dirty);
-}
-EXPORT_SYMBOL(put_user_pages_dirty);
+	unsigned long index;
-/**
- * put_user_pages_dirty_lock() - release and dirty an array of gup-pinned pages
- * @pages:  array of pages to be marked dirty and released.
- * @npages: number of pages in the @pages array.
- *
- * For each page in the @pages array, make that page (or its head page, if a
- * compound page) dirty, if it was previously listed as clean. Then, release
- * the page using put_user_page().
- *
- * Please see the put_user_page() documentation for details.
- *
- * This is just like put_user_pages_dirty(), except that it invokes
- * set_page_dirty_lock(), instead of set_page_dirty().
- *
- */
-void put_user_pages_dirty_lock(struct page **pages, unsigned long npages)
-{
-	__put_user_pages_dirty(pages, npages, set_page_dirty_lock);
+	/*
+	 * TODO: this can be optimized for huge pages: if a series of pages is
+	 * physically contiguous and part of the same compound page, then a
+	 * single operation to the head page should suffice.
+	 */
+
+	if (!make_dirty) {
+		put_user_pages(pages, npages);
+		return;
+	}
+
+	for (index = 0; index < npages; index++) {
+		struct page *page = compound_head(pages[index]);
+		/*
+		 * Checking PageDirty at this point may race with
+		 * clear_page_dirty_for_io(), but that's OK. Two key
+		 * cases:
+		 *
+		 * 1) This code sees the page as already dirty, so it
+		 * skips the call to set_page_dirty(). That could happen
+		 * because clear_page_dirty_for_io() called
+		 * page_mkclean(), followed by set_page_dirty().
+		 * However, now the page is going to get written back,
+		 * which meets the original intention of setting it
+		 * dirty, so all is well: clear_page_dirty_for_io() goes
+		 * on to call TestClearPageDirty(), and write the page
+		 * back.
+		 *
+		 * 2) This code sees the page as clean, so it calls
+		 * set_page_dirty(). The page stays dirty, despite being
+		 * written back, so it gets written back again in the
+		 * next writeback cycle. This is harmless.
+		 */
+		if (!PageDirty(page))
+			set_page_dirty_lock(page);
+		put_user_page(page);
+	}
 }
 EXPORT_SYMBOL(put_user_pages_dirty_lock);
From: John Hubbard <jhubbard@nvidia.com>
For pages that were retained via get_user_pages*(), release those pages via the new put_user_page*() routines, instead of via put_page() or release_pages().
This is part of a tree-wide conversion, as described in commit fc1d8e7cca2d ("mm: introduce put_user_page*(), placeholder versions").
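The recurring pattern here is collapsing a separate dirty-then-release pair into a single call. Schematically (a sketch only; the actual instances are in rds_rdma_free_op() and rds_atomic_free_op(), below):

	/* Before: */
	set_page_dirty(page);
	put_page(page);

	/* After: mark dirty (using the safer, locked variant) and release
	 * in one step:
	 */
	put_user_pages_dirty_lock(&page, 1, true);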
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
---
 net/rds/info.c    |  5 ++---
 net/rds/message.c |  2 +-
 net/rds/rdma.c    | 15 +++++++--------
 3 files changed, 10 insertions(+), 12 deletions(-)
diff --git a/net/rds/info.c b/net/rds/info.c
index 03f6fd56d237..ca6af2889adf 100644
--- a/net/rds/info.c
+++ b/net/rds/info.c
@@ -162,7 +162,6 @@ int rds_info_getsockopt(struct socket *sock, int optname, char __user *optval,
 	struct rds_info_lengths lens;
 	unsigned long nr_pages = 0;
 	unsigned long start;
-	unsigned long i;
 	rds_info_func func;
 	struct page **pages = NULL;
 	int ret;
@@ -235,8 +234,8 @@ int rds_info_getsockopt(struct socket *sock, int optname, char __user *optval,
 		ret = -EFAULT;
 out:
-	for (i = 0; pages && i < nr_pages; i++)
-		put_page(pages[i]);
+	if (pages)
+		put_user_pages(pages, nr_pages);
 	kfree(pages);
 	return ret;
diff --git a/net/rds/message.c b/net/rds/message.c
index 50f13f1d4ae0..d7b0d266c437 100644
--- a/net/rds/message.c
+++ b/net/rds/message.c
@@ -404,7 +404,7 @@ static int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *
 			int i;
 			for (i = 0; i < rm->data.op_nents; i++)
-				put_page(sg_page(&rm->data.op_sg[i]));
+				put_user_page(sg_page(&rm->data.op_sg[i]));
 			mmp = &rm->data.op_mmp_znotifier->z_mmp;
 			mm_unaccount_pinned_pages(mmp);
 			ret = -EFAULT;
diff --git a/net/rds/rdma.c b/net/rds/rdma.c
index 916f5ec373d8..6762e8696b99 100644
--- a/net/rds/rdma.c
+++ b/net/rds/rdma.c
@@ -162,8 +162,7 @@ static int rds_pin_pages(unsigned long user_addr, unsigned int nr_pages,
 			pages);
 	if (ret >= 0 && ret < nr_pages) {
-		while (ret--)
-			put_page(pages[ret]);
+		put_user_pages(pages, ret);
 		ret = -EFAULT;
 	}
@@ -276,7 +275,7 @@ static int __rds_rdma_map(struct rds_sock *rs, struct rds_get_mr_args *args,
 	if (IS_ERR(trans_private)) {
 		for (i = 0 ; i < nents; i++)
-			put_page(sg_page(&sg[i]));
+			put_user_page(sg_page(&sg[i]));
 		kfree(sg);
 		ret = PTR_ERR(trans_private);
 		goto out;
@@ -464,9 +463,10 @@ void rds_rdma_free_op(struct rm_rdma_op *ro)
 		 * to local memory */
 		if (!ro->op_write) {
 			WARN_ON(!page->mapping && irqs_disabled());
-			set_page_dirty(page);
+			put_user_pages_dirty_lock(&page, 1, true);
+		} else {
+			put_user_page(page);
 		}
-		put_page(page);
 	}
 	kfree(ro->op_notifier);
@@ -481,8 +481,7 @@ void rds_atomic_free_op(struct rm_atomic_op *ao)
 	/* Mark page dirty if it was possibly modified, which
 	 * is the case for a RDMA_READ which copies from remote
 	 * to local memory */
-	set_page_dirty(page);
-	put_page(page);
+	put_user_pages_dirty_lock(&page, 1, true);
 	kfree(ao->op_notifier);
 	ao->op_notifier = NULL;
@@ -867,7 +866,7 @@ int rds_cmsg_atomic(struct rds_sock *rs, struct rds_message *rm,
 	return ret;
 err:
 	if (page)
-		put_page(page);
+		put_user_page(page);
 	rm->atomic.op_active = 0;
 	kfree(rm->atomic.op_notifier);
On 8/1/19 7:16 PM, john.hubbard@gmail.com wrote:
> From: John Hubbard <jhubbard@nvidia.com>
> Hi,
> These are best characterized as miscellaneous conversions: many (but not all) of the call sites that involve neither biovec, iov_iter, nor mm/. The series also leaves out a few call sites that require more work; the ones converted here are mostly pretty simple.
> It's probably best to send all of these via Andrew's -mm tree, assuming that there are no significant merge conflicts with ongoing work in other trees (which I doubt, given that these are small changes).
In case anyone is wondering, this truncated series is due to a script failure: git-send-email chokes when it hits email addresses whose names have a comma in them, as happened here with patch 0003.
Please disregard this set and reply to the other thread.
thanks,
On Thu, Aug 01, 2019 at 07:16:19PM -0700, john.hubbard@gmail.com wrote:
> This is part of a tree-wide conversion, as described in commit fc1d8e7cca2d ("mm: introduce put_user_page*(), placeholder versions"). That commit has an extensive description of the problem and the planned steps to solve it, but the highlights are:
That is one horridly mangled Changelog there :-/ It looks like it's partially duplicated.
Anyway; no objections to any of that, but I just wanted to mention that there are other problems with long term pinning that haven't been mentioned, notably they inhibit compaction.
A long time ago I proposed an interface to mark pages as pinned, such that we could run compaction before we actually did the pinning.
On 8/2/19 1:05 AM, Peter Zijlstra wrote:
> On Thu, Aug 01, 2019 at 07:16:19PM -0700, john.hubbard@gmail.com wrote:
>> This is part of a tree-wide conversion, as described in commit fc1d8e7cca2d ("mm: introduce put_user_page*(), placeholder versions"). That commit has an extensive description of the problem and the planned steps to solve it, but the highlights are:
> That is one horridly mangled Changelog there :-/ It looks like it's partially duplicated.
Yeah. It took so long to merge that I think I was no longer able to actually see the commit description, after N readings. sigh
> Anyway; no objections to any of that, but I just wanted to mention that there are other problems with long term pinning that haven't been mentioned, notably they inhibit compaction.
> A long time ago I proposed an interface to mark pages as pinned, such that we could run compaction before we actually did the pinning.
This is all heading toward marking pages as pinned, so we should finally get there. I'll post the RFC for tracking pinned pages shortly.
thanks,