This series adds support to Nouveau for atomic memory operations on OpenCL shared virtual memory (SVM). This is achieved using the atomic PTE bits on the GPU to only permit atomic operations to system memory when a page is not mapped in userspace on the CPU.
This is implemented by adding a mode to migrate_vma_pages() which unmaps and isolates existing pages from the CPU and pins them. The original userspace page table entries are migrated to point to device private pages allocated by the driver. This allows the driver to enable GPU atomic access to the page as it will receive a callback when CPU userspace needs to access it.
In response to this callback the driver revokes the atomic access permission from the GPU and migrates entries to point back to the original page. The original page is unpinned as part of the migration operation which also returns it to the LRU.
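To illustrate, the driver-side flow for taking exclusive access ends up looking roughly like the sketch below. This is an illustration rather than code from the series itself; my_alloc_device_private_page() is a placeholder for a driver-specific device private page allocator.

#include <linux/mm.h>
#include <linux/migrate.h>

/*
 * Sketch only - not part of this series. Shows roughly how a driver would
 * unmap and pin a single system page for exclusive device access.
 * my_alloc_device_private_page() is a placeholder for a driver-specific
 * device private page allocator.
 */
static int unmap_and_pin_example(struct vm_area_struct *vma,
				 unsigned long addr, struct page **pinned)
{
	unsigned long src_pfn = MIGRATE_PFN_PIN, dst_pfn = 0;
	struct migrate_vma args = {
		.vma	= vma,
		.start	= addr,
		.end	= addr + PAGE_SIZE,
		.src	= &src_pfn,
		.dst	= &dst_pfn,
		.flags	= MIGRATE_VMA_SELECT_SYSTEM,
	};
	struct page *dpage;
	int ret;

	ret = migrate_vma_setup(&args);
	if (ret)
		return ret;

	if (src_pfn & MIGRATE_PFN_MIGRATE) {
		/* Device private page that stands in for the pinned page. */
		dpage = my_alloc_device_private_page();
		if (dpage) {
			lock_page(dpage);
			dst_pfn = migrate_pfn(page_to_pfn(dpage)) |
				  MIGRATE_PFN_LOCKED;
			/*
			 * The original system page is now unmapped, pinned
			 * and off the LRU; remember it so migrate_to_ram()
			 * can remap and unpin it later.
			 */
			*pinned = migrate_pfn_to_page(src_pfn);
		}
	}

	migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return 0;
}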
Patch 3 contains the bulk of the memory management changes to implement unmap and pin.
Patches 6-9 extend Nouveau to use the new mode, allowing system-wide atomics for OpenCL SVM to be implemented.
This has been tested using the latest upstream Mesa userspace with a simple OpenCL test program which checks the results of atomic GPU operations on a buffer whilst also writing to the same buffer from the CPU.
Problems yet to be addressed:
Recent changes to pin_user_pages() prevent the creation of pinned pages in ZONE_MOVABLE. This series allows pinned pages to be created in ZONE_MOVABLE, as attempts to migrate may fail, which would be fatal to userspace.
In this case migration of the pinned page is unnecessary, as the page can be unpinned at any time by having the driver revoke atomic permission as it does for the migrate_to_ram() callback. However, a method of calling this when memory needs to be moved has yet to be resolved, so any discussion is welcome.
Alistair Popple (9):
  mm/migrate.c: Always allow device private pages to migrate
  mm/migrate.c: Allow pfn flags to be passed to migrate_vma_setup()
  mm/migrate: Add a unmap and pin migration mode
  Documentation: Add unmap and pin to HMM
  hmm-tests: Add test for unmap and pin
  nouveau/dmem: Only map migrating pages
  nouveau/svm: Refactor nouveau_range_fault
  nouveau/dmem: Add support for multiple page types
  nouveau/svm: Implement atomic SVM access
 Documentation/vm/hmm.rst                      |  22 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c            |   4 +-
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_dmem.c        | 190 +++++++++++++++---
 drivers/gpu/drm/nouveau/nouveau_dmem.h        |   9 +
 drivers/gpu/drm/nouveau/nouveau_svm.c         | 148 +++++++++++---
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c    |   6 +
 include/linux/migrate.h                       |   2 +
 include/linux/migrate_mode.h                  |   1 +
 lib/test_hmm.c                                | 109 ++++++++--
 lib/test_hmm_uapi.h                           |   1 +
 mm/migrate.c                                  |  82 +++++---
 tools/testing/selftests/vm/hmm-tests.c        |  49 +++++
 14 files changed, 524 insertions(+), 101 deletions(-)
Device private pages are used to represent device memory that is not directly accessible from the CPU. Extra references to a device private page are only used to ensure the struct page itself remains valid whilst waiting for migration entries. Therefore extra references should not prevent device private page migration, as that can lead to failures migrating pages back to the CPU, which are fatal to the user process.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 mm/migrate.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 20ca887ea769..053228559fd3 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -405,8 +405,13 @@ int migrate_page_move_mapping(struct address_space *mapping,
 	int nr = thp_nr_pages(page);
 
 	if (!mapping) {
-		/* Anonymous page without mapping */
-		if (page_count(page) != expected_count)
+		/*
+		 * Anonymous page without mapping. Device private pages should
+		 * never have extra references except during migration, but it
+		 * is safe to ignore these.
+		 */
+		if (!is_device_private_page(page) &&
+				page_count(page) != expected_count)
 			return -EAGAIN;
 
 		/* No turning back from here */
On Tue, Feb 09, 2021 at 12:07:14PM +1100, Alistair Popple wrote:
This should identify the extra references in expected_count, just disabling this protection seems unsafe, ZONE_DEVICE is not so special that the refcount means nothing
Is this a side effect of the extra refcounts that Ralph was trying to get rid of? I'd rather see that work finished :)
Jason
On Wednesday, 10 February 2021 12:39:32 AM AEDT Jason Gunthorpe wrote:
This is similar to what migrate_vma_check_page() does now. The issue is that a migration wait takes a reference on the device private page, so you can end up with one thread stuck waiting for migration whilst the other can't migrate due to the extra refcount.
Given device private pages can't undergo GUP, and that it's not possible to differentiate the migration wait refcount from any other refcount, we assume any possible extra reference must be from a migration wait.
Is this a side effect of the extra refcounts that Ralph was trying to get rid of? I'd rather see that work finished :)
I'd like to see that finished too but I don't think it would help here as this is not a side effect of that.
- Alistair
Jason
On Wed, Feb 10, 2021 at 02:40:10PM +1100, Alistair Popple wrote:
GUP is not the only thing that elevates the refcount, I think this is an unsafe assumption
Why is migration holding an extra refcount anyhow?
Jason
Currently migrate_vma_setup() zeros both the src and dst pfn arrays. This means it is not possible to pass per-pfn flags to migrate_vma_setup(). A future patch introduces per-pfn flags for migrate_vma_setup(), so instead have existing callers zero both the src and dst pfn arrays themselves so they are not affected by the change.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 arch/powerpc/kvm/book3s_hv_uvmem.c | 4 ++--
 lib/test_hmm.c                     | 6 ++++--
 mm/migrate.c                       | 1 -
 3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c index 84e5a2dc8be5..d434783b272a 100644 --- a/arch/powerpc/kvm/book3s_hv_uvmem.c +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c @@ -506,7 +506,7 @@ static int __kvmppc_svm_page_out(struct vm_area_struct *vma, unsigned long end, unsigned long page_shift, struct kvm *kvm, unsigned long gpa) { - unsigned long src_pfn, dst_pfn = 0; + unsigned long src_pfn = 0, dst_pfn = 0; struct migrate_vma mig; struct page *dpage, *spage; struct kvmppc_uvmem_page_pvt *pvt; @@ -732,7 +732,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long page_shift, bool pagein) { - unsigned long src_pfn, dst_pfn = 0; + unsigned long src_pfn = 0, dst_pfn = 0; struct migrate_vma mig; struct page *spage; unsigned long pfn; diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 80a78877bd93..98848b96ff09 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -696,6 +696,8 @@ static int dmirror_migrate(struct dmirror *dmirror, if (next > vma->vm_end) next = vma->vm_end;
+ memset(src_pfns, 0, ARRAY_SIZE(src_pfns)); + memset(dst_pfns, 0, ARRAY_SIZE(dst_pfns)); args.vma = vma; args.src = src_pfns; args.dst = dst_pfns; @@ -1025,8 +1027,8 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args, static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf) { struct migrate_vma args; - unsigned long src_pfns; - unsigned long dst_pfns; + unsigned long src_pfns = 0; + unsigned long dst_pfns = 0; struct page *rpage; struct dmirror *dmirror; vm_fault_t ret; diff --git a/mm/migrate.c b/mm/migrate.c index 053228559fd3..fe8bb322e2e3 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -2874,7 +2874,6 @@ int migrate_vma_setup(struct migrate_vma *args) if (!args->src || !args->dst) return -EINVAL;
- memset(args->src, 0, sizeof(*args->src) * nr_pages); args->cpages = 0; args->npages = 0;
Some drivers need to ensure that a device has access to a particular user page whilst preventing userspace access to that page. For example, this is required to allow a driver to implement atomic access to a page when the device hardware does not support atomic access to system memory.
This could be implemented by migrating the data to the device; however, this is not always optimal and may fail in some circumstances. In these cases it is advantageous to remap the page for device access without actually migrating the data.
To allow this kind of access, introduce unmap and pin flags (MIGRATE_PFN_PIN/UNPIN) for migration pfns. These cause the original page to be remapped to the provided device private page as normal, but instead of returning or freeing the original CPU page it is pinned and left isolated from the LRU.
This ensures the page remains pinned so that a device may access it exclusively. Any userspace CPU accesses will fault and trigger the normal device private migrate_to_ram() callback, which must migrate the mapping back to the original page, after which the device will no longer have exclusive access to the page.
As the original page does not get freed, it is safe to allow the unmap and pin operation to proceed in cases where there are extra page references present.
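As an illustration (not part of this patch), the remap and unpin side in a driver's migrate_to_ram() callback would look something like the following sketch, where my_get_pinned_page() is a placeholder for however the driver tracks the original pinned page:

/*
 * Illustrative sketch only. In a driver's migrate_to_ram() callback the
 * original pinned system page is passed back as the destination together
 * with MIGRATE_PFN_UNPIN so that migrate_vma_finalize() drops the pin and
 * returns the page to the LRU. my_get_pinned_page() is a placeholder.
 */
static void remap_and_unpin_example(struct migrate_vma *args,
				    struct page *device_page)
{
	struct page *spage = my_get_pinned_page(device_page);

	if (!(*args->src & MIGRATE_PFN_MIGRATE))
		return;

	lock_page(spage);
	*args->dst = migrate_pfn(page_to_pfn(spage)) |
		     MIGRATE_PFN_LOCKED | MIGRATE_PFN_UNPIN;
}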
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 include/linux/migrate.h      |  2 +
 include/linux/migrate_mode.h |  1 +
 mm/migrate.c                 | 74 +++++++++++++++++++++++++-----------
 3 files changed, 54 insertions(+), 23 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 4594838a0f7c..449fc61f9a99 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -144,6 +144,8 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm, #define MIGRATE_PFN_MIGRATE (1UL << 1) #define MIGRATE_PFN_LOCKED (1UL << 2) #define MIGRATE_PFN_WRITE (1UL << 3) +#define MIGRATE_PFN_PIN (1UL << 4) +#define MIGRATE_PFN_UNPIN (1UL << 4) #define MIGRATE_PFN_SHIFT 6
static inline struct page *migrate_pfn_to_page(unsigned long mpfn) diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h index 883c99249033..823497eda927 100644 --- a/include/linux/migrate_mode.h +++ b/include/linux/migrate_mode.h @@ -17,6 +17,7 @@ enum migrate_mode { MIGRATE_SYNC_LIGHT, MIGRATE_SYNC, MIGRATE_SYNC_NO_COPY, + MIGRATE_REFERENCED, };
#endif /* MIGRATE_MODE_H_INCLUDED */ diff --git a/mm/migrate.c b/mm/migrate.c index fe8bb322e2e3..71edc2679c8e 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -410,7 +410,7 @@ int migrate_page_move_mapping(struct address_space *mapping, * never have extra references except during migration, but it * is safe to ignore these. */ - if (!is_device_private_page(page) && + if (!is_device_private_page(page) && extra_count >= 0 && page_count(page) != expected_count) return -EAGAIN;
@@ -421,6 +421,8 @@ int migrate_page_move_mapping(struct address_space *mapping, __SetPageSwapBacked(newpage);
return MIGRATEPAGE_SUCCESS; + } else if (extra_count < 0) { + return -EINVAL; }
oldzone = page_zone(page); @@ -704,12 +706,15 @@ int migrate_page(struct address_space *mapping,
BUG_ON(PageWriteback(page)); /* Writeback must be complete */
- rc = migrate_page_move_mapping(mapping, newpage, page, 0); + if (mode == MIGRATE_REFERENCED) + rc = migrate_page_move_mapping(mapping, newpage, page, -1); + else + rc = migrate_page_move_mapping(mapping, newpage, page, 0);
if (rc != MIGRATEPAGE_SUCCESS) return rc;
- if (mode != MIGRATE_SYNC_NO_COPY) + if (mode != MIGRATE_SYNC_NO_COPY && mode != MIGRATE_REFERENCED) migrate_page_copy(newpage, page); else migrate_page_states(newpage, page); @@ -2327,15 +2332,15 @@ static int migrate_vma_collect_hole(unsigned long start, if (!vma_is_anonymous(walk->vma)) { for (addr = start; addr < end; addr += PAGE_SIZE) { migrate->src[migrate->npages] = 0; - migrate->dst[migrate->npages] = 0; migrate->npages++; } return 0; }
for (addr = start; addr < end; addr += PAGE_SIZE) { - migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE; - migrate->dst[migrate->npages] = 0; + if (vma_is_anonymous(walk->vma) && + !(migrate->src[migrate->npages] & MIGRATE_PFN_PIN)) + migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE; migrate->npages++; migrate->cpages++; } @@ -2425,7 +2430,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, pte = *ptep;
if (pte_none(pte)) { - if (vma_is_anonymous(vma)) { + if (vma_is_anonymous(vma) && + !(migrate->src[migrate->npages] & MIGRATE_PFN_PIN)) { mpfn = MIGRATE_PFN_MIGRATE; migrate->cpages++; } @@ -2525,8 +2531,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, }
next: - migrate->dst[migrate->npages] = 0; - migrate->src[migrate->npages++] = mpfn; + migrate->src[migrate->npages++] |= mpfn; } arch_leave_lazy_mmu_mode(); pte_unmap_unlock(ptep - 1, ptl); @@ -2695,7 +2700,13 @@ static void migrate_vma_prepare(struct migrate_vma *migrate) put_page(page); }
- if (!migrate_vma_check_page(page)) { + /* + * If the page is being unmapped and pinned it isn't actually + * going to migrate, so it's safe to continue the operation with + * an elevated refcount. + */ + if (!migrate_vma_check_page(page) && + !(migrate->src[i] & MIGRATE_PFN_PIN)) { if (remap) { migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; migrate->cpages--; @@ -2757,25 +2768,34 @@ static void migrate_vma_unmap(struct migrate_vma *migrate) if (!page || !(migrate->src[i] & MIGRATE_PFN_MIGRATE)) continue;
- if (page_mapped(page)) { + if (page_mapped(page)) try_to_unmap(page, flags); - if (page_mapped(page)) - goto restore; + + if (page_mapped(page)) + migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; + + if (!migrate_vma_check_page(page) && + !(migrate->src[i] & MIGRATE_PFN_PIN)) + migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; + + if (migrate->src[i] & MIGRATE_PFN_PIN) { + if (page_maybe_dma_pinned(page)) + migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; + else + page_ref_add(page, GUP_PIN_COUNTING_BIAS); }
- if (migrate_vma_check_page(page)) + if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE)) { + migrate->cpages--; + restore++; continue; - -restore: - migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; - migrate->cpages--; - restore++; + } }
for (addr = start, i = 0; i < npages && restore; addr += PAGE_SIZE, i++) { struct page *page = migrate_pfn_to_page(migrate->src[i]);
- if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE)) + if (!page || (migrate->src[i] & MIGRATE_PFN_MIGRATE)) continue;
remove_migration_ptes(page, page, false); @@ -3092,7 +3112,11 @@ void migrate_vma_pages(struct migrate_vma *migrate) } }
- r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY); + if (migrate->src[i] & MIGRATE_PFN_PIN) + r = migrate_page(mapping, newpage, page, MIGRATE_REFERENCED); + else + r = migrate_page(mapping, newpage, page, MIGRATE_SYNC_NO_COPY); + if (r != MIGRATEPAGE_SUCCESS) migrate->src[i] &= ~MIGRATE_PFN_MIGRATE; } @@ -3148,15 +3172,19 @@ void migrate_vma_finalize(struct migrate_vma *migrate)
if (is_zone_device_page(page)) put_page(page); - else + else if (!(migrate->src[i] & MIGRATE_PFN_PIN)) putback_lru_page(page);
if (newpage != page) { unlock_page(newpage); if (is_zone_device_page(newpage)) put_page(newpage); - else + else { + if (migrate->dst[i] & MIGRATE_PFN_UNPIN) + page_ref_sub(newpage, GUP_PIN_COUNTING_BIAS); + putback_lru_page(newpage); + } } } }
Update the HMM documentation to include information on the unmap and pin operation.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 Documentation/vm/hmm.rst | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst index 09e28507f5b2..83234984f656 100644 --- a/Documentation/vm/hmm.rst +++ b/Documentation/vm/hmm.rst @@ -346,7 +346,15 @@ between device driver specific code and shared common code: from the LRU (if system memory since device private pages are not on the LRU), unmapped from the process, and a special migration PTE is inserted in place of the original PTE. - migrate_vma_setup() also clears the ``args->dst`` array. + + A device driver may also initialise ``src`` entries with the + ``MIGRATE_PFN_PIN`` flag. This allows the device driver to unmap and pin + the existing system page in place whilst migrating page metadata to a + device private page. This leaves the page isolated from the LRU and gives + the device exclusive access to the page data without the need to migrate + data as any CPU access will trigger a fault. The device driver needs to + keep track of the ``src`` page as it effectively becomes the owner of + the page and needs to pass it in when remapping and unpinning the page.
3. The device driver allocates destination pages and copies source pages to destination pages. @@ -357,8 +365,8 @@ between device driver specific code and shared common code: array for that page.
The driver then allocates either a device private struct page or a - system memory page, locks the page with ``lock_page()``, and fills in the - ``dst`` array entry with:: + system memory page, locks the page with ``lock_page()``, and fills in + the ``dst`` array entry with::
dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
@@ -373,6 +381,14 @@ between device driver specific code and shared common code: destination or clear the destination device private memory if the pointer is ``NULL`` meaning the source page was not populated in system memory.
+ Alternatively a driver that is remapping and unpinning a source page + obtained from a ``MIGRATE_PFN_PIN`` operation should lock the original + source page and pass it in along with the ``MIGRATE_PFN_UNPIN`` flag + without any need to copy data:: + + dst[i] = migrate_pfn(page_to_pfn(spage)) | MIGRATE_PFN_LOCKED + | MIGRATE_PFN_UNPIN; + 4. ``migrate_vma_pages()``
This step is where the migration is actually "committed".
Adds a basic test of the HMM unmap and pin operation.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 lib/test_hmm.c                         | 107 ++++++++++++++++++++----
 lib/test_hmm_uapi.h                    |   1 +
 tools/testing/selftests/vm/hmm-tests.c |  49 +++++++++++
 3 files changed, 140 insertions(+), 17 deletions(-)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c index 98848b96ff09..c78a473250a3 100644 --- a/lib/test_hmm.c +++ b/lib/test_hmm.c @@ -46,6 +46,7 @@ struct dmirror_bounce { unsigned long cpages; };
+#define DPT_XA_TAG_ATOMIC 1UL #define DPT_XA_TAG_WRITE 3UL
/* @@ -83,6 +84,7 @@ struct dmirror_device { struct cdev cdevice; struct hmm_devmem *devmem;
+ unsigned int devmem_faults; unsigned int devmem_capacity; unsigned int devmem_count; struct dmirror_chunk **devmem_chunks; @@ -203,8 +205,18 @@ static void dmirror_do_update(struct dmirror *dmirror, unsigned long start, * Therefore, it is OK to just clear the entry. */ xa_for_each_range(&dmirror->pt, pfn, entry, start >> PAGE_SHIFT, - end >> PAGE_SHIFT) + end >> PAGE_SHIFT) { + /* + * Typically this would be done in devmap free page, but as + * we're using the XArray to store the reference to the original + * page do it here as it doesn't matter if clean up of the + * pinned page is delayed. + */ + if (xa_pointer_tag(entry) == DPT_XA_TAG_ATOMIC) + unpin_user_page(xa_untag_pointer(entry)); + xa_erase(&dmirror->pt, pfn); + } }
static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni, @@ -571,7 +583,8 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice) }
static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args, - struct dmirror *dmirror) + struct dmirror *dmirror, + int allow_ref) { struct dmirror_device *mdevice = dmirror->mdevice; const unsigned long *src = args->src; @@ -598,9 +611,17 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args, continue;
rpage = dpage->zone_device_data; - if (spage) + if (spage && !(*src & MIGRATE_PFN_PIN)) copy_highpage(rpage, spage); else + /* + * In the MIGRATE_PFN_PIN case we don't really + * need rpage at all because the existing page is + * staying in place and will be mapped. However we need + * somewhere to store dmirror and that place is + * rpage->zone_device_data so we keep it for + * simplicity. + */ clear_highpage(rpage);
/* @@ -620,7 +641,8 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args, }
static int dmirror_migrate_finalize_and_map(struct migrate_vma *args, - struct dmirror *dmirror) + struct dmirror *dmirror, + int allow_ref) { unsigned long start = args->start; unsigned long end = args->end; @@ -647,8 +669,14 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args, * Store the page that holds the data so the page table * doesn't have to deal with ZONE_DEVICE private pages. */ - entry = dpage->zone_device_data; - if (*dst & MIGRATE_PFN_WRITE) + if (*src & MIGRATE_PFN_PIN) + entry = migrate_pfn_to_page(*src); + else + entry = dpage->zone_device_data; + + if (*src & MIGRATE_PFN_PIN) + entry = xa_tag_pointer(entry, DPT_XA_TAG_ATOMIC); + else if (*dst & MIGRATE_PFN_WRITE) entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE); entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC); if (xa_is_err(entry)) { @@ -662,7 +690,8 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args, }
static int dmirror_migrate(struct dmirror *dmirror, - struct hmm_dmirror_cmd *cmd) + struct hmm_dmirror_cmd *cmd, + int allow_ref) { unsigned long start, end, addr; unsigned long size = cmd->npages << PAGE_SHIFT; @@ -673,7 +702,7 @@ static int dmirror_migrate(struct dmirror *dmirror, struct dmirror_bounce bounce; struct migrate_vma args; unsigned long next; - int ret; + int i, ret;
start = cmd->addr; end = start + size; @@ -696,8 +725,13 @@ static int dmirror_migrate(struct dmirror *dmirror, if (next > vma->vm_end) next = vma->vm_end;
- memset(src_pfns, 0, ARRAY_SIZE(src_pfns)); - memset(dst_pfns, 0, ARRAY_SIZE(dst_pfns)); + if (allow_ref) + for (i = 0; i < 64; ++i) + src_pfns[i] = MIGRATE_PFN_PIN; + else + memset(src_pfns, 0, sizeof(src_pfns)); + memset(dst_pfns, 0, sizeof(dst_pfns)); + args.vma = vma; args.src = src_pfns; args.dst = dst_pfns; @@ -709,9 +743,9 @@ static int dmirror_migrate(struct dmirror *dmirror, if (ret) goto out;
- dmirror_migrate_alloc_and_copy(&args, dmirror); + dmirror_migrate_alloc_and_copy(&args, dmirror, allow_ref); migrate_vma_pages(&args); - dmirror_migrate_finalize_and_map(&args, dmirror); + dmirror_migrate_finalize_and_map(&args, dmirror, allow_ref); migrate_vma_finalize(&args); } mmap_read_unlock(mm); @@ -739,6 +773,28 @@ static int dmirror_migrate(struct dmirror *dmirror, return ret; }
+static int dmirror_migrate_pin(struct dmirror *dmirror, + struct hmm_dmirror_cmd *cmd) +{ + void *tmp; + int nr_pages = cmd->npages; + int ret; + + ret = dmirror_migrate(dmirror, cmd, true); + + tmp = kmalloc(nr_pages << PAGE_SHIFT, GFP_KERNEL); + if (!tmp) + return -ENOMEM; + + /* Make sure user access faults */ + dmirror->mdevice->devmem_faults = 0; + if (copy_from_user(tmp, u64_to_user_ptr(cmd->addr), nr_pages << PAGE_SHIFT)) + ret = -EFAULT; + cmd->faults = dmirror->mdevice->devmem_faults; + + return ret; +} + static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range, unsigned char *perm, unsigned long entry) { @@ -948,7 +1004,11 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp, break;
case HMM_DMIRROR_MIGRATE: - ret = dmirror_migrate(dmirror, &cmd); + ret = dmirror_migrate(dmirror, &cmd, false); + break; + + case HMM_DMIRROR_MIGRATE_PIN: + ret = dmirror_migrate_pin(dmirror, &cmd); break;
case HMM_DMIRROR_SNAPSHOT: @@ -1004,20 +1064,31 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args, for (addr = start; addr < end; addr += PAGE_SIZE, src++, dst++) { struct page *dpage, *spage; + void *entry;
spage = migrate_pfn_to_page(*src); if (!spage || !(*src & MIGRATE_PFN_MIGRATE)) continue; - spage = spage->zone_device_data;
- dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr); + entry = xa_load(&dmirror->pt, addr >> PAGE_SHIFT); + if (entry && xa_pointer_tag(entry) == DPT_XA_TAG_ATOMIC) { + spage = NULL; + dpage = xa_untag_pointer(entry); + *dst = migrate_pfn(page_to_pfn(dpage)) | + MIGRATE_PFN_LOCKED | MIGRATE_PFN_UNPIN; + } else { + spage = spage->zone_device_data; + dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr); + *dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED; + } + if (!dpage) continue;
lock_page(dpage); xa_erase(&dmirror->pt, addr >> PAGE_SHIFT); - copy_highpage(dpage, spage); - *dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED; + if (spage) + copy_highpage(dpage, spage); if (*src & MIGRATE_PFN_WRITE) *dst |= MIGRATE_PFN_WRITE; } @@ -1041,6 +1112,8 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf) rpage = vmf->page->zone_device_data; dmirror = rpage->zone_device_data;
+ dmirror->mdevice->devmem_faults++; + /* FIXME demonstrate how we can adjust migrate range */ args.vma = vmf->vma; args.start = vmf->address; diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h index 670b4ef2a5b6..b40f4e6affe0 100644 --- a/lib/test_hmm_uapi.h +++ b/lib/test_hmm_uapi.h @@ -33,6 +33,7 @@ struct hmm_dmirror_cmd { #define HMM_DMIRROR_WRITE _IOWR('H', 0x01, struct hmm_dmirror_cmd) #define HMM_DMIRROR_MIGRATE _IOWR('H', 0x02, struct hmm_dmirror_cmd) #define HMM_DMIRROR_SNAPSHOT _IOWR('H', 0x03, struct hmm_dmirror_cmd) +#define HMM_DMIRROR_MIGRATE_PIN _IOWR('H', 0x04, struct hmm_dmirror_cmd)
/* * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT. diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c index 5d1ac691b9f4..7111ebab93c7 100644 --- a/tools/testing/selftests/vm/hmm-tests.c +++ b/tools/testing/selftests/vm/hmm-tests.c @@ -947,6 +947,55 @@ TEST_F(hmm, migrate_fault) hmm_buffer_free(buffer); }
+TEST_F(hmm, migrate_fault_pin) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + unsigned long i; + int *ptr; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Initialize buffer in system memory. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ptr[i] = i; + + /* Migrate memory to device. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE_PIN, buffer, npages); + ASSERT_EQ(ret, 0); + ASSERT_EQ(buffer->cpages, npages); + + /* Check what the device read. */ + for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + ASSERT_EQ(buffer->faults, npages); + + /* Fault pages back to system memory and check them. */ + for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) + ASSERT_EQ(ptr[i], i); + + hmm_buffer_free(buffer); +} + /* * Migrate anonymous shared memory to device private memory. */
Only pages which were actually migrated should be mapped on the GPU. migrate_vma_pages() clears MIGRATE_PFN_MIGRATE in the src_pfn array for pages that failed to migrate, so test this prior to mapping the pages on the GPU. If any pages failed to migrate, don't install any mappings; the GPU will demand fault them as required.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..9579bd001f11 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -618,8 +618,9 @@ static void nouveau_dmem_migrate_chunk(struct nouveau_drm *drm,
 		dma_addr_t *dma_addrs, u64 *pfns)
 {
 	struct nouveau_fence *fence;
-	unsigned long addr = args->start, nr_dma = 0, i;
+	unsigned long addr = args->start, nr_dma = 0, i, npages;
 
+	npages = (args->end - args->start) >> PAGE_SHIFT;
 	for (i = 0; addr < args->end; i++) {
 		args->dst[i] = nouveau_dmem_migrate_copy_one(drm, svmm,
 				args->src[i], dma_addrs + nr_dma, pfns + i);
@@ -631,7 +632,17 @@ static void nouveau_dmem_migrate_chunk(struct nouveau_drm *drm,
 	nouveau_fence_new(drm->dmem->migrate.chan, false, &fence);
 	migrate_vma_pages(args);
 	nouveau_dmem_fence_done(&fence);
-	nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i);
+
+	for (i = 0; i < npages; i++)
+		if (!(args->src[i] & MIGRATE_PFN_MIGRATE))
+			break;
+
+	/*
+	 * If all pages were migrated successfully map them on the GPU. If any
+	 * failed just let the GPU fault to create the mapping.
+	 */
+	if (i == npages)
+		nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, npages);
 
 	while (nr_dma--) {
 		dma_unmap_page(drm->dev->dev, dma_addrs[nr_dma], PAGE_SIZE,
Call mmu_interval_notifier_insert() as part of nouveau_range_fault(). This doesn't introduce any functional change but makes it easier for a subsequent patch to alter the behaviour of nouveau_range_fault() to support GPU atomic operations.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_svm.c | 34 ++++++++++++++++-----------
 1 file changed, 20 insertions(+), 14 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c index 1c3f890377d2..63332387402e 100644 --- a/drivers/gpu/drm/nouveau/nouveau_svm.c +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c @@ -567,18 +567,27 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm, unsigned long hmm_pfns[1]; struct hmm_range range = { .notifier = ¬ifier->notifier, - .start = notifier->notifier.interval_tree.start, - .end = notifier->notifier.interval_tree.last + 1, .default_flags = hmm_flags, .hmm_pfns = hmm_pfns, .dev_private_owner = drm->dev, }; - struct mm_struct *mm = notifier->notifier.mm; + struct mm_struct *mm = svmm->notifier.mm; int ret;
+ ret = mmu_interval_notifier_insert(¬ifier->notifier, mm, + args->p.addr, args->p.size, + &nouveau_svm_mni_ops); + if (ret) + return ret; + + range.start = notifier->notifier.interval_tree.start; + range.end = notifier->notifier.interval_tree.last + 1; + while (true) { - if (time_after(jiffies, timeout)) - return -EBUSY; + if (time_after(jiffies, timeout)) { + ret = -EBUSY; + goto out; + }
range.notifier_seq = mmu_interval_read_begin(range.notifier); mmap_read_lock(mm); @@ -587,7 +596,7 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm, if (ret) { if (ret == -EBUSY) continue; - return ret; + goto out; }
mutex_lock(&svmm->mutex); @@ -606,6 +615,9 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm, svmm->vmm->vmm.object.client->super = false; mutex_unlock(&svmm->mutex);
+out: + mmu_interval_notifier_remove(¬ifier->notifier); + return ret; }
@@ -727,14 +739,8 @@ nouveau_svm_fault(struct nvif_notify *notify) }
notifier.svmm = svmm; - ret = mmu_interval_notifier_insert(¬ifier.notifier, mm, - args.i.p.addr, args.i.p.size, - &nouveau_svm_mni_ops); - if (!ret) { - ret = nouveau_range_fault(svmm, svm->drm, &args.i, - sizeof(args), hmm_flags, ¬ifier); - mmu_interval_notifier_remove(¬ifier.notifier); - } + ret = nouveau_range_fault(svmm, svm->drm, &args.i, + sizeof(args), hmm_flags, ¬ifier); mmput(mm);
limit = args.i.p.addr + args.i.p.size;
Device private pages are used to track a per-page migrate_to_ram() callback, which is called when the CPU attempts to access a GPU page. Currently the same callback is used for all GPU pages tracked by Nouveau. However, a future patch requires support for calling a different callback when accessing some GPU pages.
This patch extends the existing Nouveau device private page allocator to make it easier to allocate device private pages with different callbacks, but should not introduce any functional changes.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 27 ++++++++++++++------------
 drivers/gpu/drm/nouveau/nouveau_dmem.h |  5 +++++
 2 files changed, 20 insertions(+), 12 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 9579bd001f11..8fb4949f3778 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -67,6 +67,7 @@ struct nouveau_dmem_chunk { struct nouveau_bo *bo; struct nouveau_drm *drm; unsigned long callocated; + enum nouveau_dmem_type type; struct dev_pagemap pagemap; };
@@ -81,7 +82,7 @@ struct nouveau_dmem { struct nouveau_dmem_migrate migrate; struct list_head chunks; struct mutex mutex; - struct page *free_pages; + struct page *free_pages[NOUVEAU_DMEM_NTYPES]; spinlock_t lock; };
@@ -112,8 +113,8 @@ static void nouveau_dmem_page_free(struct page *page) struct nouveau_dmem *dmem = chunk->drm->dmem;
spin_lock(&dmem->lock); - page->zone_device_data = dmem->free_pages; - dmem->free_pages = page; + page->zone_device_data = dmem->free_pages[chunk->type]; + dmem->free_pages[chunk->type] = page;
WARN_ON(!chunk->callocated); chunk->callocated--; @@ -224,7 +225,8 @@ static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = { };
static int -nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage) +nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage, + enum nouveau_dmem_type type) { struct nouveau_dmem_chunk *chunk; struct resource *res; @@ -248,6 +250,7 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage) }
chunk->drm = drm; + chunk->type = type; chunk->pagemap.type = MEMORY_DEVICE_PRIVATE; chunk->pagemap.range.start = res->start; chunk->pagemap.range.end = res->end; @@ -279,8 +282,8 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage) page = pfn_to_page(pfn_first); spin_lock(&drm->dmem->lock); for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) { - page->zone_device_data = drm->dmem->free_pages; - drm->dmem->free_pages = page; + page->zone_device_data = drm->dmem->free_pages[type]; + drm->dmem->free_pages[type] = page; } *ppage = page; chunk->callocated++; @@ -304,22 +307,22 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage) }
static struct page * -nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm) +nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, enum nouveau_dmem_type type) { struct nouveau_dmem_chunk *chunk; struct page *page = NULL; int ret;
spin_lock(&drm->dmem->lock); - if (drm->dmem->free_pages) { - page = drm->dmem->free_pages; - drm->dmem->free_pages = page->zone_device_data; + if (drm->dmem->free_pages[type]) { + page = drm->dmem->free_pages[type]; + drm->dmem->free_pages[type] = page->zone_device_data; chunk = nouveau_page_to_chunk(page); chunk->callocated++; spin_unlock(&drm->dmem->lock); } else { spin_unlock(&drm->dmem->lock); - ret = nouveau_dmem_chunk_alloc(drm, &page); + ret = nouveau_dmem_chunk_alloc(drm, &page, type); if (ret) return NULL; } @@ -577,7 +580,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm, if (!(src & MIGRATE_PFN_MIGRATE)) goto out;
- dpage = nouveau_dmem_page_alloc_locked(drm); + dpage = nouveau_dmem_page_alloc_locked(drm, NOUVEAU_DMEM); if (!dpage) goto out;
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.h b/drivers/gpu/drm/nouveau/nouveau_dmem.h index 64da5d3635c8..02e261c4acf1 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.h +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.h @@ -28,6 +28,11 @@ struct nouveau_drm; struct nouveau_svmm; struct hmm_range;
+enum nouveau_dmem_type { + NOUVEAU_DMEM, + NOUVEAU_DMEM_NTYPES, /* Number of types, must be last */ +}; + #if IS_ENABLED(CONFIG_DRM_NOUVEAU_SVM) void nouveau_dmem_init(struct nouveau_drm *); void nouveau_dmem_fini(struct nouveau_drm *);
Some NVIDIA GPUs do not support direct atomic access to system memory via PCIe. Instead, this must be emulated by granting the GPU exclusive access to the memory. This is achieved by migrating the userspace mappings to device private pages whilst leaving the actual page in place.
The driver then grants the GPU permission, via the GPU page tables, to update the page undergoing atomic access. When CPU access to the page is required a CPU fault is raised, which calls into the device driver via migrate_to_ram() to remap the page into the CPU page tables and revoke GPU access.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
 drivers/gpu/drm/nouveau/include/nvif/if000c.h |   1 +
 drivers/gpu/drm/nouveau/nouveau_dmem.c        | 148 ++++++++++++++++--
 drivers/gpu/drm/nouveau/nouveau_dmem.h        |   4 +
 drivers/gpu/drm/nouveau/nouveau_svm.c         | 116 ++++++++++++--
 drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h |   1 +
 .../drm/nouveau/nvkm/subdev/mmu/vmmgp100.c    |   6 +
 6 files changed, 249 insertions(+), 27 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/include/nvif/if000c.h b/drivers/gpu/drm/nouveau/include/nvif/if000c.h index d6dd40f21eed..9c7ff56831c5 100644 --- a/drivers/gpu/drm/nouveau/include/nvif/if000c.h +++ b/drivers/gpu/drm/nouveau/include/nvif/if000c.h @@ -77,6 +77,7 @@ struct nvif_vmm_pfnmap_v0 { #define NVIF_VMM_PFNMAP_V0_APER 0x00000000000000f0ULL #define NVIF_VMM_PFNMAP_V0_HOST 0x0000000000000000ULL #define NVIF_VMM_PFNMAP_V0_VRAM 0x0000000000000010ULL +#define NVIF_VMM_PFNMAP_V0_A 0x0000000000000004ULL #define NVIF_VMM_PFNMAP_V0_W 0x0000000000000002ULL #define NVIF_VMM_PFNMAP_V0_V 0x0000000000000001ULL #define NVIF_VMM_PFNMAP_V0_NONE 0x0000000000000000ULL diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c index 8fb4949f3778..7b103670af56 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c @@ -69,6 +69,7 @@ struct nouveau_dmem_chunk { unsigned long callocated; enum nouveau_dmem_type type; struct dev_pagemap pagemap; + struct page **atomic_pages; };
struct nouveau_dmem_migrate { @@ -107,12 +108,51 @@ unsigned long nouveau_dmem_page_addr(struct page *page) return chunk->bo->offset + off; }
+static struct page *nouveau_dpage_to_atomic(struct page *dpage) +{ + struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(dpage); + int index; + + if (WARN_ON_ONCE(chunk->type != NOUVEAU_ATOMIC)) + return NULL; + + index = page_to_pfn(dpage) - (chunk->pagemap.range.start >> PAGE_SHIFT); + + if (WARN_ON_ONCE(index > DMEM_CHUNK_NPAGES)) + return NULL; + + return chunk->atomic_pages[index]; +} + +void nouveau_dmem_set_atomic(struct page *dpage, struct page *atomic_page) +{ + struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(dpage); + int index; + + if (WARN_ON_ONCE(chunk->type != NOUVEAU_ATOMIC)) + return; + + index = page_to_pfn(dpage) - (chunk->pagemap.range.start >> PAGE_SHIFT); + + if (WARN_ON_ONCE(index > DMEM_CHUNK_NPAGES)) + return; + + chunk->atomic_pages[index] = atomic_page; +} + static void nouveau_dmem_page_free(struct page *page) { struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page); struct nouveau_dmem *dmem = chunk->drm->dmem;
spin_lock(&dmem->lock); + if (chunk->type == NOUVEAU_ATOMIC) { + struct page *atomic_page; + + atomic_page = nouveau_dpage_to_atomic(page); + if (atomic_page) + unpin_user_page(atomic_page); + } page->zone_device_data = dmem->free_pages[chunk->type]; dmem->free_pages[chunk->type] = page;
@@ -125,6 +165,65 @@ static void nouveau_dmem_page_free(struct page *page) spin_unlock(&dmem->lock); }
+static vm_fault_t nouveau_atomic_page_migrate(struct vm_fault *vmf) +{ + struct nouveau_drm *drm = page_to_drm(vmf->page); + struct page *dpage = vmf->page; + struct nouveau_svmm *svmm = dpage->zone_device_data; + struct migrate_vma args; + unsigned long src_pfn = 0, dst_pfn = 0; + struct page *src_page, *old_page; + int retry; + int ret = 0; + + args.vma = vmf->vma; + args.src = &src_pfn; + args.dst = &dst_pfn; + args.start = vmf->address; + args.end = vmf->address + PAGE_SIZE; + args.pgmap_owner = drm->dev; + args.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE; + + for (retry = 0; retry < 10; retry++) { + ret = migrate_vma_setup(&args); + if (ret) { + ret = VM_FAULT_SIGBUS; + goto out; + } + + if (src_pfn & MIGRATE_PFN_MIGRATE) + break; + + cond_resched(); + } + + src_page = migrate_pfn_to_page(src_pfn); + old_page = nouveau_dpage_to_atomic(dpage); + + /* + * If we can't get a lock on the atomic page it means another thread is + * racing or migrating so retry the fault. + */ + if (src_page && trylock_page(old_page)) + dst_pfn = migrate_pfn(page_to_pfn(old_page)) | + MIGRATE_PFN_LOCKED | MIGRATE_PFN_UNPIN; + else + ret = VM_FAULT_RETRY; + + migrate_vma_pages(&args); + if (src_pfn & MIGRATE_PFN_MIGRATE) { + nouveau_svmm_invalidate(svmm, args.start, args.end); + nouveau_dmem_set_atomic(dpage, NULL); + } + migrate_vma_finalize(&args); + +out: + if (ret == VM_FAULT_RETRY) + mmap_read_unlock(vmf->vma->vm_mm); + + return ret; +} + static void nouveau_dmem_fence_done(struct nouveau_fence **fence) { if (fence) { @@ -224,6 +323,11 @@ static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = { .migrate_to_ram = nouveau_dmem_migrate_to_ram, };
+static const struct dev_pagemap_ops nouveau_atomic_pagemap_ops = { + .page_free = nouveau_dmem_page_free, + .migrate_to_ram = nouveau_atomic_page_migrate, +}; + static int nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage, enum nouveau_dmem_type type) @@ -255,18 +359,30 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage, chunk->pagemap.range.start = res->start; chunk->pagemap.range.end = res->end; chunk->pagemap.nr_range = 1; - chunk->pagemap.ops = &nouveau_dmem_pagemap_ops; + if (type == NOUVEAU_DMEM) + chunk->pagemap.ops = &nouveau_dmem_pagemap_ops; + else + chunk->pagemap.ops = &nouveau_atomic_pagemap_ops; chunk->pagemap.owner = drm->dev;
- ret = nouveau_bo_new(&drm->client, DMEM_CHUNK_SIZE, 0, - NOUVEAU_GEM_DOMAIN_VRAM, 0, 0, NULL, NULL, - &chunk->bo); - if (ret) - goto out_release; + if (type == NOUVEAU_DMEM) { + ret = nouveau_bo_new(&drm->client, DMEM_CHUNK_SIZE, 0, + NOUVEAU_GEM_DOMAIN_VRAM, 0, 0, NULL, NULL, + &chunk->bo); + if (ret) + goto out_release;
- ret = nouveau_bo_pin(chunk->bo, NOUVEAU_GEM_DOMAIN_VRAM, false); - if (ret) - goto out_bo_free; + ret = nouveau_bo_pin(chunk->bo, NOUVEAU_GEM_DOMAIN_VRAM, false); + if (ret) + goto out_bo_free; + } else { + chunk->atomic_pages = kcalloc(DMEM_CHUNK_NPAGES, + sizeof(void *), GFP_KERNEL); + if (!chunk->atomic_pages) { + ret = -ENOMEM; + goto out_release; + } + }
ptr = memremap_pages(&chunk->pagemap, numa_node_id()); if (IS_ERR(ptr)) { @@ -289,8 +405,8 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage, chunk->callocated++; spin_unlock(&drm->dmem->lock);
- NV_INFO(drm, "DMEM: registered %ldMB of device memory\n", - DMEM_CHUNK_SIZE >> 20); + NV_INFO(drm, "DMEM: registered %ldMB of device %smemory\n", + DMEM_CHUNK_SIZE >> 20, type == NOUVEAU_ATOMIC ? "atomic " : "");
return 0;
@@ -306,7 +422,7 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage, return ret; }
-static struct page * +struct page * nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, enum nouveau_dmem_type type) { struct nouveau_dmem_chunk *chunk; @@ -382,8 +498,12 @@ nouveau_dmem_fini(struct nouveau_drm *drm) mutex_lock(&drm->dmem->mutex);
list_for_each_entry_safe(chunk, tmp, &drm->dmem->chunks, list) { - nouveau_bo_unpin(chunk->bo); - nouveau_bo_ref(NULL, &chunk->bo); + if (chunk->type == NOUVEAU_DMEM) { + nouveau_bo_unpin(chunk->bo); + nouveau_bo_ref(NULL, &chunk->bo); + } else { + kfree(chunk->atomic_pages); + } list_del(&chunk->list); memunmap_pages(&chunk->pagemap); release_mem_region(chunk->pagemap.range.start, diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.h b/drivers/gpu/drm/nouveau/nouveau_dmem.h index 02e261c4acf1..6b52a7a8dea4 100644 --- a/drivers/gpu/drm/nouveau/nouveau_dmem.h +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.h @@ -30,6 +30,7 @@ struct hmm_range;
enum nouveau_dmem_type { NOUVEAU_DMEM, + NOUVEAU_ATOMIC, NOUVEAU_DMEM_NTYPES, /* Number of types, must be last */ };
@@ -45,6 +46,9 @@ int nouveau_dmem_migrate_vma(struct nouveau_drm *drm, unsigned long start, unsigned long end); unsigned long nouveau_dmem_page_addr(struct page *page); +struct page *nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, + enum nouveau_dmem_type type); +void nouveau_dmem_set_atomic(struct page *dpage, struct page *atomic_page);
#else /* IS_ENABLED(CONFIG_DRM_NOUVEAU_SVM) */ static inline void nouveau_dmem_init(struct nouveau_drm *drm) {} diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c index 63332387402e..e571c6907bfd 100644 --- a/drivers/gpu/drm/nouveau/nouveau_svm.c +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c @@ -35,6 +35,7 @@ #include <linux/sched/mm.h> #include <linux/sort.h> #include <linux/hmm.h> +#include <linux/idr.h>
struct nouveau_svm { struct nouveau_drm *drm; @@ -421,9 +422,9 @@ nouveau_svm_fault_cmp(const void *a, const void *b) return ret; if ((ret = (s64)fa->addr - fb->addr)) return ret; - /*XXX: atomic? */ - return (fa->access == 0 || fa->access == 3) - - (fb->access == 0 || fb->access == 3); + /* Atomic access (2) has highest priority */ + return (-1*(fa->access == 2) + (fa->access == 0 || fa->access == 3)) - + (-1*(fb->access == 2) + (fb->access == 0 || fb->access == 3)); }
static void @@ -555,10 +556,86 @@ static void nouveau_hmm_convert_pfn(struct nouveau_drm *drm, args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W; }
+static int nouveau_atomic_range_fault(struct nouveau_svmm *svmm, + struct nouveau_drm *drm, + struct nouveau_pfnmap_args *args, u32 size, + unsigned long hmm_flags, struct mm_struct *mm) +{ + struct page *dpage, *oldpage; + struct migrate_vma migrate_args; + unsigned long src_pfn = MIGRATE_PFN_PIN, dst_pfn = 0, start = args->p.addr; + int ret = 0; + + mmap_read_lock(mm); + + migrate_args.src = &src_pfn; + migrate_args.dst = &dst_pfn; + migrate_args.start = start; + migrate_args.end = migrate_args.start + PAGE_SIZE; + migrate_args.pgmap_owner = NULL; + migrate_args.flags = MIGRATE_VMA_SELECT_SYSTEM; + migrate_args.vma = find_vma_intersection(mm, migrate_args.start, migrate_args.end); + + ret = migrate_vma_setup(&migrate_args); + + if (ret) { + SVMM_ERR(svmm, "Unable to setup atomic page migration"); + ret = -ENOMEM; + goto out; + } + + oldpage = migrate_pfn_to_page(src_pfn); + + if (src_pfn & MIGRATE_PFN_MIGRATE) { + dpage = nouveau_dmem_page_alloc_locked(drm, NOUVEAU_ATOMIC); + if (!dpage) { + SVMM_ERR(svmm, "Unable to allocate atomic page"); + *migrate_args.dst = 0; + } else { + nouveau_dmem_set_atomic(dpage, oldpage); + dpage->zone_device_data = svmm; + *migrate_args.dst = migrate_pfn(page_to_pfn(dpage)) | + MIGRATE_PFN_LOCKED; + } + } + + migrate_vma_pages(&migrate_args); + + /* Map the page on the GPU for successfully migrated mappings. Do this + * before removing the migration PTEs in migrate_vma_finalize() as once + * that happens a CPU thread can race and invalidate the GPU mappings + * prior to mapping them here. + */ + if (src_pfn & MIGRATE_PFN_MIGRATE) { + args->p.page = 12; + args->p.size = PAGE_SIZE; + args->p.addr = start; + args->p.phys[0] = page_to_phys(oldpage) | + NVIF_VMM_PFNMAP_V0_V | + NVIF_VMM_PFNMAP_V0_A | + NVIF_VMM_PFNMAP_V0_HOST; + + if (migrate_args.vma->vm_flags & VM_WRITE) + args->p.phys[0] |= NVIF_VMM_PFNMAP_V0_W; + + mutex_lock(&svmm->mutex); + svmm->vmm->vmm.object.client->super = true; + ret = nvif_object_ioctl(&svmm->vmm->vmm.object, args, size, NULL); + svmm->vmm->vmm.object.client->super = false; + mutex_unlock(&svmm->mutex); + } + + migrate_vma_finalize(&migrate_args); + +out: + mmap_read_unlock(mm); + return ret; +} + static int nouveau_range_fault(struct nouveau_svmm *svmm, struct nouveau_drm *drm, struct nouveau_pfnmap_args *args, u32 size, - unsigned long hmm_flags, + unsigned long hmm_flags, int atomic, struct svm_notifier *notifier) { unsigned long timeout = @@ -608,12 +685,18 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm, break; }
- nouveau_hmm_convert_pfn(drm, &range, args); + if (atomic) { + mutex_unlock(&svmm->mutex); + ret = nouveau_atomic_range_fault(svmm, drm, args, + size, hmm_flags, mm); + } else { + nouveau_hmm_convert_pfn(drm, &range, args);
- svmm->vmm->vmm.object.client->super = true; - ret = nvif_object_ioctl(&svmm->vmm->vmm.object, args, size, NULL); - svmm->vmm->vmm.object.client->super = false; - mutex_unlock(&svmm->mutex); + svmm->vmm->vmm.object.client->super = true; + ret = nvif_object_ioctl(&svmm->vmm->vmm.object, args, size, NULL); + svmm->vmm->vmm.object.client->super = false; + mutex_unlock(&svmm->mutex); + }
out: mmu_interval_notifier_remove(¬ifier->notifier); @@ -637,7 +720,7 @@ nouveau_svm_fault(struct nvif_notify *notify) unsigned long hmm_flags; u64 inst, start, limit; int fi, fn; - int replay = 0, ret; + int replay = 0, atomic = 0, ret;
/* Parse available fault buffer entries into a cache, and update * the GET pointer so HW can reuse the entries. @@ -718,12 +801,15 @@ nouveau_svm_fault(struct nvif_notify *notify) /* * Determine required permissions based on GPU fault * access flags. - * XXX: atomic? */ switch (buffer->fault[fi]->access) { case 0: /* READ. */ hmm_flags = HMM_PFN_REQ_FAULT; break; + case 2: /* ATOMIC. */ + hmm_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE; + atomic = true; + break; case 3: /* PREFETCH. */ hmm_flags = 0; break; @@ -740,7 +826,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
notifier.svmm = svmm; ret = nouveau_range_fault(svmm, svm->drm, &args.i, - sizeof(args), hmm_flags, ¬ifier); + sizeof(args), hmm_flags, atomic, ¬ifier); mmput(mm);
limit = args.i.p.addr + args.i.p.size; @@ -760,7 +846,11 @@ nouveau_svm_fault(struct nvif_notify *notify) !(args.phys[0] & NVIF_VMM_PFNMAP_V0_V)) || (buffer->fault[fi]->access != 0 /* READ. */ && buffer->fault[fi]->access != 3 /* PREFETCH. */ && - !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W))) + !(args.phys[0] & NVIF_VMM_PFNMAP_V0_W)) || + (buffer->fault[fi]->access != 0 /* READ. */ && + buffer->fault[fi]->access != 1 /* WRITE. */ && + buffer->fault[fi]->access != 3 /* PREFETCH. */ && + !(args.phys[0] & NVIF_VMM_PFNMAP_V0_A))) break; }
diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h index a2b179568970..f6188aa9171c 100644 --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmm.h @@ -178,6 +178,7 @@ void nvkm_vmm_unmap_region(struct nvkm_vmm *, struct nvkm_vma *); #define NVKM_VMM_PFN_APER 0x00000000000000f0ULL #define NVKM_VMM_PFN_HOST 0x0000000000000000ULL #define NVKM_VMM_PFN_VRAM 0x0000000000000010ULL +#define NVKM_VMM_PFN_A 0x0000000000000004ULL #define NVKM_VMM_PFN_W 0x0000000000000002ULL #define NVKM_VMM_PFN_V 0x0000000000000001ULL #define NVKM_VMM_PFN_NONE 0x0000000000000000ULL diff --git a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c index 236db5570771..f02abd9cb4dd 100644 --- a/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c +++ b/drivers/gpu/drm/nouveau/nvkm/subdev/mmu/vmmgp100.c @@ -88,6 +88,9 @@ gp100_vmm_pgt_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt, if (!(*map->pfn & NVKM_VMM_PFN_W)) data |= BIT_ULL(6); /* RO. */
+ if (!(*map->pfn & NVKM_VMM_PFN_A)) + data |= BIT_ULL(7); /* Atomic disable. */ + if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) { addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT; addr = dma_map_page(dev, pfn_to_page(addr), 0, @@ -322,6 +325,9 @@ gp100_vmm_pd0_pfn(struct nvkm_vmm *vmm, struct nvkm_mmu_pt *pt, if (!(*map->pfn & NVKM_VMM_PFN_W)) data |= BIT_ULL(6); /* RO. */
+ if (!(*map->pfn & NVKM_VMM_PFN_A)) + data |= BIT_ULL(7); /* Atomic disable. */ + if (!(*map->pfn & NVKM_VMM_PFN_VRAM)) { addr = *map->pfn >> NVKM_VMM_PFN_ADDR_SHIFT; addr = dma_map_page(dev, pfn_to_page(addr), 0,
On Tue, Feb 09, 2021 at 12:07:13PM +1100, Alistair Popple wrote:
Why do we need to pin for gpu atomics? You still have the callback for cpu faults, so you can move the page as needed, and hence a long-term pin sounds like the wrong approach.
That would avoid all the hacking around long term pin constraints, because for real unmoveable long term pinned memory we really want to have all these checks. So I think we might be missing some other callbacks to be able to move these pages, instead of abusing longterm pins for lack of better tools.
Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Tuesday, 9 February 2021 9:27:05 PM AEDT Daniel Vetter wrote:
Recent changes to pin_user_pages() prevent the creation of pinned pages in ZONE_MOVABLE. This series allows pinned pages to be created in ZONE_MOVABLE, as attempts to migrate may fail, which would be fatal to userspace.
In this case migration of the pinned page is unnecessary, as the page can be unpinned at any time by having the driver revoke atomic permission as it does for the migrate_to_ram() callback.
Technically a real long term unmoveable pin isn't required, because as you say the page can be moved as needed at any time. However, I needed some way of stopping the CPU page from being freed once the userspace mappings for it had been removed. Obviously I could have just used get_page(), but from the perspective of page migration the result is much the same as a pin: a page which can't be moved because of the extra refcount.
The normal solution of registering an MMU notifier to unpin the page when it needs to be moved also doesn't work as the CPU page tables now point to the device-private page and hence the migration code won't call any invalidate notifiers for the CPU page.
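For reference, the pattern I mean is roughly the following; my_invalidate(), my_mni_ops and my_register_range() are made-up names for illustration only:

#include <linux/mmu_notifier.h>

/*
 * Illustrative only - the "normal solution" referred to above, with made-up
 * names. It does not work in this case because no invalidation is delivered
 * for the original CPU page once the PTEs point at the device private page.
 */
static bool my_invalidate(struct mmu_interval_notifier *mni,
			  const struct mmu_notifier_range *range,
			  unsigned long cur_seq)
{
	mmu_interval_set_seq(mni, cur_seq);
	/* Revoke device access and unpin/release the page here. */
	return true;
}

static const struct mmu_interval_notifier_ops my_mni_ops = {
	.invalidate = my_invalidate,
};

static int my_register_range(struct mmu_interval_notifier *mni,
			     struct mm_struct *mm, unsigned long addr)
{
	return mmu_interval_notifier_insert(mni, mm, addr, PAGE_SIZE,
					    &my_mni_ops);
}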
Yes, I would like to avoid the long term pin constraints as well if possible; I just haven't found a solution yet. Are you suggesting it might be possible to add a callback in the page migration logic to specially deal with moving these pages?
Thanks, Alistair
On Tue, Feb 09, 2021 at 11:57:28PM +1100, Alistair Popple wrote:
The issue is you took the page out of the PTE it belongs to, which makes it orphaned and unlocatable by the rest of the mm?
Ideally this would leave the PTE in place so everything continues to work, just disable CPU access to it.
Maybe some kind of special swap entry?
I also don't much like the use of ZONE_DEVICE here, that should only be used for actual device memory, not as a temporary proxy for CPU pages.. Having two struct pages refer to the same physical memory is pretty ugly.
The fact the page is lost from the MM seems to be the main issue here.
How would migration even find the page?
Jason
On Tue, Feb 9, 2021 at 2:35 PM Jason Gunthorpe jgg@ziepe.ca wrote:
I probably should have read the patches in more detail; I was assuming the ZONE_DEVICE is only for vram. At least I thought the requirement for gpu atomics was that the page is in vram, but maybe I'm mixing up how this works on nvidia with how it works in other places. Iirc we had a long discussion about this at lpc19 that ended with the conclusion that we must be able to migrate, and sometimes migration is blocked. But the details elude me now.
Either way ZONE_DEVICE for not vram/device memory sounds wrong. Is that really going on here? -Daniel
On Tue, Feb 09, 2021 at 02:39:51PM +0100, Daniel Vetter wrote:
Either way ZONE_DEVICE for not vram/device memory sounds wrong. Is that really going on here?
My read was this was doing non-coherent atomics on CPU memory.
Atomics on GPU memory is just called migration to GPU memory, it doesn't need to be special for atomics. In that case it can free the CPU struct page completely as the data now lives in the ZONE_DEVICE page so no need for a pin, no problem with movable
Jason
On Tue, Feb 09, 2021 at 04:17:38PM -0500, Jerome Glisse wrote:
I mean find out that the page is now owned by the GPU driver to tell it that it needs to migrate that reference. Normally that would go through the VMA to the mmu notifier, but with the page removed from the normal VMA it can't get to a mmu notifier or the GPU driver.
Jason
On Tue, Feb 9, 2021 at 1:57 PM Alistair Popple apopple@nvidia.com wrote:
Long term pin vs short term page reference aren't fully fleshed out. But the rule more or less is:
- short term page reference: _must_ get released in finite time for migration and other things, either because you have a callback, or because it's just for direct I/O, which will complete. This means short term pins will delay migration, but not foul it completely.
- long term pin: the page cannot be moved, all migration must fail. Also this will have an impact on COW behaviour for fork (but not sure where those patches are, John Hubbard will know).
So I think for your use case here you want a) short term page reference to make sure it doesn't disappear plus b) callback to make sure migrate isn't blocked.
Breaking ZONE_MOVEABLE with either allowing long term pins or failing migrations because you don't release your short term page reference isn't good.
Yeah you need some other callback for migration on the page directly. It's a bit awkward since there is one already for struct address_space, but that's owned by the address_space/page cache, not HMM. So I think we need something else, maybe something for each ZONE_DEVICE?
s/possible/need to type some code to address it/ I think.
But also I'm not much of an expert on this, I've only just started learning how this all fits together coming from the gpu side. There's a _lot_ of finesse involved.
Cheers, Daniel
On 2/9/21 5:37 AM, Daniel Vetter wrote:
GPU atomic operations to sysmem are hard to categorize, because application programmers could easily write programs that do a long series of atomic operations. Such a program would be a little weird, but it's hard to rule out.
That would be Jason's commit 57efa1fe59576 ("mm/gup: prevent gup_fast from racing with COW during fork"), which is in linux-next 20201216.
This direction sounds at least...possible. Using MMU notifiers instead of pins is definitely appealing. I'm not quite clear on the callback idea above, but overall it seems like taking advantage of the ZONE_DEVICE tracking of pages (without having to put anything additional in each struct page) could work.
Additional notes or ideas here are definitely welcome.
thanks,
On Tue, Feb 09, 2021 at 12:53:27PM -0800, John Hubbard wrote:
Yeah, but we can forcefully break this whenever we feel like by revoking the page, moving it, and then reinstating the gpu pte again and letting it continue.
If that's not possible then what we need here instead is an mlock() type of thing, I think.
Nice, thanks for the pointer.
Well I don't have ideas for the details tbh; just from the little I learned about how this all fits together, pretending to be a pin while pretending to not be a pin just gets us back to the mess we're trying to solve with the gup vs pup cleanup. And given that the pin_user_pages rollout hasn't even completed yet it's maybe way too early to already toss it out again.
I think overall we should try really hard to not mix up things between memory that's pinned and memory that can be moved. Because retroactively trying to fix things up just because it was easier to get a feature going that way is very, very hard. And I think we've demonstrated countless times that "ah we just fix this with pinning, it doesn't happen often, no one will notice" always comes back to bite us later on :-)
Like just looking ahead, 1GB pages are a thing, we will have to support that, migrate_page is the only way to get there, and statistically just a 1/256th chance of encountering a pinned page guarantees we'll never assemble a giant page.
Cheers, Daniel
On Tue, Feb 09, 2021 at 12:53:27PM -0800, John Hubbard wrote:
It isn't the ZONE_DEVICE page that needs to be tracked.
Really what you want to do here is leave the CPU page in the VMA and the page tables where it started and deny CPU access to the page. Then all the proper machinery will continue to work.
IMHO "migration" is the wrong idea if the data isn't actually moving.
Jason