This is the first version of our HMM-based shared virtual memory manager for KFD. There are still a number of known issues that we're working through (see below). Resolving them will likely lead to some significant changes in MMU notifier handling and locking on the migration code paths, so don't get hung up on those details yet.
But I think this is a good time to start getting feedback. We're pretty confident about the ioctl API, which is both simple and extensible for the future (see patches 4 and 16). The user mode side of the API can be found here: https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wi...
I'd also like another pair of eyes on how we're interfacing with the GPU VM code in amdgpu_vm.c (patches 12 and 13), retry page fault handling (patches 24 and 25), and some retry IRQ handling changes (patch 32).
Known issues:
* won't work with IOMMU enabled, we need to dma_map all pages properly
* still working on some race conditions and random bugs
* performance is not great yet
Alex Sierra (12):
  drm/amdgpu: replace per_device_list by array
  drm/amdkfd: helper to convert gpu id and idx
  drm/amdkfd: add xnack enabled flag to kfd_process
  drm/amdkfd: add ioctl to configure and query xnack retries
  drm/amdkfd: invalidate tables on page retry fault
  drm/amdkfd: page table restore through svm API
  drm/amdkfd: SVM API call to restore page tables
  drm/amdkfd: add svm_bo reference for eviction fence
  drm/amdgpu: add param bit flag to create SVM BOs
  drm/amdkfd: add svm_bo eviction mechanism support
  drm/amdgpu: svm bo enable_signal call condition
  drm/amdgpu: add svm_bo eviction to enable_signal cb
Philip Yang (23):
  drm/amdkfd: select kernel DEVICE_PRIVATE option
  drm/amdkfd: add svm ioctl API
  drm/amdkfd: Add SVM API support capability bits
  drm/amdkfd: register svm range
  drm/amdkfd: add svm ioctl GET_ATTR op
  drm/amdgpu: add common HMM get pages function
  drm/amdkfd: validate svm range system memory
  drm/amdkfd: register overlap system memory range
  drm/amdkfd: deregister svm range
  drm/amdgpu: export vm update mapping interface
  drm/amdkfd: map svm range to GPUs
  drm/amdkfd: svm range eviction and restore
  drm/amdkfd: register HMM device private zone
  drm/amdkfd: validate vram svm range from TTM
  drm/amdkfd: support xgmi same hive mapping
  drm/amdkfd: copy memory through gart table
  drm/amdkfd: HMM migrate ram to vram
  drm/amdkfd: HMM migrate vram to ram
  drm/amdgpu: reserve fence slot to update page table
  drm/amdgpu: enable retry fault wptr overflow
  drm/amdkfd: refine migration policy with xnack on
  drm/amdkfd: add svm range validate timestamp
  drm/amdkfd: multiple gpu migrate vram to vram
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |    3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   83 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    5 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   47 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   10 +
 drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |   32 +-
 drivers/gpu/drm/amd/amdgpu/vega20_ih.c        |   32 +-
 drivers/gpu/drm/amd/amdkfd/Kconfig            |    1 +
 drivers/gpu/drm/amd/amdkfd/Makefile           |    4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  170 +-
 drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |    8 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  866 ++++++
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   59 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   52 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c      |  200 +-
 .../amd/amdkfd/kfd_process_queue_manager.c    |    6 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2564 +++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  135 +
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    1 +
 drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
 include/uapi/linux/kfd_ioctl.h                |  169 +-
 26 files changed, 4296 insertions(+), 291 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
From: Philip Yang <Philip.Yang@amd.com>
The DEVICE_PRIVATE kernel config option is required for HMM page migration, to register VRAM (GPU device memory) as DEVICE_PRIVATE zone memory. Enabling this option requires recompiling the kernel.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/Kconfig | 1 +
 1 file changed, 1 insertion(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/Kconfig b/drivers/gpu/drm/amd/amdkfd/Kconfig
index e8fb10c41f16..33f8efadc6f6 100644
--- a/drivers/gpu/drm/amd/amdkfd/Kconfig
+++ b/drivers/gpu/drm/amd/amdkfd/Kconfig
@@ -7,6 +7,7 @@ config HSA_AMD
 	bool "HSA kernel driver for AMD GPU devices"
 	depends on DRM_AMDGPU && (X86_64 || ARM64 || PPC64)
 	imply AMD_IOMMU_V2 if X86_64
+	select DEVICE_PRIVATE
 	select MMU_NOTIFIER
 	help
 	  Enable this if you want to use HSA features on AMD GPU devices.
From: Alex Sierra <alex.sierra@amd.com>
Remove per_device_list from kfd_process and replace it with an array of kfd_process_device pointers of MAX_GPU_INSTANCE size. This simplifies managing the kfd_process_devices bound to a specific kfd_process. The iterator functions used by kfd_chardev to walk the list are removed, since they are no longer needed; callers now use a local loop over the array instead.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      | 116 ++++++++---------
 drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |   8 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |  20 +--
 drivers/gpu/drm/amd/amdkfd/kfd_process.c      | 108 ++++++++--------
 .../amd/amdkfd/kfd_process_queue_manager.c    |   6 +-
 5 files changed, 111 insertions(+), 147 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index 8cc51cec988a..8c87afce12df 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -874,52 +874,47 @@ static int kfd_ioctl_get_process_apertures(struct file *filp, { struct kfd_ioctl_get_process_apertures_args *args = data; struct kfd_process_device_apertures *pAperture; - struct kfd_process_device *pdd; + int i;
dev_dbg(kfd_device, "get apertures for PASID 0x%x", p->pasid);
args->num_of_nodes = 0;
mutex_lock(&p->mutex); + /* Run over all pdd of the process */ + for (i = 0; i < p->n_pdds; i++) { + struct kfd_process_device *pdd = p->pdds[i]; + + pAperture = + &args->process_apertures[args->num_of_nodes]; + pAperture->gpu_id = pdd->dev->id; + pAperture->lds_base = pdd->lds_base; + pAperture->lds_limit = pdd->lds_limit; + pAperture->gpuvm_base = pdd->gpuvm_base; + pAperture->gpuvm_limit = pdd->gpuvm_limit; + pAperture->scratch_base = pdd->scratch_base; + pAperture->scratch_limit = pdd->scratch_limit;
- /*if the process-device list isn't empty*/ - if (kfd_has_process_device_data(p)) { - /* Run over all pdd of the process */ - pdd = kfd_get_first_process_device_data(p); - do { - pAperture = - &args->process_apertures[args->num_of_nodes]; - pAperture->gpu_id = pdd->dev->id; - pAperture->lds_base = pdd->lds_base; - pAperture->lds_limit = pdd->lds_limit; - pAperture->gpuvm_base = pdd->gpuvm_base; - pAperture->gpuvm_limit = pdd->gpuvm_limit; - pAperture->scratch_base = pdd->scratch_base; - pAperture->scratch_limit = pdd->scratch_limit; - - dev_dbg(kfd_device, - "node id %u\n", args->num_of_nodes); - dev_dbg(kfd_device, - "gpu id %u\n", pdd->dev->id); - dev_dbg(kfd_device, - "lds_base %llX\n", pdd->lds_base); - dev_dbg(kfd_device, - "lds_limit %llX\n", pdd->lds_limit); - dev_dbg(kfd_device, - "gpuvm_base %llX\n", pdd->gpuvm_base); - dev_dbg(kfd_device, - "gpuvm_limit %llX\n", pdd->gpuvm_limit); - dev_dbg(kfd_device, - "scratch_base %llX\n", pdd->scratch_base); - dev_dbg(kfd_device, - "scratch_limit %llX\n", pdd->scratch_limit); - - args->num_of_nodes++; - - pdd = kfd_get_next_process_device_data(p, pdd); - } while (pdd && (args->num_of_nodes < NUM_OF_SUPPORTED_GPUS)); - } + dev_dbg(kfd_device, + "node id %u\n", args->num_of_nodes); + dev_dbg(kfd_device, + "gpu id %u\n", pdd->dev->id); + dev_dbg(kfd_device, + "lds_base %llX\n", pdd->lds_base); + dev_dbg(kfd_device, + "lds_limit %llX\n", pdd->lds_limit); + dev_dbg(kfd_device, + "gpuvm_base %llX\n", pdd->gpuvm_base); + dev_dbg(kfd_device, + "gpuvm_limit %llX\n", pdd->gpuvm_limit); + dev_dbg(kfd_device, + "scratch_base %llX\n", pdd->scratch_base); + dev_dbg(kfd_device, + "scratch_limit %llX\n", pdd->scratch_limit);
+ if (++args->num_of_nodes >= NUM_OF_SUPPORTED_GPUS) + break; + } mutex_unlock(&p->mutex);
return 0; @@ -930,9 +925,8 @@ static int kfd_ioctl_get_process_apertures_new(struct file *filp, { struct kfd_ioctl_get_process_apertures_new_args *args = data; struct kfd_process_device_apertures *pa; - struct kfd_process_device *pdd; - uint32_t nodes = 0; int ret; + int i;
dev_dbg(kfd_device, "get apertures for PASID 0x%x", p->pasid);
@@ -941,17 +935,7 @@ static int kfd_ioctl_get_process_apertures_new(struct file *filp, * sufficient memory */ mutex_lock(&p->mutex); - - if (!kfd_has_process_device_data(p)) - goto out_unlock; - - /* Run over all pdd of the process */ - pdd = kfd_get_first_process_device_data(p); - do { - args->num_of_nodes++; - pdd = kfd_get_next_process_device_data(p, pdd); - } while (pdd); - + args->num_of_nodes = p->n_pdds; goto out_unlock; }
@@ -966,22 +950,23 @@ static int kfd_ioctl_get_process_apertures_new(struct file *filp,
mutex_lock(&p->mutex);
- if (!kfd_has_process_device_data(p)) { + if (!p->n_pdds) { args->num_of_nodes = 0; kfree(pa); goto out_unlock; }
/* Run over all pdd of the process */ - pdd = kfd_get_first_process_device_data(p); - do { - pa[nodes].gpu_id = pdd->dev->id; - pa[nodes].lds_base = pdd->lds_base; - pa[nodes].lds_limit = pdd->lds_limit; - pa[nodes].gpuvm_base = pdd->gpuvm_base; - pa[nodes].gpuvm_limit = pdd->gpuvm_limit; - pa[nodes].scratch_base = pdd->scratch_base; - pa[nodes].scratch_limit = pdd->scratch_limit; + for (i = 0; i < min(p->n_pdds, args->num_of_nodes); i++) { + struct kfd_process_device *pdd = p->pdds[i]; + + pa[i].gpu_id = pdd->dev->id; + pa[i].lds_base = pdd->lds_base; + pa[i].lds_limit = pdd->lds_limit; + pa[i].gpuvm_base = pdd->gpuvm_base; + pa[i].gpuvm_limit = pdd->gpuvm_limit; + pa[i].scratch_base = pdd->scratch_base; + pa[i].scratch_limit = pdd->scratch_limit;
dev_dbg(kfd_device, "gpu id %u\n", pdd->dev->id); @@ -997,17 +982,14 @@ static int kfd_ioctl_get_process_apertures_new(struct file *filp, "scratch_base %llX\n", pdd->scratch_base); dev_dbg(kfd_device, "scratch_limit %llX\n", pdd->scratch_limit); - nodes++; - - pdd = kfd_get_next_process_device_data(p, pdd); - } while (pdd && (nodes < args->num_of_nodes)); + } mutex_unlock(&p->mutex);
- args->num_of_nodes = nodes; + args->num_of_nodes = i; ret = copy_to_user( (void __user *)args->kfd_process_device_apertures_ptr, pa, - (nodes * sizeof(struct kfd_process_device_apertures))); + (i * sizeof(struct kfd_process_device_apertures))); kfree(pa); return ret ? -EFAULT : 0;
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
index 5a64915abaf7..1a266b78f0d8 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_iommu.c
@@ -131,11 +131,11 @@ int kfd_iommu_bind_process_to_device(struct kfd_process_device *pdd)
  */
 void kfd_iommu_unbind_process(struct kfd_process *p)
 {
-	struct kfd_process_device *pdd;
+	int i;
 
-	list_for_each_entry(pdd, &p->per_device_data, per_device_list)
-		if (pdd->bound == PDD_BOUND)
-			amd_iommu_unbind_pasid(pdd->dev->pdev, p->pasid);
+	for (i = 0; i < p->n_pdds; i++)
+		if (p->pdds[i]->bound == PDD_BOUND)
+			amd_iommu_unbind_pasid(p->pdds[i]->dev->pdev, p->pasid);
 }
/* Callback for process shutdown invoked by the IOMMU driver */ diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index e2ebd5a1d4de..d9f8d3d48aac 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -45,6 +45,7 @@ #include <linux/swap.h>
#include "amd_shared.h" +#include "amdgpu.h"
#define KFD_MAX_RING_ENTRY_SIZE 8
@@ -644,12 +645,6 @@ enum kfd_pdd_bound {
/* Data that is per-process-per device. */ struct kfd_process_device { - /* - * List of all per-device data for a process. - * Starts from kfd_process.per_device_data. - */ - struct list_head per_device_list; - /* The device that owns this data. */ struct kfd_dev *dev;
@@ -766,10 +761,11 @@ struct kfd_process { uint16_t pasid;
/* - * List of kfd_process_device structures, + * Array of kfd_process_device pointers, * one for each device the process is using. */ - struct list_head per_device_data; + struct kfd_process_device *pdds[MAX_GPU_INSTANCE]; + uint32_t n_pdds;
struct process_queue_manager pqm;
@@ -867,14 +863,6 @@ void *kfd_process_device_translate_handle(struct kfd_process_device *p, void kfd_process_device_remove_obj_handle(struct kfd_process_device *pdd, int handle);
-/* Process device data iterator */ -struct kfd_process_device *kfd_get_first_process_device_data( - struct kfd_process *p); -struct kfd_process_device *kfd_get_next_process_device_data( - struct kfd_process *p, - struct kfd_process_device *pdd); -bool kfd_has_process_device_data(struct kfd_process *p); - /* PASIDs */ int kfd_pasid_init(void); void kfd_pasid_exit(void); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c index 2807e1c4d59b..031e752e3154 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c @@ -505,7 +505,7 @@ static int kfd_sysfs_create_file(struct kfd_process *p, struct attribute *attr, static int kfd_procfs_add_sysfs_stats(struct kfd_process *p) { int ret = 0; - struct kfd_process_device *pdd; + int i; char stats_dir_filename[MAX_SYSFS_FILENAME_LEN];
if (!p) @@ -520,7 +520,8 @@ static int kfd_procfs_add_sysfs_stats(struct kfd_process *p) * - proc/<pid>/stats_<gpuid>/evicted_ms * - proc/<pid>/stats_<gpuid>/cu_occupancy */ - list_for_each_entry(pdd, &p->per_device_data, per_device_list) { + for (i = 0; i < p->n_pdds; i++) { + struct kfd_process_device *pdd = p->pdds[i]; struct kobject *kobj_stats;
snprintf(stats_dir_filename, MAX_SYSFS_FILENAME_LEN, @@ -571,7 +572,7 @@ static int kfd_procfs_add_sysfs_stats(struct kfd_process *p) static int kfd_procfs_add_sysfs_files(struct kfd_process *p) { int ret = 0; - struct kfd_process_device *pdd; + int i;
if (!p) return -EINVAL; @@ -584,7 +585,9 @@ static int kfd_procfs_add_sysfs_files(struct kfd_process *p) * - proc/<pid>/vram_<gpuid> * - proc/<pid>/sdma_<gpuid> */ - list_for_each_entry(pdd, &p->per_device_data, per_device_list) { + for (i = 0; i < p->n_pdds; i++) { + struct kfd_process_device *pdd = p->pdds[i]; + snprintf(pdd->vram_filename, MAX_SYSFS_FILENAME_LEN, "vram_%u", pdd->dev->id); ret = kfd_sysfs_create_file(p, &pdd->attr_vram, pdd->vram_filename); @@ -875,21 +878,23 @@ void kfd_unref_process(struct kfd_process *p) kref_put(&p->ref, kfd_process_ref_release); }
+ static void kfd_process_device_free_bos(struct kfd_process_device *pdd) { struct kfd_process *p = pdd->process; void *mem; int id; + int i;
/* * Remove all handles from idr and release appropriate * local memory object */ idr_for_each_entry(&pdd->alloc_idr, mem, id) { - struct kfd_process_device *peer_pdd;
- list_for_each_entry(peer_pdd, &p->per_device_data, - per_device_list) { + for (i = 0; i < p->n_pdds; i++) { + struct kfd_process_device *peer_pdd = p->pdds[i]; + if (!peer_pdd->vm) continue; amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu( @@ -903,18 +908,19 @@ static void kfd_process_device_free_bos(struct kfd_process_device *pdd)
static void kfd_process_free_outstanding_kfd_bos(struct kfd_process *p) { - struct kfd_process_device *pdd; + int i;
- list_for_each_entry(pdd, &p->per_device_data, per_device_list) - kfd_process_device_free_bos(pdd); + for (i = 0; i < p->n_pdds; i++) + kfd_process_device_free_bos(p->pdds[i]); }
static void kfd_process_destroy_pdds(struct kfd_process *p) { - struct kfd_process_device *pdd, *temp; + int i; + + for (i = 0; i < p->n_pdds; i++) { + struct kfd_process_device *pdd = p->pdds[i];
- list_for_each_entry_safe(pdd, temp, &p->per_device_data, - per_device_list) { pr_debug("Releasing pdd (topology id %d) for process (pasid 0x%x)\n", pdd->dev->id, p->pasid);
@@ -927,8 +933,6 @@ static void kfd_process_destroy_pdds(struct kfd_process *p) amdgpu_amdkfd_gpuvm_destroy_process_vm( pdd->dev->kgd, pdd->vm);
- list_del(&pdd->per_device_list); - if (pdd->qpd.cwsr_kaddr && !pdd->qpd.cwsr_base) free_pages((unsigned long)pdd->qpd.cwsr_kaddr, get_order(KFD_CWSR_TBA_TMA_SIZE)); @@ -949,7 +953,9 @@ static void kfd_process_destroy_pdds(struct kfd_process *p) }
kfree(pdd); + p->pdds[i] = NULL; } + p->n_pdds = 0; }
/* No process locking is needed in this function, because the process @@ -961,7 +967,7 @@ static void kfd_process_wq_release(struct work_struct *work) { struct kfd_process *p = container_of(work, struct kfd_process, release_work); - struct kfd_process_device *pdd; + int i;
/* Remove the procfs files */ if (p->kobj) { @@ -970,7 +976,9 @@ static void kfd_process_wq_release(struct work_struct *work) kobject_put(p->kobj_queues); p->kobj_queues = NULL;
- list_for_each_entry(pdd, &p->per_device_data, per_device_list) { + for (i = 0; i < p->n_pdds; i++) { + struct kfd_process_device *pdd = p->pdds[i]; + sysfs_remove_file(p->kobj, &pdd->attr_vram); sysfs_remove_file(p->kobj, &pdd->attr_sdma); sysfs_remove_file(p->kobj, &pdd->attr_evict); @@ -1020,7 +1028,7 @@ static void kfd_process_notifier_release(struct mmu_notifier *mn, struct mm_struct *mm) { struct kfd_process *p; - struct kfd_process_device *pdd = NULL; + int i;
/* * The kfd_process structure can not be free because the @@ -1044,8 +1052,8 @@ static void kfd_process_notifier_release(struct mmu_notifier *mn, * pdd is in debug mode, we should first force unregistration, * then we will be able to destroy the queues */ - list_for_each_entry(pdd, &p->per_device_data, per_device_list) { - struct kfd_dev *dev = pdd->dev; + for (i = 0; i < p->n_pdds; i++) { + struct kfd_dev *dev = p->pdds[i]->dev;
mutex_lock(kfd_get_dbgmgr_mutex()); if (dev && dev->dbgmgr && dev->dbgmgr->pasid == p->pasid) { @@ -1081,11 +1089,11 @@ static const struct mmu_notifier_ops kfd_process_mmu_notifier_ops = { static int kfd_process_init_cwsr_apu(struct kfd_process *p, struct file *filep) { unsigned long offset; - struct kfd_process_device *pdd; + int i;
- list_for_each_entry(pdd, &p->per_device_data, per_device_list) { - struct kfd_dev *dev = pdd->dev; - struct qcm_process_device *qpd = &pdd->qpd; + for (i = 0; i < p->n_pdds; i++) { + struct kfd_dev *dev = p->pdds[i]->dev; + struct qcm_process_device *qpd = &p->pdds[i]->qpd;
if (!dev->cwsr_enabled || qpd->cwsr_kaddr || qpd->cwsr_base) continue; @@ -1162,7 +1170,7 @@ static struct kfd_process *create_process(const struct task_struct *thread) mutex_init(&process->mutex); process->mm = thread->mm; process->lead_thread = thread->group_leader; - INIT_LIST_HEAD(&process->per_device_data); + process->n_pdds = 0; INIT_DELAYED_WORK(&process->eviction_work, evict_process_worker); INIT_DELAYED_WORK(&process->restore_work, restore_process_worker); process->last_restore_timestamp = get_jiffies_64(); @@ -1244,11 +1252,11 @@ static int init_doorbell_bitmap(struct qcm_process_device *qpd, struct kfd_process_device *kfd_get_process_device_data(struct kfd_dev *dev, struct kfd_process *p) { - struct kfd_process_device *pdd = NULL; + int i;
- list_for_each_entry(pdd, &p->per_device_data, per_device_list) - if (pdd->dev == dev) - return pdd; + for (i = 0; i < p->n_pdds; i++) + if (p->pdds[i]->dev == dev) + return p->pdds[i];
return NULL; } @@ -1258,6 +1266,8 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev, { struct kfd_process_device *pdd = NULL;
+ if (WARN_ON_ONCE(p->n_pdds >= MAX_GPU_INSTANCE)) + return NULL; pdd = kzalloc(sizeof(*pdd), GFP_KERNEL); if (!pdd) return NULL; @@ -1286,7 +1296,7 @@ struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev, pdd->vram_usage = 0; pdd->sdma_past_activity_counter = 0; atomic64_set(&pdd->evict_duration_counter, 0); - list_add(&pdd->per_device_list, &p->per_device_data); + p->pdds[p->n_pdds++] = pdd;
/* Init idr used for memory handle translation */ idr_init(&pdd->alloc_idr); @@ -1418,28 +1428,6 @@ struct kfd_process_device *kfd_bind_process_to_device(struct kfd_dev *dev, return ERR_PTR(err); }
-struct kfd_process_device *kfd_get_first_process_device_data( - struct kfd_process *p) -{ - return list_first_entry(&p->per_device_data, - struct kfd_process_device, - per_device_list); -} - -struct kfd_process_device *kfd_get_next_process_device_data( - struct kfd_process *p, - struct kfd_process_device *pdd) -{ - if (list_is_last(&pdd->per_device_list, &p->per_device_data)) - return NULL; - return list_next_entry(pdd, per_device_list); -} - -bool kfd_has_process_device_data(struct kfd_process *p) -{ - return !(list_empty(&p->per_device_data)); -} - /* Create specific handle mapped to mem from process local memory idr * Assumes that the process lock is held. */ @@ -1515,11 +1503,13 @@ struct kfd_process *kfd_lookup_process_by_mm(const struct mm_struct *mm) */ int kfd_process_evict_queues(struct kfd_process *p) { - struct kfd_process_device *pdd; int r = 0; + int i; unsigned int n_evicted = 0;
- list_for_each_entry(pdd, &p->per_device_data, per_device_list) { + for (i = 0; i < p->n_pdds; i++) { + struct kfd_process_device *pdd = p->pdds[i]; + r = pdd->dev->dqm->ops.evict_process_queues(pdd->dev->dqm, &pdd->qpd); if (r) { @@ -1535,7 +1525,9 @@ int kfd_process_evict_queues(struct kfd_process *p) /* To keep state consistent, roll back partial eviction by * restoring queues */ - list_for_each_entry(pdd, &p->per_device_data, per_device_list) { + for (i = 0; i < p->n_pdds; i++) { + struct kfd_process_device *pdd = p->pdds[i]; + if (n_evicted == 0) break; if (pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm, @@ -1551,10 +1543,12 @@ int kfd_process_evict_queues(struct kfd_process *p) /* kfd_process_restore_queues - Restore all user queues of a process */ int kfd_process_restore_queues(struct kfd_process *p) { - struct kfd_process_device *pdd; int r, ret = 0; + int i; + + for (i = 0; i < p->n_pdds; i++) { + struct kfd_process_device *pdd = p->pdds[i];
- list_for_each_entry(pdd, &p->per_device_data, per_device_list) { r = pdd->dev->dqm->ops.restore_process_queues(pdd->dev->dqm, &pdd->qpd); if (r) { diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c index eb1635ac8988..95a6c36cea4c 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process_queue_manager.c @@ -126,10 +126,10 @@ int pqm_set_gws(struct process_queue_manager *pqm, unsigned int qid,
 void kfd_process_dequeue_from_all_devices(struct kfd_process *p)
 {
-	struct kfd_process_device *pdd;
+	int i;
 
-	list_for_each_entry(pdd, &p->per_device_data, per_device_list)
-		kfd_process_dequeue_from_device(pdd);
+	for (i = 0; i < p->n_pdds; i++)
+		kfd_process_dequeue_from_device(p->pdds[i]);
 }
int pqm_init(struct process_queue_manager *pqm, struct kfd_process *p)
From: Alex Sierra <alex.sierra@amd.com>
The svm range uses a GPU bitmap to store which GPUs the range maps to. Applications pass the driver gpu id to specify a GPU, so helpers are needed to convert between a gpu id and a gpu bitmap index.
The helpers access devices through the kfd_process_device pointers array in kfd_process.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h    |  5 ++++
 drivers/gpu/drm/amd/amdkfd/kfd_process.c | 30 ++++++++++++++++++++++++
 2 files changed, 35 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index d9f8d3d48aac..4ef8804adcf5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -837,6 +837,11 @@ struct kfd_process *kfd_create_process(struct file *filep);
 struct kfd_process *kfd_get_process(const struct task_struct *);
 struct kfd_process *kfd_lookup_process_by_pasid(unsigned int pasid);
 struct kfd_process *kfd_lookup_process_by_mm(const struct mm_struct *mm);
+int kfd_process_gpuid_from_gpuidx(struct kfd_process *p,
+			uint32_t gpu_idx, uint32_t *gpuid);
+int kfd_process_gpuidx_from_gpuid(struct kfd_process *p, uint32_t gpu_id);
+int kfd_process_device_from_gpuidx(struct kfd_process *p,
+			uint32_t gpu_idx, struct kfd_dev **gpu);
 void kfd_unref_process(struct kfd_process *p);
 int kfd_process_evict_queues(struct kfd_process *p);
 int kfd_process_restore_queues(struct kfd_process *p);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
index 031e752e3154..7396f3a6d0ee 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c
@@ -1561,6 +1561,36 @@ int kfd_process_restore_queues(struct kfd_process *p)
 	return ret;
 }
 
+int kfd_process_gpuid_from_gpuidx(struct kfd_process *p,
+		uint32_t gpu_idx, uint32_t *gpuid)
+{
+	if (gpu_idx < p->n_pdds) {
+		*gpuid = p->pdds[gpu_idx]->dev->id;
+		return 0;
+	}
+	return -EINVAL;
+}
+
+int kfd_process_gpuidx_from_gpuid(struct kfd_process *p, uint32_t gpu_id)
+{
+	int i;
+
+	for (i = 0; i < p->n_pdds; i++)
+		if (p->pdds[i] && gpu_id == p->pdds[i]->dev->id)
+			return i;
+	return -EINVAL;
+}
+
+int kfd_process_device_from_gpuidx(struct kfd_process *p,
+		uint32_t gpu_idx, struct kfd_dev **gpu)
+{
+	if (gpu_idx < p->n_pdds) {
+		*gpu = p->pdds[gpu_idx]->dev;
+		return 0;
+	}
+	return -EINVAL;
+}
+
 static void evict_process_worker(struct work_struct *work)
 {
 	int ret;
From: Philip Yang <Philip.Yang@amd.com>
Add svm (shared virtual memory) ioctl data structure and API definition.
The svm ioctl API is designed to be extensible in the future. All operations are provided by a single IOCTL to preserve ioctl number space. The arguments structure ends with a variable size array of attributes that can be used to set or get one or multiple attributes.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c |   7 ++
 include/uapi/linux/kfd_ioctl.h           | 128 ++++++++++++++++++++++-
 2 files changed, 133 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 8c87afce12df..c5288a6e45b9 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1746,6 +1746,11 @@ static int kfd_ioctl_smi_events(struct file *filep,
 	return kfd_smi_event_open(dev, &args->anon_fd);
 }
 
+static int kfd_ioctl_svm(struct file *filep, struct kfd_process *p, void *data)
+{
+	return -EINVAL;
+}
+
 #define AMDKFD_IOCTL_DEF(ioctl, _func, _flags) \
 	[_IOC_NR(ioctl)] = {.cmd = ioctl, .func = _func, .flags = _flags, \
			    .cmd_drv = 0, .name = #ioctl}
@@ -1844,6 +1849,8 @@ static const struct amdkfd_ioctl_desc amdkfd_ioctls[] = {
 
 	AMDKFD_IOCTL_DEF(AMDKFD_IOC_SMI_EVENTS,
			kfd_ioctl_smi_events, 0),
+
+	AMDKFD_IOCTL_DEF(AMDKFD_IOC_SVM, kfd_ioctl_svm, 0),
 };
 
 #define AMDKFD_CORE_IOCTL_COUNT	ARRAY_SIZE(amdkfd_ioctls)
diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h
index 695b606da4b1..5d4a4b3e0b61 100644
--- a/include/uapi/linux/kfd_ioctl.h
+++ b/include/uapi/linux/kfd_ioctl.h
@@ -29,9 +29,10 @@
 /*
  * - 1.1 - initial version
  * - 1.3 - Add SMI events support
+ * - 1.4 - Add SVM API
  */
 #define KFD_IOCTL_MAJOR_VERSION 1
-#define KFD_IOCTL_MINOR_VERSION 3
+#define KFD_IOCTL_MINOR_VERSION 4
 
 struct kfd_ioctl_get_version_args {
 	__u32 major_version;	/* from KFD */
@@ -471,6 +472,127 @@ enum kfd_mmio_remap {
 	KFD_MMIO_REMAP_HDP_REG_FLUSH_CNTL = 4,
 };
 
+/* Guarantee host access to memory */
+#define KFD_IOCTL_SVM_FLAG_HOST_ACCESS 0x00000001
+/* Fine grained coherency between all devices with access */
+#define KFD_IOCTL_SVM_FLAG_COHERENT 0x00000002
+/* Use any GPU in same hive as preferred device */
+#define KFD_IOCTL_SVM_FLAG_HIVE_LOCAL 0x00000004
+/* GPUs only read, allows replication */
+#define KFD_IOCTL_SVM_FLAG_GPU_RO 0x00000008
+/* Allow execution on GPU */
+#define KFD_IOCTL_SVM_FLAG_GPU_EXEC 0x00000010
+
+/**
+ * kfd_ioctl_svm_op - SVM ioctl operations
+ *
+ * @KFD_IOCTL_SVM_OP_SET_ATTR: Modify one or more attributes
+ * @KFD_IOCTL_SVM_OP_GET_ATTR: Query one or more attributes
+ */
+enum kfd_ioctl_svm_op {
+	KFD_IOCTL_SVM_OP_SET_ATTR,
+	KFD_IOCTL_SVM_OP_GET_ATTR
+};
+
+/** kfd_ioctl_svm_location - Enum for preferred and prefetch locations
+ *
+ * GPU IDs are used to specify GPUs as preferred and prefetch locations.
+ * Below definitions are used for system memory or for leaving the preferred
+ * location unspecified.
+ */
+enum kfd_ioctl_svm_location {
+	KFD_IOCTL_SVM_LOCATION_SYSMEM = 0,
+	KFD_IOCTL_SVM_LOCATION_UNDEFINED = 0xffffffff
+};
+
+/**
+ * kfd_ioctl_svm_attr_type - SVM attribute types
+ *
+ * @KFD_IOCTL_SVM_ATTR_PREFERRED_LOC: gpuid of the preferred location, 0 for
+ *                                    system memory
+ * @KFD_IOCTL_SVM_ATTR_PREFETCH_LOC: gpuid of the prefetch location, 0 for
+ *                                   system memory. Setting this triggers an
+ *                                   immediate prefetch (migration).
+ * @KFD_IOCTL_SVM_ATTR_ACCESS:
+ * @KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE:
+ * @KFD_IOCTL_SVM_ATTR_NO_ACCESS: specify memory access for the gpuid given
+ *                                by the attribute value
+ * @KFD_IOCTL_SVM_ATTR_SET_FLAGS: bitmask of flags to set (see
+ *                                KFD_IOCTL_SVM_FLAG_...)
+ * @KFD_IOCTL_SVM_ATTR_CLR_FLAGS: bitmask of flags to clear
+ * @KFD_IOCTL_SVM_ATTR_GRANULARITY: migration granularity
+ *                                  (log2 num pages)
+ */
+enum kfd_ioctl_svm_attr_type {
+	KFD_IOCTL_SVM_ATTR_PREFERRED_LOC,
+	KFD_IOCTL_SVM_ATTR_PREFETCH_LOC,
+	KFD_IOCTL_SVM_ATTR_ACCESS,
+	KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE,
+	KFD_IOCTL_SVM_ATTR_NO_ACCESS,
+	KFD_IOCTL_SVM_ATTR_SET_FLAGS,
+	KFD_IOCTL_SVM_ATTR_CLR_FLAGS,
+	KFD_IOCTL_SVM_ATTR_GRANULARITY
+};
+
+/**
+ * kfd_ioctl_svm_attribute - Attributes as pairs of type and value
+ *
+ * The meaning of the @value depends on the attribute type.
+ *
+ * @type: attribute type (see enum @kfd_ioctl_svm_attr_type)
+ * @value: attribute value
+ */
+struct kfd_ioctl_svm_attribute {
+	__u32 type;
+	__u32 value;
+};
+
+/**
+ * kfd_ioctl_svm_args - Arguments for SVM ioctl
+ *
+ * @op specifies the operation to perform (see enum
+ * @kfd_ioctl_svm_op).  @start_addr and @size are common for all
+ * operations.
+ *
+ * A variable number of attributes can be given in @attrs.
+ * @nattr specifies the number of attributes. New attributes can be
+ * added in the future without breaking the ABI. If unknown attributes
+ * are given, the function returns -EINVAL.
+ *
+ * @KFD_IOCTL_SVM_OP_SET_ATTR sets attributes for a virtual address
+ * range. It may overlap existing virtual address ranges. If it does,
+ * the existing ranges will be split such that the attribute changes
+ * only apply to the specified address range.
+ *
+ * @KFD_IOCTL_SVM_OP_GET_ATTR returns the intersection of attributes
+ * over all memory in the given range and returns the result as the
+ * attribute value. If different pages have different preferred or
+ * prefetch locations, 0xffffffff will be returned for
+ * @KFD_IOCTL_SVM_ATTR_PREFERRED_LOC or
+ * @KFD_IOCTL_SVM_ATTR_PREFETCH_LOC respectively. For
+ * @KFD_IOCTL_SVM_ATTR_SET_FLAGS, flags of all pages will be
+ * aggregated by bitwise AND. The minimum migration granularity
+ * throughout the range will be returned for
+ * @KFD_IOCTL_SVM_ATTR_GRANULARITY.
+ *
+ * Querying of accessibility attributes works by initializing the
+ * attribute type to @KFD_IOCTL_SVM_ATTR_ACCESS and the value to the
+ * GPUID being queried. Multiple attributes can be given to allow
+ * querying multiple GPUIDs. The ioctl function overwrites the
+ * attribute type to indicate the access for the specified GPU.
+ *
+ * @KFD_IOCTL_SVM_ATTR_CLR_FLAGS is invalid for
+ * @KFD_IOCTL_SVM_OP_GET_ATTR.
+ */
+struct kfd_ioctl_svm_args {
+	__u64 start_addr;
+	__u64 size;
+	__u32 op;
+	__u32 nattr;
+	/* Variable length array of attributes */
+	struct kfd_ioctl_svm_attribute attrs[0];
+};
+
 #define AMDKFD_IOCTL_BASE 'K'
 #define AMDKFD_IO(nr)			_IO(AMDKFD_IOCTL_BASE, nr)
 #define AMDKFD_IOR(nr, type)		_IOR(AMDKFD_IOCTL_BASE, nr, type)
@@ -571,7 +693,9 @@ enum kfd_mmio_remap {
 #define AMDKFD_IOC_SMI_EVENTS \
		AMDKFD_IOWR(0x1F, struct kfd_ioctl_smi_events_args)
 
+#define AMDKFD_IOC_SVM	AMDKFD_IOWR(0x20, struct kfd_ioctl_svm_args)
+
 #define AMDKFD_COMMAND_START		0x01
-#define AMDKFD_COMMAND_END		0x20
+#define AMDKFD_COMMAND_END		0x21
#endif
From: Philip Yang <Philip.Yang@amd.com>
SVMAPISupported property added to HSA_CAPABILITY; the value matches HSA_CAPABILITY defined in the Thunk spec:
SVMAPISupported: the SVM API will not be supported on older kernels that don't have HMM, or on GFXv8 or older GPUs without support for 48-bit virtual addresses.
CoherentHostAccess property added to HSA_MEMORYPROPERTY; the value matches HSA_MEMORYPROPERTY defined in the Thunk spec:
CoherentHostAccess: whether or not device memory can be coherently accessed by the host CPU.
Signed-off-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> --- drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 1 + drivers/gpu/drm/amd/amdkfd/kfd_topology.h | 10 ++++++---- 2 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c index a3fc23873819..885b8a071717 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c @@ -1380,6 +1380,7 @@ int kfd_topology_add_device(struct kfd_dev *gpu) dev->node_props.capability |= ((HSA_CAP_DOORBELL_TYPE_2_0 << HSA_CAP_DOORBELL_TYPE_TOTALBITS_SHIFT) & HSA_CAP_DOORBELL_TYPE_TOTALBITS_MASK); + dev->node_props.capability |= HSA_CAP_SVMAPI_SUPPORTED; break; default: WARN(1, "Unexpected ASIC family %u", diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h index 326d9b26b7aa..7c5ea9b4b9d9 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_topology.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_topology.h @@ -52,8 +52,9 @@ #define HSA_CAP_RASEVENTNOTIFY 0x00200000 #define HSA_CAP_ASIC_REVISION_MASK 0x03c00000 #define HSA_CAP_ASIC_REVISION_SHIFT 22 +#define HSA_CAP_SVMAPI_SUPPORTED 0x04000000
-#define HSA_CAP_RESERVED 0xfc078000 +#define HSA_CAP_RESERVED 0xf8078000
struct kfd_node_properties { uint64_t hive_id; @@ -98,9 +99,10 @@ struct kfd_node_properties { #define HSA_MEM_HEAP_TYPE_GPU_LDS 4 #define HSA_MEM_HEAP_TYPE_GPU_SCRATCH 5
-#define HSA_MEM_FLAGS_HOT_PLUGGABLE 0x00000001 -#define HSA_MEM_FLAGS_NON_VOLATILE 0x00000002 -#define HSA_MEM_FLAGS_RESERVED 0xfffffffc +#define HSA_MEM_FLAGS_HOT_PLUGGABLE 0x00000001 +#define HSA_MEM_FLAGS_NON_VOLATILE 0x00000002 +#define HSA_MEM_FLAGS_COHERENTHOSTACCESS 0x00000004 +#define HSA_MEM_FLAGS_RESERVED 0xfffffff8
struct kfd_mem_properties { struct list_head list;
From: Philip Yang <Philip.Yang@amd.com>
The svm range structure stores the range start address, size, attributes, flags, prefetch location and a GPU bitmap indicating which GPUs this range maps to. The same virtual address is shared by the CPU and GPUs.
Each process has an svm range list, which uses both an interval tree and a linked list to store all svm ranges registered by the process. The interval tree is used by the GPU VM fault handler and the CPU page fault handler to look up the svm range structure for a given address. The list is used to scan all ranges in the eviction restore work.
Apply the preferred location, prefetch location, mapping flags and migration granularity attributes to the svm range, and store the mapping GPU indices in bitmaps.
Signed-off-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Sierra <alex.sierra@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> --- drivers/gpu/drm/amd/amdkfd/Makefile | 3 +- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 21 +- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 14 + drivers/gpu/drm/amd/amdkfd/kfd_process.c | 9 + drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 603 +++++++++++++++++++++++ drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 93 ++++ 6 files changed, 741 insertions(+), 2 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
diff --git a/drivers/gpu/drm/amd/amdkfd/Makefile b/drivers/gpu/drm/amd/amdkfd/Makefile index e1e4115dcf78..387ce0217d35 100644 --- a/drivers/gpu/drm/amd/amdkfd/Makefile +++ b/drivers/gpu/drm/amd/amdkfd/Makefile @@ -54,7 +54,8 @@ AMDKFD_FILES := $(AMDKFD_PATH)/kfd_module.o \ $(AMDKFD_PATH)/kfd_dbgdev.o \ $(AMDKFD_PATH)/kfd_dbgmgr.o \ $(AMDKFD_PATH)/kfd_smi_events.o \ - $(AMDKFD_PATH)/kfd_crat.o + $(AMDKFD_PATH)/kfd_crat.o \ + $(AMDKFD_PATH)/kfd_svm.o
ifneq ($(CONFIG_AMD_IOMMU_V2),) AMDKFD_FILES += $(AMDKFD_PATH)/kfd_iommu.o diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index c5288a6e45b9..2d3ba7e806d5 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -38,6 +38,7 @@ #include "kfd_priv.h" #include "kfd_device_queue_manager.h" #include "kfd_dbgmgr.h" +#include "kfd_svm.h" #include "amdgpu_amdkfd.h" #include "kfd_smi_events.h"
@@ -1748,7 +1749,25 @@ static int kfd_ioctl_smi_events(struct file *filep,
static int kfd_ioctl_svm(struct file *filep, struct kfd_process *p, void *data) { - return -EINVAL; + struct kfd_ioctl_svm_args *args = data; + int r = 0; + + pr_debug("start 0x%llx size 0x%llx op 0x%x nattr 0x%x\n", + args->start_addr, args->size, args->op, args->nattr); + + if ((args->start_addr & ~PAGE_MASK) || (args->size & ~PAGE_MASK)) + return -EINVAL; + if (!args->start_addr || !args->size) + return -EINVAL; + + mutex_lock(&p->mutex); + + r = svm_ioctl(p, args->op, args->start_addr, args->size, args->nattr, + args->attrs); + + mutex_unlock(&p->mutex); + + return r; }
#define AMDKFD_IOCTL_DEF(ioctl, _func, _flags) \ diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index 4ef8804adcf5..cbb2bae1982d 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -726,6 +726,17 @@ struct kfd_process_device {
#define qpd_to_pdd(x) container_of(x, struct kfd_process_device, qpd)
+struct svm_range_list { + struct mutex lock; /* use svms_lock/unlock(svms) */ + unsigned int saved_flags; + struct rb_root_cached objects; + struct list_head list; + struct srcu_struct srcu; + struct work_struct srcu_free_work; + struct list_head free_list; + struct mutex free_list_lock; +}; + /* Process data */ struct kfd_process { /* @@ -804,6 +815,9 @@ struct kfd_process { struct kobject *kobj; struct kobject *kobj_queues; struct attribute attr_pasid; + + /* shared virtual memory registered by this process */ + struct svm_range_list svms; };
#define KFD_PROCESS_TABLE_SIZE 5 /* bits: 32 entries */ diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c index 7396f3a6d0ee..791f17308b1b 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c @@ -35,6 +35,7 @@ #include <linux/pm_runtime.h> #include "amdgpu_amdkfd.h" #include "amdgpu.h" +#include "kfd_svm.h"
struct mm_struct;
@@ -42,6 +43,7 @@ struct mm_struct; #include "kfd_device_queue_manager.h" #include "kfd_dbgmgr.h" #include "kfd_iommu.h" +#include "kfd_svm.h"
/* * List of struct kfd_process (field kfd_process). @@ -997,6 +999,7 @@ static void kfd_process_wq_release(struct work_struct *work) kfd_iommu_unbind_process(p);
kfd_process_free_outstanding_kfd_bos(p); + svm_range_list_fini(p);
kfd_process_destroy_pdds(p); dma_fence_put(p->ef); @@ -1190,6 +1193,10 @@ static struct kfd_process *create_process(const struct task_struct *thread) if (err != 0) goto err_init_apertures;
+ err = svm_range_list_init(process); + if (err) + goto err_init_svm_range_list; + /* Must be last, have to use release destruction after this */ process->mmu_notifier.ops = &kfd_process_mmu_notifier_ops; err = mmu_notifier_register(&process->mmu_notifier, process->mm); @@ -1203,6 +1210,8 @@ static struct kfd_process *create_process(const struct task_struct *thread) return process;
err_register_notifier: + svm_range_list_fini(process); +err_init_svm_range_list: kfd_process_free_outstanding_kfd_bos(process); kfd_process_destroy_pdds(process); err_init_apertures: diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c new file mode 100644 index 000000000000..0b0410837be9 --- /dev/null +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -0,0 +1,603 @@ +/* + * Copyright 2020 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. 
+ */ + +#include <linux/types.h> +#include "amdgpu_sync.h" +#include "amdgpu_object.h" +#include "amdgpu_vm.h" +#include "amdgpu_mn.h" +#include "kfd_priv.h" +#include "kfd_svm.h" + +/** + * svm_range_unlink - unlink svm_range from lists and interval tree + * @prange: svm range structure to be removed + * + * Remove the svm range from svms interval tree and link list + * + * Context: The caller must hold svms_lock + */ +static void svm_range_unlink(struct svm_range *prange) +{ + pr_debug("prange 0x%p [0x%lx 0x%lx]\n", prange, prange->it_node.start, + prange->it_node.last); + + list_del_rcu(&prange->list); + interval_tree_remove(&prange->it_node, &prange->svms->objects); +} + +/** + * svm_range_add_to_svms - add svm range to svms + * @prange: svm range structure to be added + * + * Add the svm range to svms interval tree and link list + * + * Context: The caller must hold svms_lock + */ +static void svm_range_add_to_svms(struct svm_range *prange) +{ + pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, + prange->it_node.start, prange->it_node.last); + + list_add_tail_rcu(&prange->list, &prange->svms->list); + interval_tree_insert(&prange->it_node, &prange->svms->objects); +} + +static void svm_range_remove(struct svm_range *prange) +{ + pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, + prange->it_node.start, prange->it_node.last); + + kvfree(prange->pages_addr); + kfree(prange); +} + +static void +svm_range_set_default_attributes(int32_t *location, int32_t *prefetch_loc, + uint8_t *granularity, uint32_t *flags) +{ + *location = 0; + *prefetch_loc = 0; + *granularity = 9; + *flags = + KFD_IOCTL_SVM_FLAG_HOST_ACCESS | KFD_IOCTL_SVM_FLAG_COHERENT; +} + +static struct +svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start, + uint64_t last) +{ + uint64_t size = last - start + 1; + struct svm_range *prange; + + prange = kzalloc(sizeof(*prange), GFP_KERNEL); + if (!prange) + return NULL; + prange->npages = size; + prange->svms = svms; + 
prange->it_node.start = start; + prange->it_node.last = last; + INIT_LIST_HEAD(&prange->list); + INIT_LIST_HEAD(&prange->update_list); + INIT_LIST_HEAD(&prange->remove_list); + svm_range_set_default_attributes(&prange->preferred_loc, + &prange->prefetch_loc, + &prange->granularity, &prange->flags); + + pr_debug("svms 0x%p [0x%llx 0x%llx]\n", svms, start, last); + + return prange; +} + +static struct kfd_dev * +svm_get_supported_dev_by_id(struct kfd_process *p, uint32_t gpu_id, + int *r_gpuidx) +{ + struct kfd_dev *dev; + int gpuidx; + int r; + + gpuidx = kfd_process_gpuidx_from_gpuid(p, gpu_id); + if (gpuidx < 0) { + pr_debug("failed to get device by id 0x%x\n", gpu_id); + return NULL; + } + r = kfd_process_device_from_gpuidx(p, gpuidx, &dev); + if (r < 0) { + pr_debug("failed to get device by idx 0x%x\n", gpuidx); + return NULL; + } + if (dev->device_info->asic_family < CHIP_VEGA10) { + pr_debug("device id 0x%x does not support SVM\n", gpu_id); + return NULL; + } + if (r_gpuidx) + *r_gpuidx = gpuidx; + return dev; +} + +static int +svm_range_apply_attrs(struct kfd_process *p, struct svm_range *prange, + uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs) +{ + uint32_t i; + int gpuidx; + + for (i = 0; i < nattr; i++) { + switch (attrs[i].type) { + case KFD_IOCTL_SVM_ATTR_PREFERRED_LOC: + if (attrs[i].value != KFD_IOCTL_SVM_LOCATION_SYSMEM && + attrs[i].value != KFD_IOCTL_SVM_LOCATION_UNDEFINED && + !svm_get_supported_dev_by_id(p, attrs[i].value, NULL)) + return -EINVAL; + prange->preferred_loc = attrs[i].value; + break; + case KFD_IOCTL_SVM_ATTR_PREFETCH_LOC: + if (attrs[i].value != KFD_IOCTL_SVM_LOCATION_SYSMEM && + !svm_get_supported_dev_by_id(p, attrs[i].value, NULL)) + return -EINVAL; + prange->prefetch_loc = attrs[i].value; + break; + case KFD_IOCTL_SVM_ATTR_ACCESS: + case KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE: + case KFD_IOCTL_SVM_ATTR_NO_ACCESS: + if (!svm_get_supported_dev_by_id(p, attrs[i].value, + &gpuidx)) + return -EINVAL; + if (attrs[i].type == 
KFD_IOCTL_SVM_ATTR_NO_ACCESS) { + bitmap_clear(prange->bitmap_access, gpuidx, 1); + bitmap_clear(prange->bitmap_aip, gpuidx, 1); + } else if (attrs[i].type == KFD_IOCTL_SVM_ATTR_ACCESS) { + bitmap_set(prange->bitmap_access, gpuidx, 1); + bitmap_clear(prange->bitmap_aip, gpuidx, 1); + } else { + bitmap_clear(prange->bitmap_access, gpuidx, 1); + bitmap_set(prange->bitmap_aip, gpuidx, 1); + } + break; + case KFD_IOCTL_SVM_ATTR_SET_FLAGS: + prange->flags |= attrs[i].value; + break; + case KFD_IOCTL_SVM_ATTR_CLR_FLAGS: + prange->flags &= ~attrs[i].value; + break; + case KFD_IOCTL_SVM_ATTR_GRANULARITY: + prange->granularity = attrs[i].value; + break; + default: + pr_debug("unknown attr type 0x%x\n", attrs[i].type); + return -EINVAL; + } + } + + return 0; +} + +/** + * svm_range_debug_dump - print all range information from svms + * @svms: svm range list header + * + * debug output svm range start, end, pages_addr, prefetch location from svms + * interval tree and link list + * + * Context: The caller must hold svms_lock + */ +static void svm_range_debug_dump(struct svm_range_list *svms) +{ + struct interval_tree_node *node; + struct svm_range *prange; + + pr_debug("dump svms 0x%p list\n", svms); + pr_debug("range\tstart\tpage\tend\t\tpages_addr\tlocation\n"); + + /* Not using list_for_each_entry_rcu because the caller is holding the + * svms lock + */ + list_for_each_entry(prange, &svms->list, list) { + pr_debug("0x%lx\t0x%llx\t0x%llx\t0x%llx\t0x%x\n", + prange->it_node.start, prange->npages, + prange->it_node.start + prange->npages - 1, + prange->pages_addr ? 
*prange->pages_addr : 0, + prange->actual_loc); + } + + pr_debug("dump svms 0x%p interval tree\n", svms); + pr_debug("range\tstart\tpage\tend\t\tpages_addr\tlocation\n"); + node = interval_tree_iter_first(&svms->objects, 0, ~0ULL); + while (node) { + prange = container_of(node, struct svm_range, it_node); + pr_debug("0x%lx\t0x%llx\t0x%llx\t0x%llx\t0x%x\n", + prange->it_node.start, prange->npages, + prange->it_node.start + prange->npages - 1, + prange->pages_addr ? *prange->pages_addr : 0, + prange->actual_loc); + node = interval_tree_iter_next(node, 0, ~0ULL); + } +} + +/** + * svm_range_handle_overlap - split overlap ranges + * @svms: svm range list header + * @new: range added with these attributes + * @start: range added start address, in pages + * @last: range last address, in pages + * @update_list: output, the ranges attributes are updated. For set_attr, this + * will do validation and map to GPUs. For unmap, this will be + * removed and unmapped from GPUs + * @insert_list: output, the ranges will be inserted into svms, attributes are + * not changed. For set_attr, this will add into svms. For unmap, + * will remove duplicate range from update_list because it is + * unmapped, should not insert to svms. + * @remove_list:output, the ranges will be removed from svms + * @left: the remaining range after overlap. For set_attr, this will be added + * as new range. For unmap, this is ignored. + * + * There are 5 overlap cases in total.
+ * + * Context: The caller must hold svms_lock + */ +static int +svm_range_handle_overlap(struct svm_range_list *svms, struct svm_range *new, + unsigned long start, unsigned long last, + struct list_head *update_list, + struct list_head *insert_list, + struct list_head *remove_list, + unsigned long *left) +{ + struct interval_tree_node *node; + struct svm_range *prange; + struct svm_range *tmp; + int r = 0; + + INIT_LIST_HEAD(update_list); + INIT_LIST_HEAD(insert_list); + INIT_LIST_HEAD(remove_list); + + node = interval_tree_iter_first(&svms->objects, start, last); + while (node) { + struct interval_tree_node *next; + + pr_debug("found overlap node [0x%lx 0x%lx]\n", node->start, + node->last); + + prange = container_of(node, struct svm_range, it_node); + next = interval_tree_iter_next(node, start, last); + + if (node->start < start && node->last > last) { + pr_debug("split in 2 ranges\n"); + start = last + 1; + + } else if (node->start < start) { + /* + * For node->last == last, will exit loop + * for node->last < last, will continue in next loop + */ + uint64_t old_last = node->last; + + start = old_last + 1; + + } else if (node->start == start && node->last > last) { + pr_debug("change old range start\n"); + + start = last + 1; + + } else if (node->start == start) { + if (prange->it_node.last == last) + pr_debug("found exactly same range\n"); + else + pr_debug("next loop to add remaining range\n"); + + start = node->last + 1; + + } else { /* node->start > start */ + pr_debug("add new range at front\n"); + + start = node->last + 1; + } + + if (r) + goto out; + + node = next; + } + + if (left && start <= last) + *left = last - start + 1; + +out: + if (r) + list_for_each_entry_safe(prange, tmp, insert_list, list) + svm_range_remove(prange); + + return r; +} + +static void svm_range_srcu_free_work(struct work_struct *work_struct) +{ + struct svm_range_list *svms; + struct svm_range *prange; + struct svm_range *tmp; + + svms = container_of(work_struct, struct 
svm_range_list, srcu_free_work); + + synchronize_srcu(&svms->srcu); + + mutex_lock(&svms->free_list_lock); + list_for_each_entry_safe(prange, tmp, &svms->free_list, remove_list) { + list_del(&prange->remove_list); + svm_range_remove(prange); + } + mutex_unlock(&svms->free_list_lock); +} + +void svm_range_list_fini(struct kfd_process *p) +{ + pr_debug("pasid 0x%x svms 0x%p\n", p->pasid, &p->svms); + + /* Ensure srcu free work is finished before process is destroyed */ + flush_work(&p->svms.srcu_free_work); + cleanup_srcu_struct(&p->svms.srcu); + mutex_destroy(&p->svms.free_list_lock); +} + +int svm_range_list_init(struct kfd_process *p) +{ + struct svm_range_list *svms = &p->svms; + int r; + + svms->objects = RB_ROOT_CACHED; + mutex_init(&svms->lock); + INIT_LIST_HEAD(&svms->list); + r = init_srcu_struct(&svms->srcu); + if (r) { + pr_debug("failed %d to init srcu\n", r); + return r; + } + INIT_WORK(&svms->srcu_free_work, svm_range_srcu_free_work); + INIT_LIST_HEAD(&svms->free_list); + mutex_init(&svms->free_list_lock); + + return 0; +} + +/** + * svm_range_is_valid - check if virtual address range is valid + * @mm: current process mm_struct + * @start: range start address, in pages + * @size: range size, in pages + * + * Valid virtual address range means it belongs to one or more VMAs + * + * Context: Process context + * + * Return: + * true - valid svm range + * false - invalid svm range + */ +static bool +svm_range_is_valid(struct mm_struct *mm, uint64_t start, uint64_t size) +{ + const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP; + struct vm_area_struct *vma; + unsigned long end; + + start <<= PAGE_SHIFT; + end = start + (size << PAGE_SHIFT); + + do { + vma = find_vma(mm, start); + if (!vma || start < vma->vm_start || + (vma->vm_flags & device_vma)) + return false; + start = min(end, vma->vm_end); + } while (start < end); + + return true; +} + +/** + * svm_range_add - add svm range and handle overlap + * @p: the range add to this process svms + * 
@start: page size aligned + * @size: page size aligned + * @nattr: number of attributes + * @attrs: array of attributes + * @update_list: output, the ranges need validate and update GPU mapping + * @insert_list: output, the ranges need insert to svms + * @remove_list: output, the ranges are replaced and need remove from svms + * + * Check if the virtual address range has overlap with the registered ranges, + * split the overlapped range, copy and adjust pages address and vram nodes in + * old and new ranges. + * + * Context: Process context, takes and releases svms_lock + * + * Return: + * 0 - OK, otherwise error code + */ +static int +svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size, + uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs, + struct list_head *update_list, struct list_head *insert_list, + struct list_head *remove_list) +{ + uint64_t last = start + size - 1UL; + struct svm_range_list *svms; + struct svm_range new = {0}; + struct svm_range *prange; + unsigned long left = 0; + int r = 0; + + pr_debug("svms 0x%p [0x%llx 0x%llx]\n", &p->svms, start, last); + + r = svm_range_apply_attrs(p, &new, nattr, attrs); + if (r) + return r; + + svms = &p->svms; + + r = svm_range_handle_overlap(svms, &new, start, last, update_list, + insert_list, remove_list, &left); + if (r) + return r; + + if (left) { + prange = svm_range_new(svms, last - left + 1, last); + list_add(&prange->list, insert_list); + list_add(&prange->update_list, update_list); + } + + return 0; +} + +static int +svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size, + uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs) +{ + struct amdkfd_process_info *process_info = p->kgd_process_info; + struct mm_struct *mm = current->mm; + struct list_head update_list; + struct list_head insert_list; + struct list_head remove_list; + struct svm_range_list *svms; + struct svm_range *prange; + struct svm_range *tmp; + int srcu_idx; + int r = 0; + + pr_debug("pasid 0x%x svms 
0x%p [0x%llx 0x%llx] pages 0x%llx\n", + p->pasid, &p->svms, start, start + size - 1, size); + + mmap_read_lock(mm); + if (!svm_range_is_valid(mm, start, size)) { + pr_debug("invalid range\n"); + mmap_read_unlock(mm); + return -EFAULT; + } + mmap_read_unlock(mm); + + mutex_lock(&process_info->lock); + + svms = &p->svms; + svms_lock(svms); + + r = svm_range_add(p, start, size, nattr, attrs, &update_list, + &insert_list, &remove_list); + if (r) { + svms_unlock(svms); + mutex_unlock(&process_info->lock); + return r; + } + + list_for_each_entry_safe(prange, tmp, &insert_list, list) + svm_range_add_to_svms(prange); + + /* Hold read lock to prevent prange is removed after unlocking svms */ + srcu_idx = srcu_read_lock(&svms->srcu); + svms_unlock(svms); + + /* Hold mm->map_sem and check if svm range is unmapped in parallel */ + mmap_read_lock(mm); + + if (!svm_range_is_valid(mm, start, size)) { + pr_debug("range is unmapped\n"); + mmap_read_unlock(mm); + srcu_read_unlock(&svms->srcu, srcu_idx); + r = -EFAULT; + goto out_remove; + } + + list_for_each_entry(prange, &update_list, update_list) { + + r = svm_range_apply_attrs(p, prange, nattr, attrs); + if (r) { + pr_debug("failed %d to apply attrs\n", r); + mmap_read_unlock(mm); + srcu_read_unlock(&prange->svms->srcu, srcu_idx); + goto out_remove; + } + } + + srcu_read_unlock(&svms->srcu, srcu_idx); + svms_lock(svms); + + mutex_lock(&svms->free_list_lock); + list_for_each_entry_safe(prange, tmp, &remove_list, remove_list) { + pr_debug("remove overlap prange svms 0x%p [0x%lx 0x%lx]\n", + prange->svms, prange->it_node.start, + prange->it_node.last); + svm_range_unlink(prange); + + pr_debug("schedule to free prange svms 0x%p [0x%lx 0x%lx]\n", + prange->svms, prange->it_node.start, + prange->it_node.last); + list_add_tail(&prange->remove_list, &svms->free_list); + } + if (!list_empty(&svms->free_list)) + schedule_work(&svms->srcu_free_work); + mutex_unlock(&svms->free_list_lock); + + svm_range_debug_dump(svms); + + 
svms_unlock(svms); + mmap_read_unlock(mm); + mutex_unlock(&process_info->lock); + + pr_debug("pasid 0x%x svms 0x%p [0x%llx 0x%llx] done\n", p->pasid, + &p->svms, start, start + size - 1); + + return 0; + +out_remove: + svms_lock(svms); + list_for_each_entry_safe(prange, tmp, &insert_list, list) { + svm_range_unlink(prange); + list_add_tail(&prange->remove_list, &svms->free_list); + } + if (!list_empty(&svms->free_list)) + schedule_work(&svms->srcu_free_work); + svms_unlock(svms); + mutex_unlock(&process_info->lock); + + return r; +} + +int +svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start, + uint64_t size, uint32_t nattrs, struct kfd_ioctl_svm_attribute *attrs) +{ + int r; + + start >>= PAGE_SHIFT; + size >>= PAGE_SHIFT; + + switch (op) { + case KFD_IOCTL_SVM_OP_SET_ATTR: + r = svm_range_set_attr(p, start, size, nattrs, attrs); + break; + default: + r = -EINVAL; + break; + } + + return r; +} diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h new file mode 100644 index 000000000000..c7c54fb73dfb --- /dev/null +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h @@ -0,0 +1,93 @@ +/* + * Copyright 2020 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software.
+ * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. + * + */ + +#ifndef KFD_SVM_H_ +#define KFD_SVM_H_ + +#include <linux/rwsem.h> +#include <linux/list.h> +#include <linux/mutex.h> +#include <linux/sched/mm.h> +#include <linux/hmm.h> +#include "amdgpu.h" +#include "kfd_priv.h" + +/** + * struct svm_range - shared virtual memory range + * + * @svms: list of svm ranges, structure defined in kfd_process + * @it_node: node [start, last] stored in interval tree, start, last are page + * aligned, size in pages is (last - start + 1) + * @list: link list node, used to scan all ranges of svms + * @update_list:link list node used to add to update_list + * @remove_list:link list node used to add to remove list + * @npages: number of pages + * @pages_addr: list of system memory physical page address + * @flags: flags defined as KFD_IOCTL_SVM_FLAG_* + * @preferred_loc: preferred location, 0 for CPU, or GPU id + * @prefetch_loc: last prefetch location, 0 for CPU, or GPU id + * @actual_loc: the actual location, 0 for CPU, or GPU id + * @granularity:migration granularity, log2 num pages + * @bitmap_access: index bitmap of GPUs which can access the range + * @bitmap_aip: index bitmap of GPUs which can access the range in place + * + * Data structure for virtual memory range shared by CPU and GPUs, it can be + * allocated from system memory ram or device vram, and migrate from ram to vram + * or from vram to ram.
+ */ +struct svm_range { + struct svm_range_list *svms; + struct interval_tree_node it_node; + struct list_head list; + struct list_head update_list; + struct list_head remove_list; + uint64_t npages; + dma_addr_t *pages_addr; + uint32_t flags; + uint32_t preferred_loc; + uint32_t prefetch_loc; + uint32_t actual_loc; + uint8_t granularity; + DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE); + DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE); +}; + +static inline void svms_lock(struct svm_range_list *svms) +{ + mutex_lock(&svms->lock); + svms->saved_flags = memalloc_nofs_save(); +} + +static inline void svms_unlock(struct svm_range_list *svms) +{ + memalloc_nofs_restore(svms->saved_flags); + mutex_unlock(&svms->lock); +} + +int svm_range_list_init(struct kfd_process *p); +void svm_range_list_fini(struct kfd_process *p); +int svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start, + uint64_t size, uint32_t nattrs, + struct kfd_ioctl_svm_attribute *attrs); + +#endif /* KFD_SVM_H_ */
From: Philip Yang <Philip.Yang@amd.com>
Get the intersection of attributes over all memory in the given range.
Signed-off-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Sierra <alex.sierra@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 175 ++++++++++++++++++++++++++- 1 file changed, 173 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 0b0410837be9..017e77e9ae1e 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -75,8 +75,8 @@ static void svm_range_set_default_attributes(int32_t *location, int32_t *prefetch_loc, uint8_t *granularity, uint32_t *flags) { - *location = 0; - *prefetch_loc = 0; + *location = KFD_IOCTL_SVM_LOCATION_UNDEFINED; + *prefetch_loc = KFD_IOCTL_SVM_LOCATION_UNDEFINED; *granularity = 9; *flags = KFD_IOCTL_SVM_FLAG_HOST_ACCESS | KFD_IOCTL_SVM_FLAG_COHERENT; @@ -581,6 +581,174 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size, return r; }
+static int +svm_range_get_attr(struct kfd_process *p, uint64_t start, uint64_t size, + uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs) +{ + DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE); + DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE); + bool get_preferred_loc = false; + bool get_prefetch_loc = false; + bool get_granularity = false; + bool get_accessible = false; + bool get_flags = false; + uint64_t last = start + size - 1UL; + struct mm_struct *mm = current->mm; + uint8_t granularity = 0xff; + struct interval_tree_node *node; + struct svm_range_list *svms; + struct svm_range *prange; + uint32_t prefetch_loc = KFD_IOCTL_SVM_LOCATION_UNDEFINED; + uint32_t location = KFD_IOCTL_SVM_LOCATION_UNDEFINED; + uint32_t flags = 0xffffffff; + int gpuidx; + uint32_t i; + + pr_debug("svms 0x%p [0x%llx 0x%llx] nattr 0x%x\n", &p->svms, start, + start + size - 1, nattr); + + mmap_read_lock(mm); + if (!svm_range_is_valid(mm, start, size)) { + pr_debug("invalid range\n"); + mmap_read_unlock(mm); + return -EINVAL; + } + mmap_read_unlock(mm); + + for (i = 0; i < nattr; i++) { + switch (attrs[i].type) { + case KFD_IOCTL_SVM_ATTR_PREFERRED_LOC: + get_preferred_loc = true; + break; + case KFD_IOCTL_SVM_ATTR_PREFETCH_LOC: + get_prefetch_loc = true; + break; + case KFD_IOCTL_SVM_ATTR_ACCESS: + if (!svm_get_supported_dev_by_id( + p, attrs[i].value, NULL)) + return -EINVAL; + get_accessible = true; + break; + case KFD_IOCTL_SVM_ATTR_SET_FLAGS: + get_flags = true; + break; + case KFD_IOCTL_SVM_ATTR_GRANULARITY: + get_granularity = true; + break; + case KFD_IOCTL_SVM_ATTR_CLR_FLAGS: + case KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE: + case KFD_IOCTL_SVM_ATTR_NO_ACCESS: + fallthrough; + default: + pr_debug("get invalid attr type 0x%x\n", attrs[i].type); + return -EINVAL; + } + } + + svms = &p->svms; + + svms_lock(svms); + + node = interval_tree_iter_first(&svms->objects, start, last); + if (!node) { + pr_debug("range attrs not found return default values\n"); + 
svm_range_set_default_attributes(&location, &prefetch_loc, + &granularity, &flags); + /* TODO: Automatically create SVM ranges and map them on + * GPU page faults + if (p->xnack_enabled) + bitmap_fill(bitmap_access, MAX_GPU_INSTANCE); + FIXME: Only set bits for supported GPUs + FIXME: I think this should be done inside + svm_range_set_default_attributes, so that it will + apply to all newly created ranges + */ + + goto fill_values; + } + bitmap_fill(bitmap_access, MAX_GPU_INSTANCE); + bitmap_fill(bitmap_aip, MAX_GPU_INSTANCE); + + while (node) { + struct interval_tree_node *next; + + prange = container_of(node, struct svm_range, it_node); + next = interval_tree_iter_next(node, start, last); + + if (get_preferred_loc) { + if (prange->preferred_loc == + KFD_IOCTL_SVM_LOCATION_UNDEFINED || + (location != KFD_IOCTL_SVM_LOCATION_UNDEFINED && + location != prange->preferred_loc)) { + location = KFD_IOCTL_SVM_LOCATION_UNDEFINED; + get_preferred_loc = false; + } else { + location = prange->preferred_loc; + } + } + if (get_prefetch_loc) { + if (prange->prefetch_loc == + KFD_IOCTL_SVM_LOCATION_UNDEFINED || + (prefetch_loc != KFD_IOCTL_SVM_LOCATION_UNDEFINED && + prefetch_loc != prange->prefetch_loc)) { + prefetch_loc = KFD_IOCTL_SVM_LOCATION_UNDEFINED; + get_prefetch_loc = false; + } else { + prefetch_loc = prange->prefetch_loc; + } + } + if (get_accessible) { + bitmap_and(bitmap_access, bitmap_access, + prange->bitmap_access, MAX_GPU_INSTANCE); + bitmap_and(bitmap_aip, bitmap_aip, + prange->bitmap_aip, MAX_GPU_INSTANCE); + } + if (get_flags) + flags &= prange->flags; + + if (get_granularity && prange->granularity < granularity) + granularity = prange->granularity; + + node = next; + } +fill_values: + svms_unlock(svms); + + for (i = 0; i < nattr; i++) { + switch (attrs[i].type) { + case KFD_IOCTL_SVM_ATTR_PREFERRED_LOC: + attrs[i].value = location; + break; + case KFD_IOCTL_SVM_ATTR_PREFETCH_LOC: + attrs[i].value = prefetch_loc; + break; + case KFD_IOCTL_SVM_ATTR_ACCESS: + 
gpuidx = kfd_process_gpuidx_from_gpuid(p, + attrs[i].value); + if (gpuidx < 0) { + pr_debug("invalid gpuid %x\n", attrs[i].value); + return -EINVAL; + } + if (test_bit(gpuidx, bitmap_access)) + attrs[i].type = KFD_IOCTL_SVM_ATTR_ACCESS; + else if (test_bit(gpuidx, bitmap_aip)) + attrs[i].type = + KFD_IOCTL_SVM_ATTR_ACCESS_IN_PLACE; + else + attrs[i].type = KFD_IOCTL_SVM_ATTR_NO_ACCESS; + break; + case KFD_IOCTL_SVM_ATTR_SET_FLAGS: + attrs[i].value = flags; + break; + case KFD_IOCTL_SVM_ATTR_GRANULARITY: + attrs[i].value = (uint32_t)granularity; + break; + } + } + + return 0; +} + int svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start, uint64_t size, uint32_t nattrs, struct kfd_ioctl_svm_attribute *attrs) @@ -594,6 +762,9 @@ svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start, case KFD_IOCTL_SVM_OP_SET_ATTR: r = svm_range_set_attr(p, start, size, nattrs, attrs); break; + case KFD_IOCTL_SVM_OP_GET_ATTR: + r = svm_range_get_attr(p, start, size, nattrs, attrs); + break; default: r = EINVAL; break;
From: Philip Yang <Philip.Yang@amd.com>
Move the HMM get pages function from amdgpu_ttm to amdgpu_mn. This common function will be used by the new svm APIs.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 83 +++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h  |  7 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 76 +++-------------------
 3 files changed, 100 insertions(+), 66 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c index 828b5167ff12..997da4237a10 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c @@ -155,3 +155,86 @@ void amdgpu_mn_unregister(struct amdgpu_bo *bo) mmu_interval_notifier_remove(&bo->notifier); bo->notifier.mm = NULL; } + +int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier *notifier, + struct mm_struct *mm, struct page **pages, + uint64_t start, uint64_t npages, + struct hmm_range **phmm_range, bool readonly, + bool mmap_locked) +{ + struct hmm_range *hmm_range; + unsigned long timeout; + unsigned long i; + unsigned long *pfns; + int r = 0; + + hmm_range = kzalloc(sizeof(*hmm_range), GFP_KERNEL); + if (unlikely(!hmm_range)) + return -ENOMEM; + + pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL); + if (unlikely(!pfns)) { + r = -ENOMEM; + goto out_free_range; + } + + hmm_range->notifier = notifier; + hmm_range->default_flags = HMM_PFN_REQ_FAULT; + if (!readonly) + hmm_range->default_flags |= HMM_PFN_REQ_WRITE; + hmm_range->hmm_pfns = pfns; + hmm_range->start = start; + hmm_range->end = start + npages * PAGE_SIZE; + timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT); + +retry: + hmm_range->notifier_seq = mmu_interval_read_begin(notifier); + + if (likely(!mmap_locked)) + mmap_read_lock(mm); + + r = hmm_range_fault(hmm_range); + + if (likely(!mmap_locked)) + mmap_read_unlock(mm); + if (unlikely(r)) { + /* + * FIXME: This timeout should encompass the retry from + * mmu_interval_read_retry() as well. + */ + if (r == -EBUSY && !time_after(jiffies, timeout)) + goto retry; + goto out_free_pfns; + } + + /* + * Due to default_flags, all pages are HMM_PFN_VALID or + * hmm_range_fault() fails. FIXME: The pages cannot be touched outside + * the notifier_lock, and mmu_interval_read_retry() must be done first. 
+ */ + for (i = 0; pages && i < npages; i++) + pages[i] = hmm_pfn_to_page(pfns[i]); + + *phmm_range = hmm_range; + + return 0; + +out_free_pfns: + kvfree(pfns); +out_free_range: + kfree(hmm_range); + + return r; +} + +int amdgpu_hmm_range_get_pages_done(struct hmm_range *hmm_range) +{ + int r; + + r = mmu_interval_read_retry(hmm_range->notifier, + hmm_range->notifier_seq); + kvfree(hmm_range->hmm_pfns); + kfree(hmm_range); + + return r; +} diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h index a292238f75eb..7f7d37a457c3 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h @@ -30,6 +30,13 @@ #include <linux/workqueue.h> #include <linux/interval_tree.h>
+int amdgpu_hmm_range_get_pages(struct mmu_interval_notifier *notifier, + struct mm_struct *mm, struct page **pages, + uint64_t start, uint64_t npages, + struct hmm_range **phmm_range, bool readonly, + bool mmap_locked); +int amdgpu_hmm_range_get_pages_done(struct hmm_range *hmm_range); + #if defined(CONFIG_HMM_MIRROR) int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr); void amdgpu_mn_unregister(struct amdgpu_bo *bo); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index aaad9e304ad9..f423f42cb9b5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -32,7 +32,6 @@
#include <linux/dma-mapping.h> #include <linux/iommu.h> -#include <linux/hmm.h> #include <linux/pagemap.h> #include <linux/sched/task.h> #include <linux/sched/mm.h> @@ -843,10 +842,8 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages) struct amdgpu_ttm_tt *gtt = (void *)ttm; unsigned long start = gtt->userptr; struct vm_area_struct *vma; - struct hmm_range *range; - unsigned long timeout; struct mm_struct *mm; - unsigned long i; + bool readonly; int r = 0;
mm = bo->notifier.mm; @@ -862,76 +859,26 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages) if (!mmget_not_zero(mm)) /* Happens during process shutdown */ return -ESRCH;
- range = kzalloc(sizeof(*range), GFP_KERNEL); - if (unlikely(!range)) { - r = -ENOMEM; - goto out; - } - range->notifier = &bo->notifier; - range->start = bo->notifier.interval_tree.start; - range->end = bo->notifier.interval_tree.last + 1; - range->default_flags = HMM_PFN_REQ_FAULT; - if (!amdgpu_ttm_tt_is_readonly(ttm)) - range->default_flags |= HMM_PFN_REQ_WRITE; - - range->hmm_pfns = kvmalloc_array(ttm->num_pages, - sizeof(*range->hmm_pfns), GFP_KERNEL); - if (unlikely(!range->hmm_pfns)) { - r = -ENOMEM; - goto out_free_ranges; - } - mmap_read_lock(mm); vma = find_vma(mm, start); + mmap_read_unlock(mm); if (unlikely(!vma || start < vma->vm_start)) { r = -EFAULT; - goto out_unlock; + goto out_putmm; } if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) && vma->vm_file)) { r = -EPERM; - goto out_unlock; + goto out_putmm; } - mmap_read_unlock(mm); - timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT); - -retry: - range->notifier_seq = mmu_interval_read_begin(&bo->notifier);
- mmap_read_lock(mm); - r = hmm_range_fault(range); - mmap_read_unlock(mm); - if (unlikely(r)) { - /* - * FIXME: This timeout should encompass the retry from - * mmu_interval_read_retry() as well. - */ - if (r == -EBUSY && !time_after(jiffies, timeout)) - goto retry; - goto out_free_pfns; - } - - /* - * Due to default_flags, all pages are HMM_PFN_VALID or - * hmm_range_fault() fails. FIXME: The pages cannot be touched outside - * the notifier_lock, and mmu_interval_read_retry() must be done first. - */ - for (i = 0; i < ttm->num_pages; i++) - pages[i] = hmm_pfn_to_page(range->hmm_pfns[i]); - - gtt->range = range; + readonly = amdgpu_ttm_tt_is_readonly(ttm); + r = amdgpu_hmm_range_get_pages(&bo->notifier, mm, pages, start, + ttm->num_pages, &gtt->range, readonly, + false); +out_putmm: mmput(mm);
- return 0; - -out_unlock: - mmap_read_unlock(mm); -out_free_pfns: - kvfree(range->hmm_pfns); -out_free_ranges: - kfree(range); -out: - mmput(mm); return r; }
@@ -960,10 +907,7 @@ bool amdgpu_ttm_tt_get_user_pages_done(struct ttm_tt *ttm) * FIXME: Must always hold notifier_lock for this, and must * not ignore the return code. */ - r = mmu_interval_read_retry(gtt->range->notifier, - gtt->range->notifier_seq); - kvfree(gtt->range->hmm_pfns); - kfree(gtt->range); + r = amdgpu_hmm_range_get_pages_done(gtt->range); gtt->range = NULL; }
On 2021-01-07 04:01, Felix Kuehling wrote:
> From: Philip Yang <Philip.Yang@amd.com>
>
> Move the HMM get pages function from amdgpu_ttm and to amdgpu_mn. This
> common function will be used by new svm APIs.
>
> Signed-off-by: Philip Yang <Philip.Yang@amd.com>
> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>

Acked-by: Christian König <christian.koenig@amd.com>
From: Philip Yang <Philip.Yang@amd.com>
Use HMM to get system memory page addresses, which will be used to map the memory to GPUs or migrate it to vram.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h |  1 +
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c  | 88 +++++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h  |  2 +
 3 files changed, 91 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index cbb2bae1982d..97cf267b6f51 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -735,6 +735,7 @@ struct svm_range_list { struct work_struct srcu_free_work; struct list_head free_list; struct mutex free_list_lock; + struct mmu_interval_notifier notifier; };
/* Process data */ diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 017e77e9ae1e..02918faa70d5 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -135,6 +135,65 @@ svm_get_supported_dev_by_id(struct kfd_process *p, uint32_t gpu_id, return dev; }
+/** + * svm_range_validate_ram - get system memory pages of svm range + * + * @mm: the mm_struct of process + * @prange: the range struct + * + * After mapping system memory to GPU, system memory maybe invalidated anytime + * during application running, we use HMM callback to sync GPU with CPU page + * table update, so we don't need use lock to prevent CPU invalidation and check + * hmm_range_get_pages_done return value. + * + * Return: + * 0 - OK, otherwise error code + */ +static int +svm_range_validate_ram(struct mm_struct *mm, struct svm_range *prange) +{ + uint64_t i; + int r; + + if (!prange->pages_addr) { + prange->pages_addr = kvmalloc_array(prange->npages, + sizeof(*prange->pages_addr), + GFP_KERNEL | __GFP_ZERO); + if (!prange->pages_addr) + return -ENOMEM; + } + + r = amdgpu_hmm_range_get_pages(&prange->svms->notifier, mm, NULL, + prange->it_node.start << PAGE_SHIFT, + prange->npages, &prange->hmm_range, + false, true); + if (r) { + pr_debug("failed %d to get svm range pages\n", r); + return r; + } + + for (i = 0; i < prange->npages; i++) + prange->pages_addr[i] = + PFN_PHYS(prange->hmm_range->hmm_pfns[i]); + + amdgpu_hmm_range_get_pages_done(prange->hmm_range); + prange->hmm_range = NULL; + + return 0; +} + +static int +svm_range_validate(struct mm_struct *mm, struct svm_range *prange) +{ + int r = 0; + + pr_debug("actual loc 0x%x\n", prange->actual_loc); + + r = svm_range_validate_ram(mm, prange); + + return r; +} + static int svm_range_apply_attrs(struct kfd_process *p, struct svm_range *prange, uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs) @@ -349,10 +408,28 @@ static void svm_range_srcu_free_work(struct work_struct *work_struct) mutex_unlock(&svms->free_list_lock); }
+/** + * svm_range_cpu_invalidate_pagetables - interval notifier callback + * + */ +static bool +svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni, + const struct mmu_notifier_range *range, + unsigned long cur_seq) +{ + return true; +} + +static const struct mmu_interval_notifier_ops svm_range_mn_ops = { + .invalidate = svm_range_cpu_invalidate_pagetables, +}; + void svm_range_list_fini(struct kfd_process *p) { pr_debug("pasid 0x%x svms 0x%p\n", p->pasid, &p->svms);
+ mmu_interval_notifier_remove(&p->svms.notifier); + /* Ensure srcu free work is finished before process is destroyed */ flush_work(&p->svms.srcu_free_work); cleanup_srcu_struct(&p->svms.srcu); @@ -375,6 +452,8 @@ int svm_range_list_init(struct kfd_process *p) INIT_WORK(&svms->srcu_free_work, svm_range_srcu_free_work); INIT_LIST_HEAD(&svms->free_list); mutex_init(&svms->free_list_lock); + mmu_interval_notifier_insert(&svms->notifier, current->mm, 0, ~1ULL, + &svm_range_mn_ops);
return 0; } @@ -531,6 +610,15 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size, r = svm_range_apply_attrs(p, prange, nattr, attrs); if (r) { pr_debug("failed %d to apply attrs\n", r); + goto out_unlock; + } + + r = svm_range_validate(mm, prange); + if (r) + pr_debug("failed %d to validate svm range\n", r); + +out_unlock: + if (r) { mmap_read_unlock(mm); srcu_read_unlock(&prange->svms->srcu, srcu_idx); goto out_remove; diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h index c7c54fb73dfb..4d394f72eefc 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h @@ -41,6 +41,7 @@ * @list: link list node, used to scan all ranges of svms * @update_list:link list node used to add to update_list * @remove_list:link list node used to add to remove list + * @hmm_range: hmm range structure used by hmm_range_fault to get system pages * @npages: number of pages * @pages_addr: list of system memory physical page address * @flags: flags defined as KFD_IOCTL_SVM_FLAG_* @@ -61,6 +62,7 @@ struct svm_range { struct list_head list; struct list_head update_list; struct list_head remove_list; + struct hmm_range *hmm_range; uint64_t npages; dma_addr_t *pages_addr; uint32_t flags;
From: Philip Yang <Philip.Yang@amd.com>
No overlapping range intervals [start, last] may exist in the svms object interval tree. If a process registers a new range that overlaps an old range, the old range is split into two ranges, depending on whether the overlap is at the head or the tail of the old range.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 297 ++++++++++++++++++++++++++-
 1 file changed, 294 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 02918faa70d5..ad007261f54c 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -293,6 +293,278 @@ static void svm_range_debug_dump(struct svm_range_list *svms) } }
+static bool +svm_range_is_same_attrs(struct svm_range *old, struct svm_range *new) +{ + return (old->prefetch_loc == new->prefetch_loc && + old->flags == new->flags && + old->granularity == new->granularity); +} + +static int +svm_range_split_pages(struct svm_range *new, struct svm_range *old, + uint64_t start, uint64_t last) +{ + unsigned long old_start; + dma_addr_t *pages_addr; + uint64_t d; + + old_start = old->it_node.start; + new->pages_addr = kvmalloc_array(new->npages, + sizeof(*new->pages_addr), + GFP_KERNEL | __GFP_ZERO); + if (!new->pages_addr) + return -ENOMEM; + + d = new->it_node.start - old_start; + memcpy(new->pages_addr, old->pages_addr + d, + new->npages * sizeof(*new->pages_addr)); + + old->npages = last - start + 1; + old->it_node.start = start; + old->it_node.last = last; + + pages_addr = kvmalloc_array(old->npages, sizeof(*pages_addr), + GFP_KERNEL); + if (!pages_addr) { + kvfree(new->pages_addr); + return -ENOMEM; + } + + d = start - old_start; + memcpy(pages_addr, old->pages_addr + d, + old->npages * sizeof(*pages_addr)); + + kvfree(old->pages_addr); + old->pages_addr = pages_addr; + + return 0; +} + +/** + * svm_range_split_adjust - split range and adjust + * + * @new: new range + * @old: the old range + * @start: the old range adjust to start address in pages + * @last: the old range adjust to last address in pages + * + * Copy system memory pages, pages_addr or vram mm_nodes in old range to new + * range from new_start up to size new->npages, the remaining old range is from + * start to last + * + * Return: + * 0 - OK, -ENOMEM - out of memory + */ +static int +svm_range_split_adjust(struct svm_range *new, struct svm_range *old, + uint64_t start, uint64_t last) +{ + int r = -EINVAL; + + pr_debug("svms 0x%p new 0x%lx old [0x%lx 0x%lx] => [0x%llx 0x%llx]\n", + new->svms, new->it_node.start, old->it_node.start, + old->it_node.last, start, last); + + if (new->it_node.start < old->it_node.start || + new->it_node.last > old->it_node.last) { + 
WARN_ONCE(1, "invalid new range start or last\n"); + return -EINVAL; + } + + if (old->pages_addr) + r = svm_range_split_pages(new, old, start, last); + else + WARN_ONCE(1, "split adjust invalid pages_addr and nodes\n"); + if (r) + return r; + + new->flags = old->flags; + new->preferred_loc = old->preferred_loc; + new->prefetch_loc = old->prefetch_loc; + new->actual_loc = old->actual_loc; + new->granularity = old->granularity; + bitmap_copy(new->bitmap_access, old->bitmap_access, MAX_GPU_INSTANCE); + bitmap_copy(new->bitmap_aip, old->bitmap_aip, MAX_GPU_INSTANCE); + + return 0; +} + +/** + * svm_range_split - split a range in 2 ranges + * + * @prange: the svm range to split + * @start: the remaining range start address in pages + * @last: the remaining range last address in pages + * @new: the result new range generated + * + * Two cases only: + * case 1: if start == prange->it_node.start + * prange ==> prange[start, last] + * new range [last + 1, prange->it_node.last] + * + * case 2: if last == prange->it_node.last + * prange ==> prange[start, last] + * new range [prange->it_node.start, start - 1] + * + * Context: Caller hold svms->rw_sem as write mode + * + * Return: + * 0 - OK, -ENOMEM - out of memory, -EINVAL - invalid start, last + */ +static int +svm_range_split(struct svm_range *prange, uint64_t start, uint64_t last, + struct svm_range **new) +{ + uint64_t old_start = prange->it_node.start; + uint64_t old_last = prange->it_node.last; + struct svm_range_list *svms; + int r = 0; + + pr_debug("svms 0x%p [0x%llx 0x%llx] to [0x%llx 0x%llx]\n", prange->svms, + old_start, old_last, start, last); + + if (old_start != start && old_last != last) + return -EINVAL; + if (start < old_start || last > old_last) + return -EINVAL; + + svms = prange->svms; + if (old_start == start) { + *new = svm_range_new(svms, last + 1, old_last); + if (!*new) + return -ENOMEM; + r = svm_range_split_adjust(*new, prange, start, last); + } else { + *new = svm_range_new(svms, old_start, start - 
1); + if (!*new) + return -ENOMEM; + r = svm_range_split_adjust(*new, prange, start, last); + } + + return r; +} + +static int +svm_range_split_two(struct svm_range *prange, struct svm_range *new, + uint64_t start, uint64_t last, + struct list_head *insert_list, + struct list_head *update_list) +{ + struct svm_range *tail, *tail2; + int r; + + r = svm_range_split(prange, prange->it_node.start, start - 1, &tail); + if (r) + return r; + r = svm_range_split(tail, start, last, &tail2); + if (r) + return r; + list_add(&tail2->list, insert_list); + list_add(&tail->list, insert_list); + + if (!svm_range_is_same_attrs(prange, new)) + list_add(&tail->update_list, update_list); + + return 0; +} + +static int +svm_range_split_tail(struct svm_range *prange, struct svm_range *new, + uint64_t start, struct list_head *insert_list, + struct list_head *update_list) +{ + struct svm_range *tail; + int r; + + r = svm_range_split(prange, prange->it_node.start, start - 1, &tail); + if (r) + return r; + list_add(&tail->list, insert_list); + if (!svm_range_is_same_attrs(prange, new)) + list_add(&tail->update_list, update_list); + + return 0; +} + +static int +svm_range_split_head(struct svm_range *prange, struct svm_range *new, + uint64_t last, struct list_head *insert_list, + struct list_head *update_list) +{ + struct svm_range *head; + int r; + + r = svm_range_split(prange, last + 1, prange->it_node.last, &head); + if (r) + return r; + list_add(&head->list, insert_list); + if (!svm_range_is_same_attrs(prange, new)) + list_add(&head->update_list, update_list); + + return 0; +} + +static int +svm_range_split_add_front(struct svm_range *prange, struct svm_range *new, + uint64_t start, uint64_t last, + struct list_head *insert_list, + struct list_head *update_list) +{ + struct svm_range *front, *tail; + int r = 0; + + front = svm_range_new(prange->svms, start, prange->it_node.start - 1); + if (!front) + return -ENOMEM; + + list_add(&front->list, insert_list); + list_add(&front->update_list, 
update_list); + + if (prange->it_node.last > last) { + pr_debug("split old in 2\n"); + r = svm_range_split(prange, prange->it_node.start, last, &tail); + if (r) + return r; + list_add(&tail->list, insert_list); + } + if (!svm_range_is_same_attrs(prange, new)) + list_add(&prange->update_list, update_list); + + return 0; +} + +struct svm_range *svm_range_clone(struct svm_range *old) +{ + struct svm_range *new; + + new = svm_range_new(old->svms, old->it_node.start, old->it_node.last); + if (!new) + return NULL; + + if (old->pages_addr) { + new->pages_addr = kvmalloc_array(new->npages, + sizeof(*new->pages_addr), + GFP_KERNEL); + if (!new->pages_addr) { + kfree(new); + return NULL; + } + memcpy(new->pages_addr, old->pages_addr, + old->npages * sizeof(*old->pages_addr)); + } + + new->flags = old->flags; + new->preferred_loc = old->preferred_loc; + new->prefetch_loc = old->prefetch_loc; + new->actual_loc = old->actual_loc; + new->granularity = old->granularity; + bitmap_copy(new->bitmap_access, old->bitmap_access, MAX_GPU_INSTANCE); + bitmap_copy(new->bitmap_aip, old->bitmap_aip, MAX_GPU_INSTANCE); + + return new; +} + /** * svm_range_handle_overlap - split overlap ranges * @svms: svm range list header @@ -334,15 +606,27 @@ svm_range_handle_overlap(struct svm_range_list *svms, struct svm_range *new, node = interval_tree_iter_first(&svms->objects, start, last); while (node) { struct interval_tree_node *next; + struct svm_range *old;
pr_debug("found overlap node [0x%lx 0x%lx]\n", node->start, node->last);
- prange = container_of(node, struct svm_range, it_node); + old = container_of(node, struct svm_range, it_node); next = interval_tree_iter_next(node, start, last);
+ prange = svm_range_clone(old); + if (!prange) { + r = -ENOMEM; + goto out; + } + + list_add(&old->remove_list, remove_list); + list_add(&prange->list, insert_list); + if (node->start < start && node->last > last) { pr_debug("split in 2 ranges\n"); + r = svm_range_split_two(prange, new, start, last, + insert_list, update_list); start = last + 1;
} else if (node->start < start) { @@ -352,11 +636,15 @@ svm_range_handle_overlap(struct svm_range_list *svms, struct svm_range *new, */ uint64_t old_last = node->last;
+ pr_debug("change old range last\n"); + r = svm_range_split_tail(prange, new, start, + insert_list, update_list); start = old_last + 1;
} else if (node->start == start && node->last > last) { pr_debug("change old range start\n"); - + r = svm_range_split_head(prange, new, last, + insert_list, update_list); start = last + 1;
} else if (node->start == start) { @@ -364,12 +652,15 @@ svm_range_handle_overlap(struct svm_range_list *svms, struct svm_range *new, pr_debug("found exactly same range\n"); else pr_debug("next loop to add remaining range\n"); + if (!svm_range_is_same_attrs(prange, new)) + list_add(&prange->update_list, update_list);
start = node->last + 1;
} else { /* node->start > start */ pr_debug("add new range at front\n"); - + r = svm_range_split_add_front(prange, new, start, last, + insert_list, update_list); start = node->last + 1; }
From: Philip Yang <Philip.Yang@amd.com>
When the application explicitly calls unmap, or when memory is unmapped from mmput as the application exits, the driver receives an MMU_NOTIFY_UNMAP event. Remove the svm range from the process svms object tree and list first, then unmap it from the GPUs (in the following patch).
Split the svm range to handle unmapping of a partial svm range.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 86 ++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index ad007261f54c..55500ec4972f 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -699,15 +699,101 @@ static void svm_range_srcu_free_work(struct work_struct *work_struct) mutex_unlock(&svms->free_list_lock); }
+static void +svm_range_unmap_from_cpu(struct mm_struct *mm, unsigned long start, + unsigned long last) +{ + struct list_head remove_list; + struct list_head update_list; + struct list_head insert_list; + struct svm_range_list *svms; + struct svm_range new = {0}; + struct svm_range *prange; + struct svm_range *tmp; + struct kfd_process *p; + int r; + + p = kfd_lookup_process_by_mm(mm); + if (!p) + return; + svms = &p->svms; + + pr_debug("notifier svms 0x%p [0x%lx 0x%lx]\n", svms, start, last); + + svms_lock(svms); + + r = svm_range_handle_overlap(svms, &new, start, last, &update_list, + &insert_list, &remove_list, NULL); + if (r) { + svms_unlock(svms); + kfd_unref_process(p); + return; + } + + mutex_lock(&svms->free_list_lock); + list_for_each_entry_safe(prange, tmp, &remove_list, remove_list) { + pr_debug("remove svms 0x%p [0x%lx 0x%lx]\n", prange->svms, + prange->it_node.start, prange->it_node.last); + svm_range_unlink(prange); + + pr_debug("schedule to free svms 0x%p [0x%lx 0x%lx]\n", + prange->svms, prange->it_node.start, + prange->it_node.last); + list_add_tail(&prange->remove_list, &svms->free_list); + } + if (!list_empty(&svms->free_list)) + schedule_work(&svms->srcu_free_work); + mutex_unlock(&svms->free_list_lock); + + /* prange in update_list is unmapping from cpu, remove it from insert + * list + */ + list_for_each_entry_safe(prange, tmp, &update_list, update_list) { + list_del(&prange->list); + mutex_lock(&svms->free_list_lock); + list_add_tail(&prange->remove_list, &svms->free_list); + mutex_unlock(&svms->free_list_lock); + } + mutex_lock(&svms->free_list_lock); + if (!list_empty(&svms->free_list)) + schedule_work(&svms->srcu_free_work); + mutex_unlock(&svms->free_list_lock); + + list_for_each_entry_safe(prange, tmp, &insert_list, list) + svm_range_add_to_svms(prange); + + svms_unlock(svms); + kfd_unref_process(p); +} + /** * svm_range_cpu_invalidate_pagetables - interval notifier callback * + * MMU range unmap notifier to remove svm ranges */ static 
bool svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni, const struct mmu_notifier_range *range, unsigned long cur_seq) { + unsigned long start = range->start >> PAGE_SHIFT; + unsigned long last = (range->end - 1) >> PAGE_SHIFT; + struct svm_range_list *svms; + + svms = container_of(mni, struct svm_range_list, notifier); + + if (range->event == MMU_NOTIFY_RELEASE) { + pr_debug("cpu release range [0x%lx 0x%lx]\n", range->start, + range->end - 1); + return true; + } + if (range->event == MMU_NOTIFY_UNMAP) { + pr_debug("mm 0x%p unmap range [0x%lx 0x%lx]\n", range->mm, + start, last); + svm_range_unmap_from_cpu(mni->mm, start, last); + return true; + } + return true; }
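For reference, the splitting arithmetic that svm_range_handle_overlap relies on when a CPU unmap covers part of a registered range can be sketched in plain userspace C. range_split_unmap below is a hypothetical illustration of the interval math only, not a function from this series:

```c
#include <assert.h>

struct range { unsigned long start, last; }; /* inclusive page numbers */

/* Split an existing range against an unmapped interval [start, last].
 * Pieces of the existing range outside the unmapped interval survive;
 * the overlapping middle goes away. Returns the number of surviving
 * pieces (0, 1 or 2) written to out[]. */
static int range_split_unmap(struct range r, unsigned long start,
			     unsigned long last, struct range *out)
{
	int n = 0;

	if (last < r.start || start > r.last) {
		out[n++] = r;	/* no overlap, range survives whole */
		return n;
	}
	if (r.start < start)	/* head piece before the unmap */
		out[n++] = (struct range){ r.start, start - 1 };
	if (r.last > last)	/* tail piece after the unmap */
		out[n++] = (struct range){ last + 1, r.last };
	return n;
}
```

A range fully covered by the unmap yields no pieces and ends up on the remove list; a partially covered one yields the clipped remainders.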
From: Philip Yang Philip.Yang@amd.com
This will be used by KFD to map SVM ranges to GPUs. Because an SVM range has no amdgpu_bo and no bo_va, it cannot use the amdgpu_bo_update interface; instead it calls the amdgpu VM update interface directly.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 17 ++++++++--------- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 10 ++++++++++ 2 files changed, 18 insertions(+), 9 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index fdbe7d4e8b8b..9c557e8bf0e5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -1589,15 +1589,14 @@ static int amdgpu_vm_update_ptes(struct amdgpu_vm_update_params *params, * Returns: * 0 for success, -EINVAL for failure. */ -static int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev, - struct amdgpu_device *bo_adev, - struct amdgpu_vm *vm, bool immediate, - bool unlocked, struct dma_resv *resv, - uint64_t start, uint64_t last, - uint64_t flags, uint64_t offset, - struct drm_mm_node *nodes, - dma_addr_t *pages_addr, - struct dma_fence **fence) +int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev, + struct amdgpu_device *bo_adev, + struct amdgpu_vm *vm, bool immediate, + bool unlocked, struct dma_resv *resv, + uint64_t start, uint64_t last, uint64_t flags, + uint64_t offset, struct drm_mm_node *nodes, + dma_addr_t *pages_addr, + struct dma_fence **fence) { struct amdgpu_vm_update_params params; enum amdgpu_sync_mode sync_mode; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h index 2bf4ef5fb3e1..73ca630520fd 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h @@ -366,6 +366,8 @@ struct amdgpu_vm_manager { spinlock_t pasid_lock; };
+struct amdgpu_bo_va_mapping; + #define amdgpu_vm_copy_pte(adev, ib, pe, src, count) ((adev)->vm_manager.vm_pte_funcs->copy_pte((ib), (pe), (src), (count))) #define amdgpu_vm_write_pte(adev, ib, pe, value, count, incr) ((adev)->vm_manager.vm_pte_funcs->write_pte((ib), (pe), (value), (count), (incr))) #define amdgpu_vm_set_pte_pde(adev, ib, pe, addr, count, incr, flags) ((adev)->vm_manager.vm_pte_funcs->set_pte_pde((ib), (pe), (addr), (count), (incr), (flags))) @@ -397,6 +399,14 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev, struct dma_fence **fence); int amdgpu_vm_handle_moved(struct amdgpu_device *adev, struct amdgpu_vm *vm); +int amdgpu_vm_bo_update_mapping(struct amdgpu_device *adev, + struct amdgpu_device *bo_adev, + struct amdgpu_vm *vm, bool immediate, + bool unlocked, struct dma_resv *resv, + uint64_t start, uint64_t last, uint64_t flags, + uint64_t offset, struct drm_mm_node *nodes, + dma_addr_t *pages_addr, + struct dma_fence **fence); int amdgpu_vm_bo_update(struct amdgpu_device *adev, struct amdgpu_bo_va *bo_va, bool clear);
On 07.01.21 at 04:01, Felix Kuehling wrote:
From: Philip Yang Philip.Yang@amd.com
This will be used by KFD to map SVM ranges to GPUs. Because an SVM range has no amdgpu_bo and no bo_va, it cannot use the amdgpu_bo_update interface; instead it calls the amdgpu VM update interface directly.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com
Reviewed-by: Christian König christian.koenig@amd.com
From: Philip Yang Philip.Yang@amd.com
Use amdgpu_vm_bo_update_mapping to update the GPU page tables, mapping or unmapping the system memory page addresses of an SVM range on the GPUs.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 232 ++++++++++++++++++++++++++- drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 2 + 2 files changed, 233 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 55500ec4972f..3c4a036609c4 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -534,6 +534,229 @@ svm_range_split_add_front(struct svm_range *prange, struct svm_range *new, return 0; }
+static uint64_t +svm_range_get_pte_flags(struct amdgpu_device *adev, struct svm_range *prange) +{ + uint32_t flags = prange->flags; + uint32_t mapping_flags; + uint64_t pte_flags; + + pte_flags = AMDGPU_PTE_VALID; + pte_flags |= AMDGPU_PTE_SYSTEM | AMDGPU_PTE_SNOOPED; + + mapping_flags = AMDGPU_VM_PAGE_READABLE | AMDGPU_VM_PAGE_WRITEABLE; + + if (flags & KFD_IOCTL_SVM_FLAG_GPU_RO) + mapping_flags &= ~AMDGPU_VM_PAGE_WRITEABLE; + if (flags & KFD_IOCTL_SVM_FLAG_GPU_EXEC) + mapping_flags |= AMDGPU_VM_PAGE_EXECUTABLE; + if (flags & KFD_IOCTL_SVM_FLAG_COHERENT) + mapping_flags |= AMDGPU_VM_MTYPE_UC; + else + mapping_flags |= AMDGPU_VM_MTYPE_NC; + + /* TODO: add CHIP_ARCTURUS new flags for vram mapping */ + + pte_flags |= amdgpu_gem_va_map_flags(adev, mapping_flags); + + /* Apply ASIC specific mapping flags */ + amdgpu_gmc_get_vm_pte(adev, &prange->mapping, &pte_flags); + + pr_debug("PTE flags 0x%llx\n", pte_flags); + + return pte_flags; +} + +static int +svm_range_unmap_from_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm, + struct svm_range *prange, struct dma_fence **fence) +{ + uint64_t init_pte_value = 0; + uint64_t start; + uint64_t last; + + start = prange->it_node.start; + last = prange->it_node.last; + + pr_debug("svms 0x%p [0x%llx 0x%llx]\n", prange->svms, start, last); + + return amdgpu_vm_bo_update_mapping(adev, adev, vm, false, true, NULL, + start, last, init_pte_value, 0, + NULL, NULL, fence); +} + +static int +svm_range_unmap_from_gpus(struct svm_range *prange) +{ + DECLARE_BITMAP(bitmap, MAX_GPU_INSTANCE); + struct kfd_process_device *pdd; + struct dma_fence *fence = NULL; + struct amdgpu_device *adev; + struct kfd_process *p; + struct kfd_dev *dev; + uint32_t gpuidx; + int r = 0; + + bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip, + MAX_GPU_INSTANCE); + p = container_of(prange->svms, struct kfd_process, svms); + + for_each_set_bit(gpuidx, bitmap, MAX_GPU_INSTANCE) { + pr_debug("unmap from gpu idx 0x%x\n", gpuidx); + r = 
kfd_process_device_from_gpuidx(p, gpuidx, &dev); + if (r) { + pr_debug("failed to find device idx %d\n", gpuidx); + return -EINVAL; + } + + pdd = kfd_bind_process_to_device(dev, p); + if (IS_ERR(pdd)) + return -EINVAL; + + adev = (struct amdgpu_device *)dev->kgd; + + r = svm_range_unmap_from_gpu(adev, pdd->vm, prange, &fence); + if (r) + break; + + if (fence) { + r = dma_fence_wait(fence, false); + dma_fence_put(fence); + fence = NULL; + if (r) + break; + } + + amdgpu_amdkfd_flush_gpu_tlb_pasid((struct kgd_dev *)adev, + p->pasid); + } + + return r; +} + +static int svm_range_bo_validate(void *param, struct amdgpu_bo *bo) +{ + struct ttm_operation_ctx ctx = { false, false }; + + amdgpu_bo_placement_from_domain(bo, AMDGPU_GEM_DOMAIN_VRAM); + + return ttm_bo_validate(&bo->tbo, &bo->placement, &ctx); +} + +static int +svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm, + struct svm_range *prange, bool reserve_vm, + struct dma_fence **fence) +{ + struct amdgpu_bo *root; + dma_addr_t *pages_addr; + uint64_t pte_flags; + int r = 0; + + pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, + prange->it_node.start, prange->it_node.last); + + if (reserve_vm) { + root = amdgpu_bo_ref(vm->root.base.bo); + r = amdgpu_bo_reserve(root, true); + if (r) { + pr_debug("failed %d to reserve root bo\n", r); + amdgpu_bo_unref(&root); + goto out; + } + r = amdgpu_vm_validate_pt_bos(adev, vm, svm_range_bo_validate, + NULL); + if (r) { + pr_debug("failed %d validate pt bos\n", r); + goto unreserve_out; + } + } + + prange->mapping.start = prange->it_node.start; + prange->mapping.last = prange->it_node.last; + prange->mapping.offset = 0; + pte_flags = svm_range_get_pte_flags(adev, prange); + prange->mapping.flags = pte_flags; + pages_addr = prange->pages_addr; + + r = amdgpu_vm_bo_update_mapping(adev, adev, vm, false, false, NULL, + prange->mapping.start, + prange->mapping.last, pte_flags, + prange->mapping.offset, NULL, + pages_addr, &vm->last_update); + if (r) { + 
pr_debug("failed %d to map to gpu 0x%lx\n", r, + prange->it_node.start); + goto unreserve_out; + } + + + r = amdgpu_vm_update_pdes(adev, vm, false); + if (r) { + pr_debug("failed %d to update directories 0x%lx\n", r, + prange->it_node.start); + goto unreserve_out; + } + + if (fence) + *fence = dma_fence_get(vm->last_update); + +unreserve_out: + if (reserve_vm) { + amdgpu_bo_unreserve(root); + amdgpu_bo_unref(&root); + } + +out: + return r; +} + +static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm) +{ + DECLARE_BITMAP(bitmap, MAX_GPU_INSTANCE); + struct kfd_process_device *pdd; + struct amdgpu_device *adev; + struct kfd_process *p; + struct kfd_dev *dev; + struct dma_fence *fence = NULL; + uint32_t gpuidx; + int r = 0; + + bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip, + MAX_GPU_INSTANCE); + p = container_of(prange->svms, struct kfd_process, svms); + + for_each_set_bit(gpuidx, bitmap, MAX_GPU_INSTANCE) { + r = kfd_process_device_from_gpuidx(p, gpuidx, &dev); + if (r) { + pr_debug("failed to find device idx %d\n", gpuidx); + return -EINVAL; + } + + pdd = kfd_bind_process_to_device(dev, p); + if (IS_ERR(pdd)) + return -EINVAL; + adev = (struct amdgpu_device *)dev->kgd; + + r = svm_range_map_to_gpu(adev, pdd->vm, prange, reserve_vm, + &fence); + if (r) + break; + + if (fence) { + r = dma_fence_wait(fence, false); + dma_fence_put(fence); + fence = NULL; + if (r) { + pr_debug("failed %d to dma fence wait\n", r); + break; + } + } + } + + return r; +} + struct svm_range *svm_range_clone(struct svm_range *old) { struct svm_range *new; @@ -750,6 +973,7 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, unsigned long start, */ list_for_each_entry_safe(prange, tmp, &update_list, update_list) { list_del(&prange->list); + svm_range_unmap_from_gpus(prange); mutex_lock(&svms->free_list_lock); list_add_tail(&prange->remove_list, &svms->free_list); mutex_unlock(&svms->free_list_lock); @@ -991,8 +1215,14 @@ svm_range_set_attr(struct kfd_process *p, 
uint64_t start, uint64_t size, }
r = svm_range_validate(mm, prange); - if (r) + if (r) { pr_debug("failed %d to validate svm range\n", r); + goto out_unlock; + } + + r = svm_range_map_to_gpus(prange, true); + if (r) + pr_debug("failed %d to map svm range\n", r);
out_unlock: if (r) { diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h index 4d394f72eefc..fb68b5ee54f8 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h @@ -42,6 +42,7 @@ * @update_list:link list node used to add to update_list * @remove_list:link list node used to add to remove list * @hmm_range: hmm range structure used by hmm_range_fault to get system pages + * @mapping: bo_va mapping structure to create and update GPU page table * @npages: number of pages * @pages_addr: list of system memory physical page address * @flags: flags defined as KFD_IOCTL_SVM_FLAG_* @@ -63,6 +64,7 @@ struct svm_range { struct list_head update_list; struct list_head remove_list; struct hmm_range *hmm_range; + struct amdgpu_bo_va_mapping mapping; uint64_t npages; dma_addr_t *pages_addr; uint32_t flags;
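The flag selection in svm_range_get_pte_flags boils down to simple bit manipulation. The sketch below models it in userspace C with illustrative stand-in bit values; the real AMDGPU_PTE_*, AMDGPU_VM_* and KFD_IOCTL_SVM_FLAG_* bits live in the amdgpu and KFD UAPI headers and are not reproduced here:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel's flag bits. */
enum {
	SVM_FLAG_GPU_RO   = 1 << 0,
	SVM_FLAG_GPU_EXEC = 1 << 1,
	SVM_FLAG_COHERENT = 1 << 2,
};
enum {
	MAP_READABLE   = 1 << 0,
	MAP_WRITEABLE  = 1 << 1,
	MAP_EXECUTABLE = 1 << 2,
	MTYPE_UC       = 1 << 3,
	MTYPE_NC       = 1 << 4,
};

/* Mirrors the decision tree in svm_range_get_pte_flags: start from
 * read+write, then strip or add bits according to the SVM range flags. */
static uint32_t svm_mapping_flags(uint32_t flags)
{
	uint32_t mapping = MAP_READABLE | MAP_WRITEABLE;

	if (flags & SVM_FLAG_GPU_RO)
		mapping &= ~MAP_WRITEABLE;
	if (flags & SVM_FLAG_GPU_EXEC)
		mapping |= MAP_EXECUTABLE;
	mapping |= (flags & SVM_FLAG_COHERENT) ? MTYPE_UC : MTYPE_NC;
	return mapping;
}
```

The kernel code additionally folds in ASIC-specific PTE bits via amdgpu_gem_va_map_flags and amdgpu_gmc_get_vm_pte, which this model omits.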
From: Philip Yang Philip.Yang@amd.com
The HMM interval notifier callback signals that the CPU page table is about to be updated. Stop the process queues if the updated address belongs to an SVM range registered in the process's svms object tree. Schedule restore work to update the GPU page table using the new page addresses of the updated SVM range.
The SVM restore work uses SRCU to scan the svms list, avoiding a deadlock between the two cases below:
Case 1: SVM restore work takes the svm lock to scan the svms list, then calls hmm_page_fault, which takes mm->mmap_sem. Case 2: the unmap event callback and the set_attr ioctl take mm->mmap_sem, then take the svm lock to add/remove ranges.
Calling synchronize_srcu in the unmap event callback would deadlock with the restore work, because the restore work may be waiting for the unmap event to complete before it can take mm->mmap_sem. Instead, schedule srcu_free_work to wait until the SRCU read-side critical section in the SVM restore work is done, and only then free the SVM ranges.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 2 + drivers/gpu/drm/amd/amdkfd/kfd_process.c | 1 + drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 169 ++++++++++++++++++++++- drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 2 + 4 files changed, 169 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index 97cf267b6f51..f1e95773e19b 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -736,6 +736,8 @@ struct svm_range_list { struct list_head free_list; struct mutex free_list_lock; struct mmu_interval_notifier notifier; + atomic_t evicted_ranges; + struct delayed_work restore_work; };
/* Process data */ diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c index 791f17308b1b..0f31538b2a91 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c @@ -1048,6 +1048,7 @@ static void kfd_process_notifier_release(struct mmu_notifier *mn,
cancel_delayed_work_sync(&p->eviction_work); cancel_delayed_work_sync(&p->restore_work); + cancel_delayed_work_sync(&p->svms.restore_work);
mutex_lock(&p->mutex);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 3c4a036609c4..e3ba6e7262a7 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -21,6 +21,7 @@ */
#include <linux/types.h> +#include <linux/sched/task.h> #include "amdgpu_sync.h" #include "amdgpu_object.h" #include "amdgpu_vm.h" @@ -28,6 +29,8 @@ #include "kfd_priv.h" #include "kfd_svm.h"
+#define AMDGPU_SVM_RANGE_RESTORE_DELAY_MS 1 + /** * svm_range_unlink - unlink svm_range from lists and interval tree * @prange: svm range structure to be removed @@ -99,6 +102,7 @@ svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start, INIT_LIST_HEAD(&prange->list); INIT_LIST_HEAD(&prange->update_list); INIT_LIST_HEAD(&prange->remove_list); + atomic_set(&prange->invalid, 0); svm_range_set_default_attributes(&prange->preferred_loc, &prange->prefetch_loc, &prange->granularity, &prange->flags); @@ -191,6 +195,10 @@ svm_range_validate(struct mm_struct *mm, struct svm_range *prange)
r = svm_range_validate_ram(mm, prange);
+ pr_debug("svms 0x%p [0x%lx 0x%lx] ret %d invalid %d\n", prange->svms, + prange->it_node.start, prange->it_node.last, + r, atomic_read(&prange->invalid)); + return r; }
@@ -757,6 +765,151 @@ static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm) return r; }
+static void svm_range_restore_work(struct work_struct *work) +{ + struct delayed_work *dwork = to_delayed_work(work); + struct amdkfd_process_info *process_info; + struct svm_range_list *svms; + struct svm_range *prange; + struct kfd_process *p; + struct mm_struct *mm; + int evicted_ranges; + int srcu_idx; + int invalid; + int r; + + svms = container_of(dwork, struct svm_range_list, restore_work); + evicted_ranges = atomic_read(&svms->evicted_ranges); + if (!evicted_ranges) + return; + + pr_debug("restore svm ranges\n"); + + /* kfd_process_notifier_release destroys this worker thread. So during + * the lifetime of this thread, kfd_process and mm will be valid. + */ + p = container_of(svms, struct kfd_process, svms); + process_info = p->kgd_process_info; + mm = p->mm; + if (!mm) + return; + + mutex_lock(&process_info->lock); + mmap_read_lock(mm); + srcu_idx = srcu_read_lock(&svms->srcu); + + list_for_each_entry_rcu(prange, &svms->list, list) { + invalid = atomic_read(&prange->invalid); + if (!invalid) + continue; + + pr_debug("restoring svms 0x%p [0x%lx %lx] invalid %d\n", + prange->svms, prange->it_node.start, + prange->it_node.last, invalid); + + r = svm_range_validate(mm, prange); + if (r) { + pr_debug("failed %d to validate [0x%lx 0x%lx]\n", r, + prange->it_node.start, prange->it_node.last); + + goto unlock_out; + } + + r = svm_range_map_to_gpus(prange, true); + if (r) { + pr_debug("failed %d to map 0x%lx to gpu\n", r, + prange->it_node.start); + goto unlock_out; + } + + if (atomic_cmpxchg(&prange->invalid, invalid, 0) != invalid) + goto unlock_out; + } + + if (atomic_cmpxchg(&svms->evicted_ranges, evicted_ranges, 0) != + evicted_ranges) + goto unlock_out; + + evicted_ranges = 0; + + r = kgd2kfd_resume_mm(mm); + if (r) { + /* No recovery from this failure. Probably the CP is + * hanging. No point trying again. 
+ */ + pr_debug("failed %d to resume KFD\n", r); + } + + pr_debug("restore svm ranges successfully\n"); + +unlock_out: + srcu_read_unlock(&svms->srcu, srcu_idx); + mmap_read_unlock(mm); + mutex_unlock(&process_info->lock); + + /* If validation failed, reschedule another attempt */ + if (evicted_ranges) { + pr_debug("reschedule to restore svm range\n"); + schedule_delayed_work(&svms->restore_work, + msecs_to_jiffies(AMDGPU_SVM_RANGE_RESTORE_DELAY_MS)); + } +} + +/** + * svm_range_evict - evict svm range + * + * Stop all queues of the process to ensure GPU doesn't access the memory, then + * return to let CPU evict the buffer and proceed CPU pagetable update. + * + * Don't need use lock to sync cpu pagetable invalidation with GPU execution. + * If invalidation happens while restore work is running, restore work will + * restart to ensure to get the latest CPU pages mapping to GPU, then start + * the queues. + */ +static int +svm_range_evict(struct svm_range_list *svms, struct mm_struct *mm, + unsigned long start, unsigned long last) +{ + int invalid, evicted_ranges; + int r = 0; + struct interval_tree_node *node; + struct svm_range *prange; + + svms_lock(svms); + + pr_debug("invalidate svms 0x%p [0x%lx 0x%lx]\n", svms, start, last); + + node = interval_tree_iter_first(&svms->objects, start, last); + while (node) { + struct interval_tree_node *next; + + prange = container_of(node, struct svm_range, it_node); + next = interval_tree_iter_next(node, start, last); + + invalid = atomic_inc_return(&prange->invalid); + evicted_ranges = atomic_inc_return(&svms->evicted_ranges); + if (evicted_ranges == 1) { + pr_debug("evicting svms 0x%p range [0x%lx 0x%lx]\n", + prange->svms, prange->it_node.start, + prange->it_node.last); + + /* First eviction, stop the queues */ + r = kgd2kfd_quiesce_mm(mm); + if (r) + pr_debug("failed to quiesce KFD\n"); + + pr_debug("schedule to restore svm %p ranges\n", svms); + schedule_delayed_work(&svms->restore_work, + 
msecs_to_jiffies(AMDGPU_SVM_RANGE_RESTORE_DELAY_MS)); + } + node = next; + } + + svms_unlock(svms); + + return r; +} + struct svm_range *svm_range_clone(struct svm_range *old) { struct svm_range *new; @@ -994,6 +1147,11 @@ svm_range_unmap_from_cpu(struct mm_struct *mm, unsigned long start, * svm_range_cpu_invalidate_pagetables - interval notifier callback * * MMU range unmap notifier to remove svm ranges + * + * If GPU vm fault retry is not enabled, evict the svm range, then restore + * work will update GPU mapping. + * If GPU vm fault retry is enabled, unmap the svm range from GPU, vm fault + * will update GPU mapping. */ static bool svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni, @@ -1009,15 +1167,14 @@ svm_range_cpu_invalidate_pagetables(struct mmu_interval_notifier *mni, if (range->event == MMU_NOTIFY_RELEASE) { pr_debug("cpu release range [0x%lx 0x%lx]\n", range->start, range->end - 1); - return true; - } - if (range->event == MMU_NOTIFY_UNMAP) { + } else if (range->event == MMU_NOTIFY_UNMAP) { pr_debug("mm 0x%p unmap range [0x%lx 0x%lx]\n", range->mm, start, last); svm_range_unmap_from_cpu(mni->mm, start, last); - return true; + } else { + mmu_interval_set_seq(mni, cur_seq); + svm_range_evict(svms, mni->mm, start, last); } - return true; }
@@ -1045,6 +1202,8 @@ int svm_range_list_init(struct kfd_process *p) svms->objects = RB_ROOT_CACHED; mutex_init(&svms->lock); INIT_LIST_HEAD(&svms->list); + atomic_set(&svms->evicted_ranges, 0); + INIT_DELAYED_WORK(&svms->restore_work, svm_range_restore_work); r = init_srcu_struct(&svms->srcu); if (r) { pr_debug("failed %d to init srcu\n", r); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h index fb68b5ee54f8..4c7daf8e0b6f 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h @@ -50,6 +50,7 @@ * @perfetch_loc: last prefetch location, 0 for CPU, or GPU id * @actual_loc: the actual location, 0 for CPU, or GPU id * @granularity:migration granularity, log2 num pages + * @invalid: not 0 means cpu page table is invalidated * @bitmap_access: index bitmap of GPUs which can access the range * @bitmap_aip: index bitmap of GPUs which can access the range in place * @@ -72,6 +73,7 @@ struct svm_range { uint32_t prefetch_loc; uint32_t actual_loc; uint8_t granularity; + atomic_t invalid; DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE); DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE); };
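The evict/restore handshake between svm_range_evict and svm_range_restore_work can be modeled with C11 atomics. This is a single-threaded sketch under the assumption that quiesce()/resume() stand in for kgd2kfd_quiesce_mm()/kgd2kfd_resume_mm(): the first eviction stops the queues, and restore only resumes them if no new eviction raced in between:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int evicted_ranges;
static int quiesce_calls, resume_calls;

/* Stand-ins for kgd2kfd_quiesce_mm / kgd2kfd_resume_mm. */
static void quiesce(void) { quiesce_calls++; }
static void resume(void)  { resume_calls++; }

/* First eviction stops the queues; later ones only bump the count. */
static void evict_range(void)
{
	if (atomic_fetch_add(&evicted_ranges, 1) == 0)
		quiesce();
}

/* Restore resumes the queues only if no new eviction raced in between;
 * returns false when it must be rescheduled (as restore_work does with
 * schedule_delayed_work). */
static bool restore_ranges(void)
{
	int seen = atomic_load(&evicted_ranges);

	if (!seen)
		return true;
	/* ...revalidate and remap all invalid ranges here... */
	if (!atomic_compare_exchange_strong(&evicted_ranges, &seen, 0))
		return false;	/* raced with another eviction, try again */
	resume();
	return true;
}
```

The cmpxchg back to zero is what makes a concurrent eviction force another restore pass instead of resuming queues with stale mappings.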
From: Alex Sierra alex.sierra@amd.com
This flag is used in the CPU page table invalidation path to decide between queue eviction and page fault handling.
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 4 +++ drivers/gpu/drm/amd/amdkfd/kfd_process.c | 36 ++++++++++++++++++++++++ 2 files changed, 40 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index f1e95773e19b..7a4b4b6dcf32 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -821,6 +821,8 @@ struct kfd_process {
/* shared virtual memory registered by this process */ struct svm_range_list svms; + + bool xnack_enabled; };
#define KFD_PROCESS_TABLE_SIZE 5 /* bits: 32 entries */ @@ -874,6 +876,8 @@ struct kfd_process_device *kfd_get_process_device_data(struct kfd_dev *dev, struct kfd_process_device *kfd_create_process_device_data(struct kfd_dev *dev, struct kfd_process *p);
+bool kfd_process_xnack_supported(struct kfd_process *p); + int kfd_reserved_mem_mmap(struct kfd_dev *dev, struct kfd_process *process, struct vm_area_struct *vma);
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c index 0f31538b2a91..f7a50a364d78 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c @@ -1157,6 +1157,39 @@ static int kfd_process_device_init_cwsr_dgpu(struct kfd_process_device *pdd) return 0; }
+bool kfd_process_xnack_supported(struct kfd_process *p) +{ + int i; + + /* On most GFXv9 GPUs, the retry mode in the SQ must match the + * boot time retry setting. Mixing processes with different + * XNACK/retry settings can hang the GPU. + * + * Different GPUs can have different noretry settings depending + * on HW bugs or limitations. We need to find at least one + * XNACK mode for this process that's compatible with all GPUs. + * Fortunately GPUs with retry enabled (noretry=0) can run code + * built for XNACK-off. On GFXv9 it may perform slower. + * + * Therefore applications built for XNACK-off can always be + * supported and will be our fallback if any GPU does not + * support retry. + */ + for (i = 0; i < p->n_pdds; i++) { + struct kfd_dev *dev = p->pdds[i]->dev; + + /* Only consider GFXv9 and higher GPUs. Older GPUs don't + * support the SVM APIs and don't need to be considered + * for the XNACK mode selection. + */ + if (dev->device_info->asic_family >= CHIP_VEGA10 && + dev->noretry) + return false; + } + + return true; +} + /* * On return the kfd_process is fully operational and will be freed when the * mm is released @@ -1194,6 +1227,9 @@ static struct kfd_process *create_process(const struct task_struct *thread) if (err != 0) goto err_init_apertures;
+ /* Check XNACK support after PDDs are created in kfd_init_apertures */ + process->xnack_enabled = kfd_process_xnack_supported(process); + err = svm_range_list_init(process); if (err) goto err_init_svm_range_list;
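The policy implemented by kfd_process_xnack_supported reduces to a scan over the process's devices. A minimal userspace model, with struct gpu as a hypothetical stand-in for the per-device info the kernel consults:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the per-device data (asic_family, noretry). */
struct gpu {
	bool gfxv9_or_later;
	bool noretry;
};

/* XNACK can be enabled for the process only if every GFXv9-or-later GPU
 * runs with retry enabled (noretry == 0); pre-GFXv9 GPUs don't support
 * the SVM APIs and are ignored, matching the kernel loop. */
static bool xnack_supported(const struct gpu *gpus, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (gpus[i].gfxv9_or_later && gpus[i].noretry)
			return false;
	return true;
}
```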
From: Alex Sierra alex.sierra@amd.com
XNACK retries are used for page fault recovery. Some AMD chip families support continuously retrying accesses while page table entries are invalid. The driver must handle the page fault interrupt and fill in a valid entry for the GPU to continue.
This ioctl allows enabling or disabling XNACK retries per KFD process.
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 28 +++++++++++++++ include/uapi/linux/kfd_ioctl.h | 43 +++++++++++++++++++++++- 2 files changed, 70 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c index 2d3ba7e806d5..a9a6a7c8ff21 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c @@ -1747,6 +1747,31 @@ static int kfd_ioctl_smi_events(struct file *filep, return kfd_smi_event_open(dev, &args->anon_fd); }
+static int kfd_ioctl_set_xnack_mode(struct file *filep, + struct kfd_process *p, void *data) +{ + struct kfd_ioctl_set_xnack_mode_args *args = data; + int r = 0; + + mutex_lock(&p->mutex); + if (args->xnack_enabled >= 0) { + if (!list_empty(&p->pqm.queues)) { + pr_debug("Process has user queues running\n"); + mutex_unlock(&p->mutex); + return -EBUSY; + } + if (args->xnack_enabled && !kfd_process_xnack_supported(p)) + r = -EPERM; + else + p->xnack_enabled = args->xnack_enabled; + } else { + args->xnack_enabled = p->xnack_enabled; + } + mutex_unlock(&p->mutex); + + return r; +} + static int kfd_ioctl_svm(struct file *filep, struct kfd_process *p, void *data) { struct kfd_ioctl_svm_args *args = data; @@ -1870,6 +1895,9 @@ static const struct amdkfd_ioctl_desc amdkfd_ioctls[] = { kfd_ioctl_smi_events, 0),
AMDKFD_IOCTL_DEF(AMDKFD_IOC_SVM, kfd_ioctl_svm, 0), + + AMDKFD_IOCTL_DEF(AMDKFD_IOC_SET_XNACK_MODE, + kfd_ioctl_set_xnack_mode, 0), };
#define AMDKFD_CORE_IOCTL_COUNT ARRAY_SIZE(amdkfd_ioctls) diff --git a/include/uapi/linux/kfd_ioctl.h b/include/uapi/linux/kfd_ioctl.h index 5d4a4b3e0b61..b1a45cd37ab7 100644 --- a/include/uapi/linux/kfd_ioctl.h +++ b/include/uapi/linux/kfd_ioctl.h @@ -593,6 +593,44 @@ struct kfd_ioctl_svm_args { struct kfd_ioctl_svm_attribute attrs[0]; };
+/** + * kfd_ioctl_set_xnack_mode_args - Arguments for set_xnack_mode + * + * @xnack_enabled: [in/out] Whether to enable XNACK mode for this process + * + * @xnack_enabled indicates whether recoverable page faults should be + * enabled for the current process. 0 means disabled, positive means + * enabled, negative means leave unchanged. If enabled, virtual address + * translations on GFXv9 and later AMD GPUs can return XNACK and retry + * the access until a valid PTE is available. This is used to implement + * device page faults. + * + * On output, @xnack_enabled returns the (new) current mode (0 or + * positive). Therefore, a negative input value can be used to query + * the current mode without changing it. + * + * The XNACK mode fundamentally changes the way SVM managed memory works + * in the driver, with subtle effects on application performance and + * functionality. + * + * Enabling XNACK mode requires shader programs to be compiled + * differently. Furthermore, not all GPUs support changing the mode + * per-process. Therefore changing the mode is only allowed while no + * user mode queues exist in the process. This ensures that no shader + * code is running that may be compiled for the wrong mode. GPUs + * that cannot change to the requested mode will prevent the mode + * change. All GPUs used by the process must be in the + * same XNACK mode. + * + * GFXv8 or older GPUs do not support 48 bit virtual addresses or SVM. + * Therefore those GPUs are not considered for the XNACK mode switch. + * + * Return: 0 on success, -errno on failure + */ +struct kfd_ioctl_set_xnack_mode_args { + __s32 xnack_enabled; +}; + #define AMDKFD_IOCTL_BASE 'K' #define AMDKFD_IO(nr) _IO(AMDKFD_IOCTL_BASE, nr) #define AMDKFD_IOR(nr, type) _IOR(AMDKFD_IOCTL_BASE, nr, type) @@ -695,7 +733,10 @@ struct kfd_ioctl_svm_args {
#define AMDKFD_IOC_SVM AMDKFD_IOWR(0x20, struct kfd_ioctl_svm_args)
+#define AMDKFD_IOC_SET_XNACK_MODE \ + AMDKFD_IOWR(0x21, struct kfd_ioctl_set_xnack_mode_args) + #define AMDKFD_COMMAND_START 0x01 -#define AMDKFD_COMMAND_END 0x21 +#define AMDKFD_COMMAND_END 0x22
#endif
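From user mode, the tri-state semantics of xnack_enabled (negative queries, non-negative sets) can be modeled as a pure function. This sketch hard-codes the kernel-side state as globals for illustration; it is not the real ioctl path, and process_xnack, xnack_hw_supported and queues_exist are invented names:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Invented stand-ins for kernel-side state. */
static bool process_xnack;
static bool xnack_hw_supported = true;
static bool queues_exist;

/* Mirrors kfd_ioctl_set_xnack_mode: non-negative input sets the mode
 * (rejected while user queues exist, or if a GPU can't run the mode);
 * negative input only queries the current mode. */
static int set_xnack_mode(int *xnack_enabled)
{
	if (*xnack_enabled >= 0) {
		if (queues_exist)
			return -EBUSY;	/* no mode switch with live queues */
		if (*xnack_enabled && !xnack_hw_supported)
			return -EPERM;	/* some GPU can't run this mode */
		process_xnack = *xnack_enabled;
	} else {
		*xnack_enabled = process_xnack;	/* query only */
	}
	return 0;
}
```

A caller can therefore pass -1 once at startup to learn the current mode without risking -EBUSY from an unintended mode change.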
From: Philip Yang Philip.Yang@amd.com
Register VRAM as a MEMORY_DEVICE_PRIVATE resource, so that VRAM backing pages can be allocated for page migration.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 3 + drivers/gpu/drm/amd/amdkfd/Makefile | 3 +- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 101 +++++++++++++++++++++ drivers/gpu/drm/amd/amdkfd/kfd_migrate.h | 48 ++++++++++ drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 + 5 files changed, 157 insertions(+), 1 deletion(-) create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c index db96d69eb45e..562bb5b69137 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c @@ -30,6 +30,7 @@ #include <linux/dma-buf.h> #include "amdgpu_xgmi.h" #include <uapi/linux/kfd_ioctl.h> +#include "kfd_migrate.h"
/* Total memory size in system memory and all GPU VRAM. Used to * estimate worst case amount of memory to reserve for page tables @@ -170,12 +171,14 @@ void amdgpu_amdkfd_device_init(struct amdgpu_device *adev) }
kgd2kfd_device_init(adev->kfd.dev, adev_to_drm(adev), &gpu_resources); + svm_migrate_init(adev); } }
void amdgpu_amdkfd_device_fini(struct amdgpu_device *adev) { if (adev->kfd.dev) { + svm_migrate_fini(adev); kgd2kfd_device_exit(adev->kfd.dev); adev->kfd.dev = NULL; } diff --git a/drivers/gpu/drm/amd/amdkfd/Makefile b/drivers/gpu/drm/amd/amdkfd/Makefile index 387ce0217d35..a93301dbc464 100644 --- a/drivers/gpu/drm/amd/amdkfd/Makefile +++ b/drivers/gpu/drm/amd/amdkfd/Makefile @@ -55,7 +55,8 @@ AMDKFD_FILES := $(AMDKFD_PATH)/kfd_module.o \ $(AMDKFD_PATH)/kfd_dbgmgr.o \ $(AMDKFD_PATH)/kfd_smi_events.o \ $(AMDKFD_PATH)/kfd_crat.o \ - $(AMDKFD_PATH)/kfd_svm.o + $(AMDKFD_PATH)/kfd_svm.o \ + $(AMDKFD_PATH)/kfd_migrate.o
ifneq ($(CONFIG_AMD_IOMMU_V2),) AMDKFD_FILES += $(AMDKFD_PATH)/kfd_iommu.o diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c new file mode 100644 index 000000000000..1950b86f1562 --- /dev/null +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -0,0 +1,101 @@ +/* + * Copyright 2020 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. 
+ */
+
+#include <linux/types.h>
+#include <linux/hmm.h>
+#include <linux/dma-direction.h>
+#include <linux/dma-mapping.h>
+#include "amdgpu_sync.h"
+#include "amdgpu_object.h"
+#include "amdgpu_vm.h"
+#include "amdgpu_mn.h"
+#include "kfd_priv.h"
+#include "kfd_svm.h"
+#include "kfd_migrate.h"
+
+static void svm_migrate_page_free(struct page *page)
+{
+}
+
+/**
+ * svm_migrate_to_ram - CPU page fault handler
+ * @vmf: CPU vm fault vma, address
+ *
+ * Context: vm fault handler, mm->mmap_sem is taken
+ *
+ * Return:
+ * 0 - OK
+ * VM_FAULT_SIGBUS - notify the application with a SIGBUS page fault
+ */
+static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static const struct dev_pagemap_ops svm_migrate_pgmap_ops = {
+	.page_free		= svm_migrate_page_free,
+	.migrate_to_ram		= svm_migrate_to_ram,
+};
+
+int svm_migrate_init(struct amdgpu_device *adev)
+{
+	struct kfd_dev *kfddev = adev->kfd.dev;
+	struct dev_pagemap *pgmap;
+	struct resource *res;
+	unsigned long size;
+	void *r;
+
+	/* Page migration works on Vega10 or newer */
+	if (kfddev->device_info->asic_family < CHIP_VEGA10)
+		return -EINVAL;
+
+	pgmap = &kfddev->pgmap;
+	memset(pgmap, 0, sizeof(*pgmap));
+
+	/* TODO: register all vram to HMM for now.
+ * should remove reserved size + */ + size = ALIGN(adev->gmc.real_vram_size, 2ULL << 20); + res = devm_request_free_mem_region(adev->dev, &iomem_resource, size); + if (IS_ERR(res)) + return -ENOMEM; + + pgmap->type = MEMORY_DEVICE_PRIVATE; + pgmap->res = *res; + pgmap->ops = &svm_migrate_pgmap_ops; + pgmap->owner = adev; + pgmap->flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE; + r = devm_memremap_pages(adev->dev, pgmap); + if (IS_ERR(r)) { + pr_err("failed to register HMM device memory\n"); + return PTR_ERR(r); + } + + pr_info("HMM registered %ldMB device memory\n", size >> 20); + + return 0; +} + +void svm_migrate_fini(struct amdgpu_device *adev) +{ + memunmap_pages(&adev->kfd.dev->pgmap); +} diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h new file mode 100644 index 000000000000..98ab685d3e17 --- /dev/null +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h @@ -0,0 +1,48 @@ +/* + * Copyright 2020 Advanced Micro Devices, Inc. + * + * Permission is hereby granted, free of charge, to any person obtaining a + * copy of this software and associated documentation files (the "Software"), + * to deal in the Software without restriction, including without limitation + * the rights to use, copy, modify, merge, publish, distribute, sublicense, + * and/or sell copies of the Software, and to permit persons to whom the + * Software is furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL + * THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR + * OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, + * ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR + * OTHER DEALINGS IN THE SOFTWARE. + * + */ + +#ifndef KFD_MIGRATE_H_ +#define KFD_MIGRATE_H_ + +#include <linux/rwsem.h> +#include <linux/list.h> +#include <linux/mutex.h> +#include <linux/sched/mm.h> +#include <linux/hmm.h> +#include "kfd_priv.h" +#include "kfd_svm.h" + +#if defined(CONFIG_DEVICE_PRIVATE) +int svm_migrate_init(struct amdgpu_device *adev); +void svm_migrate_fini(struct amdgpu_device *adev); + +#else +static inline int svm_migrate_init(struct amdgpu_device *adev) +{ + DRM_WARN_ONCE("DEVICE_PRIVATE kernel config option is not enabled, " + "add CONFIG_DEVICE_PRIVATE=y in config file to fix\n"); + return -ENODEV; +} +static inline void svm_migrate_fini(struct amdgpu_device *adev) {} +#endif +#endif /* KFD_MIGRATE_H_ */ diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index 7a4b4b6dcf32..d5367e770b39 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -317,6 +317,9 @@ struct kfd_dev { unsigned int max_doorbell_slices;
int noretry; + + /* HMM page migration MEMORY_DEVICE_PRIVATE mapping */ + struct dev_pagemap pgmap; };
enum kfd_mempool {
On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:
From: Philip Yang Philip.Yang@amd.com
Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to allocate vram backing pages for page migration.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com
So maybe I'm getting this all wrong, but I think that the current ttm fault code relies on devmap pte entries (especially for hugepte entries) to stop get_user_pages. But this only works if the pte happens to not point at a range with devmap pages.
This patch here changes that, and so probably breaks this devmap pte hack ttm is using?
If I'm not wrong here then I think we need to first fix up the ttm code to not use the devmap hack anymore, before a ttm based driver can register a dev_pagemap. Also adding Thomas since that just came up in another discussion. -Daniel
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On 3/1/21 9:32 AM, Daniel Vetter wrote:
On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:
From: Philip Yang Philip.Yang@amd.com
Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to allocate vram backing pages for page migration.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com
So maybe I'm getting this all wrong, but I think that the current ttm fault code relies on devmap pte entries (especially for hugepte entries) to stop get_user_pages. But this only works if the pte happens to not point at a range with devmap pages.
I don't think that's in TTM yet, but the proposed fix, yes (see email I just sent in another thread), but only for huge ptes.
This patch here changes that, and so probably breaks this devmap pte hack ttm is using?
If I'm not wrong here then I think we need to first fix up the ttm code to not use the devmap hack anymore, before a ttm based driver can register a dev_pagemap. Also adding Thomas since that just came up in another discussion.
It doesn't break the ttm devmap hack per se, but it does allow gup to the registered range. Here's where I lack understanding, though: why can't we allow gup-ing TTM ptes if there is indeed a backing struct page? Because registering MEMORY_DEVICE_PRIVATE implies that, right?
/Thomas
-Daniel
On Mon, Mar 01, 2021 at 09:46:44AM +0100, Thomas Hellström (Intel) wrote:
On 3/1/21 9:32 AM, Daniel Vetter wrote:
On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:
From: Philip Yang Philip.Yang@amd.com
Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to allocate vram backing pages for page migration.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com
So maybe I'm getting this all wrong, but I think that the current ttm fault code relies on devmap pte entries (especially for hugepte entries) to stop get_user_pages. But this only works if the pte happens to not point at a range with devmap pages.
I don't think that's in TTM yet, but the proposed fix, yes (see email I just sent in another thread), but only for huge ptes.
This patch here changes that, and so probably breaks this devmap pte hack ttm is using?
If I'm not wrong here then I think we need to first fix up the ttm code to not use the devmap hack anymore, before a ttm based driver can register a dev_pagemap. Also adding Thomas since that just came up in another discussion.
It doesn't break the ttm devmap hack per se, but it indeed allows gup to the range registered, but here's where my lack of understanding why we can't allow gup-ing TTM ptes if there indeed is a backing struct-page? Because registering MEMORY_DEVICE_PRIVATE implies that, right?
We need to keep supporting buffer based memory management for all the non-compute users. Because those require end-of-batch dma_fence semantics, which prevents us from using gpu page faults, which makes hmm not really work.
And for buffer based memory manager we can't have gup pin random pages in there, that's not really how it works. Worst case ttm just assumes it can actually move buffers and reallocate them as it sees fit, and your gup mapping (for direct i/o or whatever) now points at a page of a buffer that you don't even own anymore. That's not good. Hence also all the discussions about preventing gup for bo mappings in general.
Once we throw hmm into the mix we need to be really careful that the two worlds don't collide. Pure hmm is fine, pure bo managed memory is fine, mixing them is tricky. -Daniel
On 3/1/21 9:58 AM, Daniel Vetter wrote:
On Mon, Mar 01, 2021 at 09:46:44AM +0100, Thomas Hellström (Intel) wrote:
On 3/1/21 9:32 AM, Daniel Vetter wrote:
On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:
From: Philip Yang Philip.Yang@amd.com
Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to allocate vram backing pages for page migration.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com
So maybe I'm getting this all wrong, but I think that the current ttm fault code relies on devmap pte entries (especially for hugepte entries) to stop get_user_pages. But this only works if the pte happens to not point at a range with devmap pages.
I don't think that's in TTM yet, but the proposed fix, yes (see email I just sent in another thread), but only for huge ptes.
This patch here changes that, and so probably breaks this devmap pte hack ttm is using?
If I'm not wrong here then I think we need to first fix up the ttm code to not use the devmap hack anymore, before a ttm based driver can register a dev_pagemap. Also adding Thomas since that just came up in another discussion.
It doesn't break the ttm devmap hack per se, but it indeed allows gup to the range registered, but here's where my lack of understanding why we can't allow gup-ing TTM ptes if there indeed is a backing struct-page? Because registering MEMORY_DEVICE_PRIVATE implies that, right?
We need to keep supporting buffer based memory management for all the non-compute users. Because those require end-of-batch dma_fence semantics, which prevents us from using gpu page faults, which makes hmm not really work.
And for buffer based memory manager we can't have gup pin random pages in there, that's not really how it works. Worst case ttm just assumes it can actually move buffers and reallocate them as it sees fit, and your gup mapping (for direct i/o or whatever) now points at a page of a buffer that you don't even own anymore. That's not good. Hence also all the discussions about preventing gup for bo mappings in general.
Once we throw hmm into the mix we need to be really careful that the two worlds don't collide. Pure hmm is fine, pure bo managed memory is fine, mixing them is tricky. -Daniel
Hmm, OK so then registering MEMORY_DEVICE_PRIVATE means we can't set pxx_devmap because that would allow gup, which, in turn, means no huge TTM ptes.
/Thomas
Am 2021-03-01 um 3:46 a.m. schrieb Thomas Hellström (Intel):
On 3/1/21 9:32 AM, Daniel Vetter wrote:
On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:
From: Philip Yang Philip.Yang@amd.com
Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to allocate vram backing pages for page migration.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com
So maybe I'm getting this all wrong, but I think that the current ttm fault code relies on devmap pte entries (especially for hugepte entries) to stop get_user_pages. But this only works if the pte happens to not point at a range with devmap pages.
I don't think that's in TTM yet, but the proposed fix, yes (see email I just sent in another thread), but only for huge ptes.
This patch here changes that, and so probably breaks this devmap pte hack ttm is using?
If I'm not wrong here then I think we need to first fix up the ttm code to not use the devmap hack anymore, before a ttm based driver can register a dev_pagemap. Also adding Thomas since that just came up in another discussion.
It doesn't break the ttm devmap hack per se, but it indeed allows gup to the range registered, but here's where my lack of understanding why we can't allow gup-ing TTM ptes if there indeed is a backing struct-page? Because registering MEMORY_DEVICE_PRIVATE implies that, right?
I wasn't aware that TTM used devmap at all. If it does, what type of memory does it use?
MEMORY_DEVICE_PRIVATE is like swapped out memory. It cannot be mapped in the CPU page table. GUP would cause a page fault to swap it back into system memory. We are looking into using MEMORY_DEVICE_GENERIC for a future coherent memory architecture, where device memory can be coherently accessed by the CPU and GPU.
As I understand it, our DEVICE_PRIVATE registration is not tied to an actual physical address. Thus your devmap registration and our devmap registration could probably coexist without any conflict. You'll just have the overhead of two sets of struct pages for the same memory.
Regards, Felix
/Thomas
-Daniel
On 3/4/21 6:58 PM, Felix Kuehling wrote:
Am 2021-03-01 um 3:46 a.m. schrieb Thomas Hellström (Intel):
On 3/1/21 9:32 AM, Daniel Vetter wrote:
On Wed, Jan 06, 2021 at 10:01:09PM -0500, Felix Kuehling wrote:
From: Philip Yang Philip.Yang@amd.com
Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to allocate vram backing pages for page migration.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com
So maybe I'm getting this all wrong, but I think that the current ttm fault code relies on devmap pte entries (especially for hugepte entries) to stop get_user_pages. But this only works if the pte happens to not point at a range with devmap pages.
I don't think that's in TTM yet, but the proposed fix, yes (see email I just sent in another thread), but only for huge ptes.
This patch here changes that, and so probably breaks this devmap pte hack ttm is using?
If I'm not wrong here then I think we need to first fix up the ttm code to not use the devmap hack anymore, before a ttm based driver can register a dev_pagemap. Also adding Thomas since that just came up in another discussion.
It doesn't break the ttm devmap hack per se, but it indeed allows gup to the range registered, but here's where my lack of understanding why we can't allow gup-ing TTM ptes if there indeed is a backing struct-page? Because registering MEMORY_DEVICE_PRIVATE implies that, right?
I wasn't aware that TTM used devmap at all. If it does, what type of memory does it use?
MEMORY_DEVICE_PRIVATE is like swapped out memory. It cannot be mapped in the CPU page table. GUP would cause a page fault to swap it back into system memory. We are looking into using MEMORY_DEVICE_GENERIC for a future coherent memory architecture, where device memory can be coherently accessed by the CPU and GPU.
As I understand it, our DEVICE_PRIVATE registration is not tied to an actual physical address. Thus your devmap registration and our devmap registration could probably coexist without any conflict. You'll just have the overhead of two sets of struct pages for the same memory.
Regards, Felix
Hi, Felix. TTM doesn't use devmap yet, but we're thinking of using it to fake pmd_special(), which isn't available. That would mean pmd_devmap() + no_registered_dev_pagemap meaning special in the sense documented by vm_normal_page(). The implication here would be that if you register memory like above, TTM would never be able to set up a huge page table entry to it. But it sounds like that's not an issue?
/Thomas
/Thomas
-Daniel
From: Philip Yang Philip.Yang@amd.com
If the svm range prefetch location is not zero, use TTM to allocate amdgpu_bo vram nodes to validate the svm range, then map the vram nodes to GPUs.
Use an offset to sub-allocate from the same amdgpu_bo to handle overlapping vram ranges while adding a new range or unmapping a range.
svm_bo has a ref count to track the shared ranges. If all ranges sharing an amdgpu_bo are migrated to ram, the ref count becomes 0 and the amdgpu_bo is released; the svm_bo pointer of all ranges is set to NULL.
To migrate a range from ram back to vram, allocate the same amdgpu_bo at the previous offset if the range still has an svm_bo.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 342 ++++++++++++++++++++++++--- drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 20 ++ 2 files changed, 335 insertions(+), 27 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index e3ba6e7262a7..7d91dc49a5a9 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -35,7 +35,9 @@ * svm_range_unlink - unlink svm_range from lists and interval tree * @prange: svm range structure to be removed * - * Remove the svm range from svms interval tree and link list + * Remove the svm_range from the svms and svm_bo SRCU lists and the svms + * interval tree. After this call, synchronize_srcu is needed before the + * range can be freed safely. * * Context: The caller must hold svms_lock */ @@ -44,6 +46,12 @@ static void svm_range_unlink(struct svm_range *prange) pr_debug("prange 0x%p [0x%lx 0x%lx]\n", prange, prange->it_node.start, prange->it_node.last);
+ if (prange->svm_bo) { + spin_lock(&prange->svm_bo->list_lock); + list_del(&prange->svm_bo_list); + spin_unlock(&prange->svm_bo->list_lock); + } + list_del_rcu(&prange->list); interval_tree_remove(&prange->it_node, &prange->svms->objects); } @@ -70,6 +78,12 @@ static void svm_range_remove(struct svm_range *prange) pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, prange->it_node.start, prange->it_node.last);
+ if (prange->mm_nodes) { + pr_debug("vram prange svms 0x%p [0x%lx 0x%lx]\n", prange->svms, + prange->it_node.start, prange->it_node.last); + svm_range_vram_node_free(prange); + } + kvfree(prange->pages_addr); kfree(prange); } @@ -102,7 +116,9 @@ svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start, INIT_LIST_HEAD(&prange->list); INIT_LIST_HEAD(&prange->update_list); INIT_LIST_HEAD(&prange->remove_list); + INIT_LIST_HEAD(&prange->svm_bo_list); atomic_set(&prange->invalid, 0); + spin_lock_init(&prange->svm_bo_lock); svm_range_set_default_attributes(&prange->preferred_loc, &prange->prefetch_loc, &prange->granularity, &prange->flags); @@ -139,6 +155,16 @@ svm_get_supported_dev_by_id(struct kfd_process *p, uint32_t gpu_id, return dev; }
+struct amdgpu_device * +svm_range_get_adev_by_id(struct svm_range *prange, uint32_t gpu_id) +{ + struct kfd_process *p = + container_of(prange->svms, struct kfd_process, svms); + struct kfd_dev *dev = svm_get_supported_dev_by_id(p, gpu_id, NULL); + + return dev ? (struct amdgpu_device *)dev->kgd : NULL; +} + /** * svm_range_validate_ram - get system memory pages of svm range * @@ -186,14 +212,226 @@ svm_range_validate_ram(struct mm_struct *mm, struct svm_range *prange) return 0; }
+static bool svm_bo_ref_unless_zero(struct svm_range_bo *svm_bo) +{ + if (!svm_bo || !kref_get_unless_zero(&svm_bo->kref)) + return false; + + return true; +} + +static struct svm_range_bo *svm_range_bo_ref(struct svm_range_bo *svm_bo) +{ + if (svm_bo) + kref_get(&svm_bo->kref); + + return svm_bo; +} + +static void svm_range_bo_release(struct kref *kref) +{ + struct svm_range_bo *svm_bo; + + svm_bo = container_of(kref, struct svm_range_bo, kref); + /* This cleanup loop does not need to be SRCU safe because there + * should be no SRCU readers while the ref count is 0. Any SRCU + * reader that has a chance of reducing the ref count must take + * an extra reference before srcu_read_lock and release it after + * srcu_read_unlock. + */ + spin_lock(&svm_bo->list_lock); + while (!list_empty(&svm_bo->range_list)) { + struct svm_range *prange = + list_first_entry(&svm_bo->range_list, + struct svm_range, svm_bo_list); + pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, + prange->it_node.start, prange->it_node.last); + spin_lock(&prange->svm_bo_lock); + prange->svm_bo = NULL; + spin_unlock(&prange->svm_bo_lock); + + /* list_del_init tells a concurrent svm_range_vram_node_new when + * it's safe to reuse the svm_bo pointer and svm_bo_list head. 
+ */ + list_del_init(&prange->svm_bo_list); + } + spin_unlock(&svm_bo->list_lock); + + amdgpu_bo_unref(&svm_bo->bo); + kfree(svm_bo); +} + +static void svm_range_bo_unref(struct svm_range_bo *svm_bo) +{ + if (!svm_bo) + return; + + kref_put(&svm_bo->kref, svm_range_bo_release); +} + +static struct svm_range_bo *svm_range_bo_new(void) +{ + struct svm_range_bo *svm_bo; + + svm_bo = kzalloc(sizeof(*svm_bo), GFP_KERNEL); + if (!svm_bo) + return NULL; + + kref_init(&svm_bo->kref); + INIT_LIST_HEAD(&svm_bo->range_list); + spin_lock_init(&svm_bo->list_lock); + + return svm_bo; +} + +int +svm_range_vram_node_new(struct amdgpu_device *adev, struct svm_range *prange, + bool clear) +{ + struct amdkfd_process_info *process_info; + struct amdgpu_bo_param bp; + struct svm_range_bo *svm_bo; + struct amdgpu_bo *bo; + struct kfd_process *p; + int r; + + pr_debug("[0x%lx 0x%lx]\n", prange->it_node.start, + prange->it_node.last); + spin_lock(&prange->svm_bo_lock); + if (prange->svm_bo) { + if (prange->mm_nodes) { + /* We still have a reference, all is well */ + spin_unlock(&prange->svm_bo_lock); + return 0; + } + if (svm_bo_ref_unless_zero(prange->svm_bo)) { + /* The BO was still around and we got + * a new reference to it + */ + spin_unlock(&prange->svm_bo_lock); + pr_debug("reuse old bo [0x%lx 0x%lx]\n", + prange->it_node.start, prange->it_node.last); + + prange->mm_nodes = prange->svm_bo->bo->tbo.mem.mm_node; + return 0; + } + + spin_unlock(&prange->svm_bo_lock); + + /* We need a new svm_bo. Spin-loop to wait for concurrent + * svm_range_bo_release to finish removing this range from + * its range list. After this, it is safe to reuse the + * svm_bo pointer and svm_bo_list head. 
+ */ + while (!list_empty_careful(&prange->svm_bo_list)) + ; + + } else { + spin_unlock(&prange->svm_bo_lock); + } + + svm_bo = svm_range_bo_new(); + if (!svm_bo) { + pr_debug("failed to alloc svm bo\n"); + return -ENOMEM; + } + + memset(&bp, 0, sizeof(bp)); + bp.size = prange->npages * PAGE_SIZE; + bp.byte_align = PAGE_SIZE; + bp.domain = AMDGPU_GEM_DOMAIN_VRAM; + bp.flags = AMDGPU_GEM_CREATE_NO_CPU_ACCESS; + bp.flags |= clear ? AMDGPU_GEM_CREATE_VRAM_CLEARED : 0; + bp.type = ttm_bo_type_device; + bp.resv = NULL; + + r = amdgpu_bo_create(adev, &bp, &bo); + if (r) { + pr_debug("failed %d to create bo\n", r); + kfree(svm_bo); + return r; + } + + p = container_of(prange->svms, struct kfd_process, svms); + r = amdgpu_bo_reserve(bo, true); + if (r) { + pr_debug("failed %d to reserve bo\n", r); + goto reserve_bo_failed; + } + + r = dma_resv_reserve_shared(bo->tbo.base.resv, 1); + if (r) { + pr_debug("failed %d to reserve bo\n", r); + amdgpu_bo_unreserve(bo); + goto reserve_bo_failed; + } + process_info = p->kgd_process_info; + amdgpu_bo_fence(bo, &process_info->eviction_fence->base, true); + + amdgpu_bo_unreserve(bo); + + svm_bo->bo = bo; + prange->svm_bo = svm_bo; + prange->mm_nodes = bo->tbo.mem.mm_node; + prange->offset = 0; + + spin_lock(&svm_bo->list_lock); + list_add(&prange->svm_bo_list, &svm_bo->range_list); + spin_unlock(&svm_bo->list_lock); + + return 0; + +reserve_bo_failed: + kfree(svm_bo); + amdgpu_bo_unref(&bo); + prange->mm_nodes = NULL; + + return r; +} + +void svm_range_vram_node_free(struct svm_range *prange) +{ + pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, + prange->it_node.start, prange->it_node.last); + + svm_range_bo_unref(prange->svm_bo); + prange->mm_nodes = NULL; +} + +static int svm_range_validate_vram(struct svm_range *prange) +{ + struct amdgpu_device *adev; + int r; + + pr_debug("svms 0x%p [0x%lx 0x%lx] actual_loc 0x%x\n", prange->svms, + prange->it_node.start, prange->it_node.last, + prange->actual_loc); + + adev = 
svm_range_get_adev_by_id(prange, prange->actual_loc); + if (!adev) { + pr_debug("failed to get device by id 0x%x\n", + prange->actual_loc); + return -EINVAL; + } + + r = svm_range_vram_node_new(adev, prange, true); + if (r) + pr_debug("failed %d to alloc vram\n", r); + + return r; +} + static int svm_range_validate(struct mm_struct *mm, struct svm_range *prange) { - int r = 0; + int r;
pr_debug("actual loc 0x%x\n", prange->actual_loc);
- r = svm_range_validate_ram(mm, prange); + if (!prange->actual_loc) + r = svm_range_validate_ram(mm, prange); + else + r = svm_range_validate_vram(prange);
pr_debug("svms 0x%p [0x%lx 0x%lx] ret %d invalid %d\n", prange->svms, prange->it_node.start, prange->it_node.last, @@ -349,6 +587,35 @@ svm_range_split_pages(struct svm_range *new, struct svm_range *old, return 0; }
+static int +svm_range_split_nodes(struct svm_range *new, struct svm_range *old, + uint64_t start, uint64_t last) +{ + pr_debug("svms 0x%p new start 0x%lx start 0x%llx last 0x%llx\n", + new->svms, new->it_node.start, start, last); + + old->npages = last - start + 1; + + if (new->it_node.start == old->it_node.start) { + new->offset = old->offset; + old->offset += new->npages; + } else { + new->offset = old->offset + old->npages; + } + + old->it_node.start = start; + old->it_node.last = last; + + new->svm_bo = svm_range_bo_ref(old->svm_bo); + new->mm_nodes = old->mm_nodes; + + spin_lock(&new->svm_bo->list_lock); + list_add(&new->svm_bo_list, &new->svm_bo->range_list); + spin_unlock(&new->svm_bo->list_lock); + + return 0; +} + /** * svm_range_split_adjust - split range and adjust * @@ -382,6 +649,8 @@ svm_range_split_adjust(struct svm_range *new, struct svm_range *old,
if (old->pages_addr) r = svm_range_split_pages(new, old, start, last); + else if (old->actual_loc && old->mm_nodes) + r = svm_range_split_nodes(new, old, start, last); else WARN_ONCE(1, "split adjust invalid pages_addr and nodes\n"); if (r) @@ -438,17 +707,14 @@ svm_range_split(struct svm_range *prange, uint64_t start, uint64_t last, return -EINVAL;
svms = prange->svms; - if (old_start == start) { + if (old_start == start) *new = svm_range_new(svms, last + 1, old_last); - if (!*new) - return -ENOMEM; - r = svm_range_split_adjust(*new, prange, start, last); - } else { + else *new = svm_range_new(svms, old_start, start - 1); - if (!*new) - return -ENOMEM; - r = svm_range_split_adjust(*new, prange, start, last); - } + if (!*new) + return -ENOMEM; + + r = svm_range_split_adjust(*new, prange, start, last);
return r; } @@ -550,7 +816,8 @@ svm_range_get_pte_flags(struct amdgpu_device *adev, struct svm_range *prange) uint64_t pte_flags;
pte_flags = AMDGPU_PTE_VALID; - pte_flags |= AMDGPU_PTE_SYSTEM | AMDGPU_PTE_SNOOPED; + if (!prange->mm_nodes) + pte_flags |= AMDGPU_PTE_SYSTEM | AMDGPU_PTE_SNOOPED;
mapping_flags = AMDGPU_VM_PAGE_READABLE | AMDGPU_VM_PAGE_WRITEABLE;
@@ -570,7 +837,9 @@ svm_range_get_pte_flags(struct amdgpu_device *adev, struct svm_range *prange) /* Apply ASIC specific mapping flags */ amdgpu_gmc_get_vm_pte(adev, &prange->mapping, &pte_flags);
- pr_debug("PTE flags 0x%llx\n", pte_flags); + pr_debug("svms 0x%p [0x%lx 0x%lx] vram %d system %d PTE flags 0x%llx\n", + prange->svms, prange->it_node.start, prange->it_node.last, + prange->mm_nodes ? 1:0, prange->pages_addr ? 1:0, pte_flags);
return pte_flags; } @@ -656,7 +925,9 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm, struct svm_range *prange, bool reserve_vm, struct dma_fence **fence) { - struct amdgpu_bo *root; + struct ttm_validate_buffer tv[2]; + struct ww_acquire_ctx ticket; + struct list_head list; dma_addr_t *pages_addr; uint64_t pte_flags; int r = 0; @@ -665,13 +936,25 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm, prange->it_node.start, prange->it_node.last);
if (reserve_vm) { - root = amdgpu_bo_ref(vm->root.base.bo); - r = amdgpu_bo_reserve(root, true); + INIT_LIST_HEAD(&list); + + tv[0].bo = &vm->root.base.bo->tbo; + tv[0].num_shared = 4; + list_add(&tv[0].head, &list); + if (prange->svm_bo && prange->mm_nodes) { + tv[1].bo = &prange->svm_bo->bo->tbo; + tv[1].num_shared = 1; + list_add(&tv[1].head, &list); + } + r = ttm_eu_reserve_buffers(&ticket, &list, true, NULL); if (r) { - pr_debug("failed %d to reserve root bo\n", r); - amdgpu_bo_unref(&root); + pr_debug("failed %d to reserve bo\n", r); goto out; } + if (prange->svm_bo && prange->mm_nodes && + prange->svm_bo->bo->tbo.evicted) + goto unreserve_out; + r = amdgpu_vm_validate_pt_bos(adev, vm, svm_range_bo_validate, NULL); if (r) { @@ -682,7 +965,7 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm,
prange->mapping.start = prange->it_node.start; prange->mapping.last = prange->it_node.last; - prange->mapping.offset = 0; + prange->mapping.offset = prange->offset; pte_flags = svm_range_get_pte_flags(adev, prange); prange->mapping.flags = pte_flags; pages_addr = prange->pages_addr; @@ -690,7 +973,8 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm, r = amdgpu_vm_bo_update_mapping(adev, adev, vm, false, false, NULL, prange->mapping.start, prange->mapping.last, pte_flags, - prange->mapping.offset, NULL, + prange->mapping.offset, + prange->mm_nodes, pages_addr, &vm->last_update); if (r) { pr_debug("failed %d to map to gpu 0x%lx\n", r, @@ -710,11 +994,8 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm, *fence = dma_fence_get(vm->last_update);
unreserve_out: - if (reserve_vm) { - amdgpu_bo_unreserve(root); - amdgpu_bo_unref(&root); - } - + if (reserve_vm) + ttm_eu_backoff_reservation(&ticket, &list); out: return r; } @@ -929,7 +1210,14 @@ struct svm_range *svm_range_clone(struct svm_range *old) memcpy(new->pages_addr, old->pages_addr, old->npages * sizeof(*old->pages_addr)); } - + if (old->svm_bo) { + new->mm_nodes = old->mm_nodes; + new->offset = old->offset; + new->svm_bo = svm_range_bo_ref(old->svm_bo); + spin_lock(&new->svm_bo->list_lock); + list_add(&new->svm_bo_list, &new->svm_bo->range_list); + spin_unlock(&new->svm_bo->list_lock); + } new->flags = old->flags; new->preferred_loc = old->preferred_loc; new->prefetch_loc = old->prefetch_loc; diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h index 4c7daf8e0b6f..b1d2db02043b 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h @@ -32,6 +32,12 @@ #include "amdgpu.h" #include "kfd_priv.h"
+struct svm_range_bo { + struct amdgpu_bo *bo; + struct kref kref; + struct list_head range_list; /* all svm ranges sharing this bo */ + spinlock_t list_lock; +}; /** * struct svm_range - shared virtual memory range * @@ -45,6 +51,10 @@ * @mapping: bo_va mapping structure to create and update GPU page table * @npages: number of pages * @pages_addr: list of system memory physical page address + * @mm_nodes: vram nodes allocated + * @offset: range start offset within mm_nodes + * @svm_bo: struct to manage the split amdgpu_bo + * @svm_bo_list: link list node, to scan all ranges which share the same svm_bo * @flags: flags defined as KFD_IOCTL_SVM_FLAG_* * @preferred_loc: preferred location, 0 for CPU, or GPU id * @prefetch_loc: last prefetch location, 0 for CPU, or GPU id @@ -68,6 +78,11 @@ struct svm_range { struct amdgpu_bo_va_mapping mapping; uint64_t npages; dma_addr_t *pages_addr; + struct drm_mm_node *mm_nodes; + uint64_t offset; + struct svm_range_bo *svm_bo; + struct list_head svm_bo_list; + spinlock_t svm_bo_lock; uint32_t flags; uint32_t preferred_loc; uint32_t prefetch_loc; @@ -95,5 +110,10 @@ void svm_range_list_fini(struct kfd_process *p); int svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start, uint64_t size, uint32_t nattrs, struct kfd_ioctl_svm_attribute *attrs); +struct amdgpu_device *svm_range_get_adev_by_id(struct svm_range *prange, + uint32_t id); +int svm_range_vram_node_new(struct amdgpu_device *adev, + struct svm_range *prange, bool clear); +void svm_range_vram_node_free(struct svm_range *prange);
#endif /* KFD_SVM_H_ */
From: Philip Yang Philip.Yang@amd.com
amdgpu_gmc_get_vm_pte uses the bo_va->is_xgmi same-hive information to set the pte flags when updating the GPU mapping. Add a local structure variable bo_va, set bo_va.is_xgmi, and pass it to mapping->bo_va while mapping to the GPU.
Assuming xgmi pstate is hi after boot.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 27 ++++++++++++++++++++++++--- 1 file changed, 24 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 7d91dc49a5a9..8a4d0a3935b6 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -26,6 +26,8 @@ #include "amdgpu_object.h" #include "amdgpu_vm.h" #include "amdgpu_mn.h" +#include "amdgpu.h" +#include "amdgpu_xgmi.h" #include "kfd_priv.h" #include "kfd_svm.h"
@@ -923,10 +925,11 @@ static int svm_range_bo_validate(void *param, struct amdgpu_bo *bo) static int svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm, struct svm_range *prange, bool reserve_vm, - struct dma_fence **fence) + struct amdgpu_device *bo_adev, struct dma_fence **fence) { struct ttm_validate_buffer tv[2]; struct ww_acquire_ctx ticket; + struct amdgpu_bo_va bo_va; struct list_head list; dma_addr_t *pages_addr; uint64_t pte_flags; @@ -963,6 +966,11 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm, } }
+ if (prange->svm_bo && prange->mm_nodes) { + bo_va.is_xgmi = amdgpu_xgmi_same_hive(adev, bo_adev); + prange->mapping.bo_va = &bo_va; + } + prange->mapping.start = prange->it_node.start; prange->mapping.last = prange->it_node.last; prange->mapping.offset = prange->offset; @@ -970,7 +978,7 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm, prange->mapping.flags = pte_flags; pages_addr = prange->pages_addr;
- r = amdgpu_vm_bo_update_mapping(adev, adev, vm, false, false, NULL, + r = amdgpu_vm_bo_update_mapping(adev, bo_adev, vm, false, false, NULL, prange->mapping.start, prange->mapping.last, pte_flags, prange->mapping.offset, @@ -994,6 +1002,7 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm, *fence = dma_fence_get(vm->last_update);
unreserve_out: + prange->mapping.bo_va = NULL; if (reserve_vm) ttm_eu_backoff_reservation(&ticket, &list); out: @@ -1004,6 +1013,7 @@ static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm) { DECLARE_BITMAP(bitmap, MAX_GPU_INSTANCE); struct kfd_process_device *pdd; + struct amdgpu_device *bo_adev; struct amdgpu_device *adev; struct kfd_process *p; struct kfd_dev *dev; @@ -1011,6 +1021,11 @@ static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm) uint32_t gpuidx; int r = 0;
+ if (prange->svm_bo && prange->mm_nodes) + bo_adev = amdgpu_ttm_adev(prange->svm_bo->bo->tbo.bdev); + else + bo_adev = NULL; + bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip, MAX_GPU_INSTANCE); p = container_of(prange->svms, struct kfd_process, svms); @@ -1027,8 +1042,14 @@ static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm) return -EINVAL; adev = (struct amdgpu_device *)dev->kgd;
+ if (bo_adev && adev != bo_adev && + !amdgpu_xgmi_same_hive(adev, bo_adev)) { + pr_debug("cannot map to device idx %d\n", gpuidx); + continue; + } + r = svm_range_map_to_gpu(adev, pdd->vm, prange, reserve_vm, - &fence); + bo_adev, &fence); if (r) break;
From: Philip Yang Philip.Yang@amd.com
Use sdma linear copy to migrate data between ram and vram. The sdma linear copy command uses the kernel buffer function queue to access system memory through the gart table.
Use the reserved gart table window 0 to map system page addresses; vram page addresses use direct mapping. The same kernel buffer function is used to fill in the gart table mapping, so it is serialized with the memory copy by the sdma job submission. For larger buffer migrations we only need to wait for the sdma fence of the last memory copy.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 172 +++++++++++++++++++++++ drivers/gpu/drm/amd/amdkfd/kfd_migrate.h | 5 + 2 files changed, 177 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index 1950b86f1562..f2019c8f0b80 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -32,6 +32,178 @@ #include "kfd_svm.h" #include "kfd_migrate.h"
+static uint64_t +svm_migrate_direct_mapping_addr(struct amdgpu_device *adev, uint64_t addr) +{ + return addr + amdgpu_ttm_domain_start(adev, TTM_PL_VRAM); +} + +static int +svm_migrate_gart_map(struct amdgpu_ring *ring, uint64_t npages, + uint64_t *addr, uint64_t *gart_addr, uint64_t flags) +{ + struct amdgpu_device *adev = ring->adev; + struct amdgpu_job *job; + unsigned int num_dw, num_bytes; + struct dma_fence *fence; + uint64_t src_addr, dst_addr; + uint64_t pte_flags; + void *cpu_addr; + int r; + + /* use gart window 0 */ + *gart_addr = adev->gmc.gart_start; + + num_dw = ALIGN(adev->mman.buffer_funcs->copy_num_dw, 8); + num_bytes = npages * 8; + + r = amdgpu_job_alloc_with_ib(adev, num_dw * 4 + num_bytes, + AMDGPU_IB_POOL_DELAYED, &job); + if (r) + return r; + + src_addr = num_dw * 4; + src_addr += job->ibs[0].gpu_addr; + + dst_addr = amdgpu_bo_gpu_offset(adev->gart.bo); + amdgpu_emit_copy_buffer(adev, &job->ibs[0], src_addr, + dst_addr, num_bytes, false); + + amdgpu_ring_pad_ib(ring, &job->ibs[0]); + WARN_ON(job->ibs[0].length_dw > num_dw); + + pte_flags = AMDGPU_PTE_VALID | AMDGPU_PTE_READABLE; + pte_flags |= AMDGPU_PTE_SYSTEM | AMDGPU_PTE_SNOOPED; + if (!(flags & KFD_IOCTL_SVM_FLAG_GPU_RO)) + pte_flags |= AMDGPU_PTE_WRITEABLE; + pte_flags |= adev->gart.gart_pte_flags; + + cpu_addr = &job->ibs[0].ptr[num_dw]; + + r = amdgpu_gart_map(adev, 0, npages, addr, pte_flags, cpu_addr); + if (r) + goto error_free; + + r = amdgpu_job_submit(job, &adev->mman.entity, + AMDGPU_FENCE_OWNER_UNDEFINED, &fence); + if (r) + goto error_free; + + dma_fence_put(fence); + + return r; + +error_free: + amdgpu_job_free(job); + return r; +} + +/** + * svm_migrate_copy_memory_gart - sdma copy data between ram and vram + * + * @adev: amdgpu device the sdma ring running + * @src: source page address array + * @dst: destination page address array + * @npages: number of pages to copy + * @direction: enum MIGRATION_COPY_DIR + * @mfence: output, sdma fence to signal after sdma is done + * + 
* ram address uses GART table continuous entries mapping to ram pages, + * vram address uses direct mapping of vram pages, which must have npages + * number of continuous pages. + * GART update and sdma use the same buffer function ring, so the sdma copy is + * split into multiple GTT_MAX_PAGES transfers and all sdma operations are + * serialized; wait for the last sdma finish fence, which is returned, to check + * that the memory copy is done. + * + * Context: Process context, takes and releases gtt_window_lock + * + * Return: + * 0 - OK, otherwise error code + */ + +static int +svm_migrate_copy_memory_gart(struct amdgpu_device *adev, uint64_t *src, + uint64_t *dst, uint64_t npages, + enum MIGRATION_COPY_DIR direction, + struct dma_fence **mfence) +{ + const uint64_t GTT_MAX_PAGES = AMDGPU_GTT_MAX_TRANSFER_SIZE; + struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring; + uint64_t gart_s, gart_d; + struct dma_fence *next; + uint64_t size; + int r; + + mutex_lock(&adev->mman.gtt_window_lock); + + while (npages) { + size = min(GTT_MAX_PAGES, npages); + + if (direction == FROM_VRAM_TO_RAM) { + gart_s = svm_migrate_direct_mapping_addr(adev, *src); + r = svm_migrate_gart_map(ring, size, dst, &gart_d, 0); + + } else if (direction == FROM_RAM_TO_VRAM) { + r = svm_migrate_gart_map(ring, size, src, &gart_s, + KFD_IOCTL_SVM_FLAG_GPU_RO); + gart_d = svm_migrate_direct_mapping_addr(adev, *dst); + } + if (r) { + pr_debug("failed %d to create gart mapping\n", r); + goto out_unlock; + } + + r = amdgpu_copy_buffer(ring, gart_s, gart_d, size * PAGE_SIZE, + NULL, &next, false, true, false); + if (r) { + pr_debug("failed %d to copy memory\n", r); + goto out_unlock; + } + + dma_fence_put(*mfence); + *mfence = next; + npages -= size; + if (npages) { + src += size; + dst += size; + } + } + +out_unlock: + mutex_unlock(&adev->mman.gtt_window_lock); + + return r; +} + +/** + * svm_migrate_copy_done - wait for the sdma memory copy to be done + * + * @adev: amdgpu device the sdma memory copy is executing on + * @mfence: migrate fence + * + * Wait for the dma fence to be signaled; if the copy was split into multiple + * sdma operations, this is the last sdma operation fence. + * + * Context: called after svm_migrate_copy_memory + * + * Return: + * 0 - success + * otherwise - error code from dma fence signal + */ +int +svm_migrate_copy_done(struct amdgpu_device *adev, struct dma_fence *mfence) +{ + int r = 0; + + if (mfence) { + r = dma_fence_wait(mfence, false); + dma_fence_put(mfence); + pr_debug("sdma copy memory fence done\n"); + } + + return r; +} + static void svm_migrate_page_free(struct page *page) { } diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h index 98ab685d3e17..5db5686fa46a 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h @@ -32,6 +32,11 @@ #include "kfd_priv.h" #include "kfd_svm.h"
+enum MIGRATION_COPY_DIR { + FROM_RAM_TO_VRAM = 0, + FROM_VRAM_TO_RAM +}; + #if defined(CONFIG_DEVICE_PRIVATE) int svm_migrate_init(struct amdgpu_device *adev); void svm_migrate_fini(struct amdgpu_device *adev);
From: Philip Yang Philip.Yang@amd.com
Registering an svm range with the same address and size, but with the preferred location changed from CPU to GPU or from GPU to CPU, triggers migration of the svm range from ram to vram or from vram to ram.
If the svm range prefetch location is a GPU and the KFD_IOCTL_SVM_FLAG_HOST_ACCESS flag is set, validate the svm range on ram first, then migrate it from ram to vram.
After the migration to vram is done, CPU access will raise a CPU page fault; the page fault handler migrates the range back to ram and resumes CPU access.
Migration steps:
1. migrate_vma_setup collects the svm range's ram pages, notifies that the interval is invalidated and unmaps it from the CPU page table; the HMM interval notifier callback evicts the process queues
2. Allocate new pages in vram using TTM
3. Use svm copy memory to sdma-copy data from ram to vram
4. migrate_vma_pages copies the ram page structures to the vram page structures
5. migrate_vma_finalize releases the ram pages and their memory
6. Restore work waits until the migration is finished, then updates the GPUs' page table mapping to the new vram pages and resumes the process queues
If migrate_vma_setup fails to collect all the ram pages of the range, the attempt is retried, up to 3 times, until all pages are collected and the migration can start.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 265 +++++++++++++++++++++++ drivers/gpu/drm/amd/amdkfd/kfd_migrate.h | 2 + drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 175 ++++++++++++++- drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 2 + 4 files changed, 436 insertions(+), 8 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index f2019c8f0b80..af23f0be7eaf 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -204,6 +204,271 @@ svm_migrate_copy_done(struct amdgpu_device *adev, struct dma_fence *mfence) return r; }
+static uint64_t +svm_migrate_node_physical_addr(struct amdgpu_device *adev, + struct drm_mm_node **mm_node, uint64_t *offset) +{ + struct drm_mm_node *node = *mm_node; + uint64_t pos = *offset; + + if (node->start == AMDGPU_BO_INVALID_OFFSET) { + pr_debug("drm node is not validated\n"); + return 0; + } + + pr_debug("vram node start 0x%llx npages 0x%llx\n", node->start, + node->size); + + if (pos >= node->size) { + do { + pos -= node->size; + node++; + } while (pos >= node->size); + + *mm_node = node; + *offset = pos; + } + + return (node->start + pos) << PAGE_SHIFT; +} + +unsigned long +svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr) +{ + return (addr + adev->kfd.dev->pgmap.res.start) >> PAGE_SHIFT; +} + +static void +svm_migrate_get_vram_page(struct svm_range *prange, unsigned long pfn) +{ + struct page *page; + + page = pfn_to_page(pfn); + page->zone_device_data = prange; + get_page(page); + lock_page(page); +} + +static void +svm_migrate_put_vram_page(struct amdgpu_device *adev, unsigned long addr) +{ + struct page *page; + + page = pfn_to_page(svm_migrate_addr_to_pfn(adev, addr)); + unlock_page(page); + put_page(page); +} + + +static int +svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange, + struct migrate_vma *migrate, + struct dma_fence **mfence) +{ + uint64_t npages = migrate->cpages; + struct drm_mm_node *node; + uint64_t *src, *dst; + uint64_t vram_addr; + uint64_t offset; + uint64_t i, j; + int r = -ENOMEM; + + pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, + prange->it_node.start, prange->it_node.last); + + src = kvmalloc_array(npages << 1, sizeof(*src), GFP_KERNEL); + if (!src) + goto out; + dst = src + npages; + + r = svm_range_vram_node_new(adev, prange, false); + if (r) { + pr_debug("failed %d get 0x%llx pages from vram\n", r, npages); + goto out_free; + } + + node = prange->mm_nodes; + offset = prange->offset; + vram_addr = svm_migrate_node_physical_addr(adev, &node, &offset); + if 
(!vram_addr) { + WARN_ONCE(1, "vram node address is 0\n"); + r = -ENOMEM; + goto out_free; + } + + for (i = j = 0; i < npages; i++) { + struct page *spage; + + spage = migrate_pfn_to_page(migrate->src[i]); + src[i] = page_to_pfn(spage) << PAGE_SHIFT; + + dst[i] = vram_addr + (j << PAGE_SHIFT); + migrate->dst[i] = svm_migrate_addr_to_pfn(adev, dst[i]); + svm_migrate_get_vram_page(prange, migrate->dst[i]); + + migrate->dst[i] = migrate_pfn(migrate->dst[i]); + migrate->dst[i] |= MIGRATE_PFN_LOCKED; + + if (j + offset >= node->size - 1 && i < npages - 1) { + r = svm_migrate_copy_memory_gart(adev, src + i - j, + dst + i - j, j + 1, + FROM_RAM_TO_VRAM, + mfence); + if (r) + goto out_free_vram_pages; + + node++; + pr_debug("next node size 0x%llx\n", node->size); + vram_addr = node->start << PAGE_SHIFT; + offset = 0; + j = 0; + } else { + j++; + } + } + + r = svm_migrate_copy_memory_gart(adev, src + i - j, dst + i - j, j, + FROM_RAM_TO_VRAM, mfence); + if (!r) + goto out_free; + +out_free_vram_pages: + pr_debug("failed %d to copy memory to vram\n", r); + while (i--) { + svm_migrate_put_vram_page(adev, dst[i]); + migrate->dst[i] = 0; + } + +out_free: + kvfree(src); +out: + return r; +} + +static int +svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange, + struct vm_area_struct *vma, uint64_t start, + uint64_t end) +{ + uint64_t npages = (end - start) >> PAGE_SHIFT; + struct dma_fence *mfence = NULL; + struct migrate_vma migrate; + int r = -ENOMEM; + int retry = 0; + + memset(&migrate, 0, sizeof(migrate)); + migrate.vma = vma; + migrate.start = start; + migrate.end = end; + migrate.flags = MIGRATE_VMA_SELECT_SYSTEM; + migrate.pgmap_owner = adev; + + migrate.src = kvmalloc_array(npages << 1, sizeof(*migrate.src), + GFP_KERNEL | __GFP_ZERO); + if (!migrate.src) + goto out; + migrate.dst = migrate.src + npages; + +retry: + r = migrate_vma_setup(&migrate); + if (r) { + pr_debug("failed %d prepare migrate svms 0x%p [0x%lx 0x%lx]\n", + r, prange->svms, 
prange->it_node.start, + prange->it_node.last); + goto out_free; + } + if (migrate.cpages != npages) { + pr_debug("collect 0x%lx/0x%llx pages, retry\n", migrate.cpages, + npages); + migrate_vma_finalize(&migrate); + if (retry++ >= 3) { + r = -ENOMEM; + pr_debug("failed %d migrate svms 0x%p [0x%lx 0x%lx]\n", + r, prange->svms, prange->it_node.start, + prange->it_node.last); + goto out_free; + } + + goto retry; + } + + if (migrate.cpages) { + svm_migrate_copy_to_vram(adev, prange, &migrate, &mfence); + migrate_vma_pages(&migrate); + svm_migrate_copy_done(adev, mfence); + migrate_vma_finalize(&migrate); + } + + kvfree(prange->pages_addr); + prange->pages_addr = NULL; + +out_free: + kvfree(migrate.src); +out: + return r; +} + +/** + * svm_migrate_ram_to_vram - migrate svm range from system to device + * @prange: range structure + * @best_loc: the device to migrate to + * + * Context: Process context, caller hold mm->mmap_sem and prange->lock and take + * svms srcu read lock. + * + * Return: + * 0 - OK, otherwise error code + */ +int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc) +{ + unsigned long addr, start, end; + struct vm_area_struct *vma; + struct amdgpu_device *adev; + struct mm_struct *mm; + int r = 0; + + if (prange->actual_loc == best_loc) { + pr_debug("svms 0x%p [0x%lx 0x%lx] already on best_loc 0x%x\n", + prange->svms, prange->it_node.start, + prange->it_node.last, best_loc); + return 0; + } + + adev = svm_range_get_adev_by_id(prange, best_loc); + if (!adev) { + pr_debug("failed to get device by id 0x%x\n", best_loc); + return -ENODEV; + } + + pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, + prange->it_node.start, prange->it_node.last); + + start = prange->it_node.start << PAGE_SHIFT; + end = (prange->it_node.last + 1) << PAGE_SHIFT; + + mm = current->mm; + + for (addr = start; addr < end;) { + unsigned long next; + + vma = find_vma(mm, addr); + if (!vma || addr < vma->vm_start) + break; + + next = min(vma->vm_end, end); + r = 
svm_migrate_vma_to_vram(adev, prange, vma, addr, next); + if (r) { + pr_debug("failed to migrate\n"); + break; + } + addr = next; + } + + prange->actual_loc = best_loc; + + return r; +} + static void svm_migrate_page_free(struct page *page) { } diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h index 5db5686fa46a..ffae5f989909 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h @@ -37,6 +37,8 @@ enum MIGRATION_COPY_DIR { FROM_VRAM_TO_RAM };
+int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc); + #if defined(CONFIG_DEVICE_PRIVATE) int svm_migrate_init(struct amdgpu_device *adev); void svm_migrate_fini(struct amdgpu_device *adev); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 8a4d0a3935b6..0dbc403413a1 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -30,6 +30,7 @@ #include "amdgpu_xgmi.h" #include "kfd_priv.h" #include "kfd_svm.h" +#include "kfd_migrate.h"
#define AMDGPU_SVM_RANGE_RESTORE_DELAY_MS 1
@@ -120,6 +121,7 @@ svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start, INIT_LIST_HEAD(&prange->remove_list); INIT_LIST_HEAD(&prange->svm_bo_list); atomic_set(&prange->invalid, 0); + mutex_init(&prange->mutex); spin_lock_init(&prange->svm_bo_lock); svm_range_set_default_attributes(&prange->preferred_loc, &prange->prefetch_loc, @@ -409,6 +411,11 @@ static int svm_range_validate_vram(struct svm_range *prange) prange->it_node.start, prange->it_node.last, prange->actual_loc);
+ if (prange->mm_nodes) { + pr_debug("validation skipped after migration\n"); + return 0; + } + adev = svm_range_get_adev_by_id(prange, prange->actual_loc); if (!adev) { pr_debug("failed to get device by id 0x%x\n", @@ -428,7 +435,9 @@ svm_range_validate(struct mm_struct *mm, struct svm_range *prange) { int r;
- pr_debug("actual loc 0x%x\n", prange->actual_loc); + pr_debug("svms 0x%p [0x%lx 0x%lx] actual loc 0x%x\n", prange->svms, + prange->it_node.start, prange->it_node.last, + prange->actual_loc);
if (!prange->actual_loc) r = svm_range_validate_ram(mm, prange); @@ -1109,28 +1118,36 @@ static void svm_range_restore_work(struct work_struct *work) prange->svms, prange->it_node.start, prange->it_node.last, invalid);
+ /* + * If the range is migrating, wait until the migration is done. + */ + mutex_lock(&prange->mutex); + r = svm_range_validate(mm, prange); if (r) { pr_debug("failed %d to validate [0x%lx 0x%lx]\n", r, prange->it_node.start, prange->it_node.last);
- goto unlock_out; + goto out_unlock; }
r = svm_range_map_to_gpus(prange, true); - if (r) { + if (r) pr_debug("failed %d to map 0x%lx to gpu\n", r, prange->it_node.start); - goto unlock_out; - } + +out_unlock: + mutex_unlock(&prange->mutex); + if (r) + goto out_reschedule;
if (atomic_cmpxchg(&prange->invalid, invalid, 0) != invalid) - goto unlock_out; + goto out_reschedule; }
if (atomic_cmpxchg(&svms->evicted_ranges, evicted_ranges, 0) != evicted_ranges) - goto unlock_out; + goto out_reschedule;
evicted_ranges = 0;
@@ -1144,7 +1161,7 @@ static void svm_range_restore_work(struct work_struct *work)
pr_debug("restore svm ranges successfully\n");
-unlock_out: +out_reschedule: srcu_read_unlock(&svms->srcu, srcu_idx); mmap_read_unlock(mm); mutex_unlock(&process_info->lock); @@ -1617,6 +1634,134 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size, return 0; }
+/* svm_range_best_location - decide the best actual location + * @prange: svm range structure + * + * For xnack off: + * If range map to single GPU, the best acutal location is prefetch loc, which + * can be CPU or GPU. + * + * If range map to multiple GPUs, only if mGPU connection on xgmi same hive, + * the best actual location could be prefetch_loc GPU. If mGPU connection on + * PCIe, the best actual location is always CPU, because GPU cannot access vram + * of other GPUs, assuming PCIe small bar (large bar support is not upstream). + * + * For xnack on: + * The best actual location is prefetch location. If mGPU connection on xgmi + * same hive, range map to multiple GPUs. Otherwise, the range only map to + * actual location GPU. Other GPU access vm fault will trigger migration. + * + * Context: Process context + * + * Return: + * 0 for CPU or GPU id + */ +static uint32_t svm_range_best_location(struct svm_range *prange) +{ + DECLARE_BITMAP(bitmap, MAX_GPU_INSTANCE); + uint32_t best_loc = prange->prefetch_loc; + struct amdgpu_device *bo_adev; + struct amdgpu_device *adev; + struct kfd_dev *kfd_dev; + struct kfd_process *p; + uint32_t gpuidx; + + p = container_of(prange->svms, struct kfd_process, svms); + + /* xnack on */ + if (p->xnack_enabled) + goto out; + + /* xnack off */ + if (!best_loc || best_loc == KFD_IOCTL_SVM_LOCATION_UNDEFINED) + goto out; + + bo_adev = svm_range_get_adev_by_id(prange, best_loc); + bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip, + MAX_GPU_INSTANCE); + + for_each_set_bit(gpuidx, bitmap, MAX_GPU_INSTANCE) { + kfd_process_device_from_gpuidx(p, gpuidx, &kfd_dev); + adev = (struct amdgpu_device *)kfd_dev->kgd; + + if (adev == bo_adev) + continue; + + if (!amdgpu_xgmi_same_hive(adev, bo_adev)) { + best_loc = 0; + break; + } + } + +out: + pr_debug("xnack %d svms 0x%p [0x%lx 0x%lx] best loc 0x%x\n", + p->xnack_enabled, &p->svms, prange->it_node.start, + prange->it_node.last, best_loc); + return best_loc; +} + +/* 
svm_range_trigger_migration - start page migration if prefetch loc changed + * @mm: current process mm_struct + * @prange: svm range structure + * @migrated: output, true if migration is triggered + * + * If the range's prefetch_loc is a GPU and the actual loc is CPU (0), migrate + * the range from ram to vram. + * If the range's prefetch_loc is CPU (0) and the actual loc is a GPU, migrate + * the range from vram to ram. + * + * If GPU vm fault retry is not enabled, migration interacts with the MMU + * notifier and restore work: + * 1. migrate_vma_setup invalidates pages; the MMU notifier callback + * svm_range_evict stops all queues and schedules restore work + * 2. svm_range_restore_work waits for the migration to finish via + * a. svm_range_validate_vram, which takes prange->mutex + * b. svm_range_validate_ram, whose HMM get pages waits for the CPU fault + * handler to return + * 3. restore work updates the GPU mappings and resumes all queues. + * + * Context: Process context + * + * Return: + * 0 - OK, otherwise - error code of migration + */ +static int +svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange, + bool *migrated) +{ + uint32_t best_loc; + int r = 0; + + *migrated = false; + best_loc = svm_range_best_location(prange); + + if (best_loc == KFD_IOCTL_SVM_LOCATION_UNDEFINED || + best_loc == prange->actual_loc) + return 0; + + if (best_loc && !prange->actual_loc && + !(prange->flags & KFD_IOCTL_SVM_FLAG_HOST_ACCESS)) + return 0; + + if (best_loc) { + if (!prange->actual_loc && !prange->pages_addr) { + pr_debug("host access and prefetch to gpu\n"); + r = svm_range_validate_ram(mm, prange); + if (r) { + pr_debug("failed %d to validate on ram\n", r); + return r; + } + } + + pr_debug("migrate from ram to vram\n"); + r = svm_migrate_ram_to_vram(prange, best_loc); + + if (!r) + *migrated = true; + } + + return r; +} + static int svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size, uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs) @@ -1675,6 +1820,9 @@ svm_range_set_attr(struct kfd_process 
*p, uint64_t start, uint64_t size, }
list_for_each_entry(prange, &update_list, update_list) { + bool migrated; + + mutex_lock(&prange->mutex);
r = svm_range_apply_attrs(p, prange, nattr, attrs); if (r) { @@ -1682,6 +1830,16 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size, goto out_unlock; }
+ r = svm_range_trigger_migration(mm, prange, &migrated); + if (r) + goto out_unlock; + + if (migrated) { + pr_debug("restore_work will update mappings of GPUs\n"); + mutex_unlock(&prange->mutex); + continue; + } + r = svm_range_validate(mm, prange); if (r) { pr_debug("failed %d to validate svm range\n", r); @@ -1693,6 +1851,7 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size, pr_debug("failed %d to map svm range\n", r);
out_unlock: + mutex_unlock(&prange->mutex); if (r) { mmap_read_unlock(mm); srcu_read_unlock(&prange->svms->srcu, srcu_idx); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h index b1d2db02043b..b81dfb32135b 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h @@ -42,6 +42,7 @@ struct svm_range_bo { * struct svm_range - shared virtual memory range * * @svms: list of svm ranges, structure defined in kfd_process + * @mutex: to serialize range migration, validation and mapping update * @it_node: node [start, last] stored in interval tree, start, last are page * aligned, page size is (last - start + 1) * @list: link list node, used to scan all ranges of svms @@ -70,6 +71,7 @@ struct svm_range_bo { */ struct svm_range { struct svm_range_list *svms; + struct mutex mutex; struct interval_tree_node it_node; struct list_head list; struct list_head update_list;
From: Philip Yang Philip.Yang@amd.com
If a CPU page fault happens, the HMM pgmap_ops callback migrate_to_ram starts migrating memory from vram to ram in these steps:
1. migrate_vma_pages gets the vram pages and notifies HMM to invalidate the pages; the HMM interval notifier callback evicts the process queues
2. Allocate system memory pages
3. Use svm copy memory to migrate data from vram to ram
4. migrate_vma_pages copies the page structures from vram pages to ram pages
5. Return VM_FAULT_SIGBUS if migration failed, to notify the application
6. migrate_vma_finalize puts the vram pages; the page_free callback frees the vram pages and vram nodes
7. Restore work waits for the migration to finish, then updates the GPU page table mapping to system memory and resumes the process queues
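The copy in step 3 does not issue one GART copy per page: svm_migrate_copy_to_ram batches physically contiguous VRAM pages and flushes one copy per contiguous run. A minimal userspace sketch of that batching logic follows; count_gart_copies is a hypothetical helper for illustration, not part of this series:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* Model of the batching in svm_migrate_copy_to_ram(): one copy is
 * issued per physically contiguous run of source pages instead of one
 * copy per page. Returns how many svm_migrate_copy_memory_gart() calls
 * the loop would make. Hypothetical helper for illustration only.
 */
static size_t count_gart_copies(const uint64_t *src, size_t npages)
{
	size_t copies = 0;
	size_t i;

	for (i = 0; i < npages; i++) {
		/* a discontinuity flushes the run collected so far */
		if (i > 0 && src[i] != src[i - 1] + PAGE_SIZE)
			copies++;
	}
	if (npages)
		copies++;	/* final flush for the trailing run */
	return copies;
}
```

For five pages laid out as two contiguous runs, this model issues two copies instead of five, which is why the kernel loop tracks the run length j and only calls the GART copy on a discontinuity or at the end.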
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 274 ++++++++++++++++++++++- drivers/gpu/drm/amd/amdkfd/kfd_migrate.h | 3 + drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 116 +++++++++- drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 4 + 4 files changed, 392 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index af23f0be7eaf..d33a4cc63495 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -259,6 +259,35 @@ svm_migrate_put_vram_page(struct amdgpu_device *adev, unsigned long addr) put_page(page); }
+static unsigned long +svm_migrate_addr(struct amdgpu_device *adev, struct page *page) +{ + unsigned long addr; + + addr = page_to_pfn(page) << PAGE_SHIFT; + return (addr - adev->kfd.dev->pgmap.res.start); +} + +static struct page * +svm_migrate_get_sys_page(struct vm_area_struct *vma, unsigned long addr) +{ + struct page *page; + + page = alloc_page_vma(GFP_HIGHUSER, vma, addr); + if (page) + lock_page(page); + + return page; +} + +void svm_migrate_put_sys_page(unsigned long addr) +{ + struct page *page; + + page = pfn_to_page(addr >> PAGE_SHIFT); + unlock_page(page); + put_page(page); +}
static int svm_migrate_copy_to_vram(struct amdgpu_device *adev, struct svm_range *prange, @@ -471,13 +500,208 @@ int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc)
static void svm_migrate_page_free(struct page *page) { + /* Keep this function to avoid warning */ +} + +static int +svm_migrate_copy_to_ram(struct amdgpu_device *adev, struct svm_range *prange, + struct migrate_vma *migrate, + struct dma_fence **mfence) +{ + uint64_t npages = migrate->cpages; + uint64_t *src, *dst; + struct page *dpage; + uint64_t i = 0, j; + uint64_t addr; + int r = 0; + + pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, + prange->it_node.start, prange->it_node.last); + + addr = prange->it_node.start << PAGE_SHIFT; + + src = kvmalloc_array(npages << 1, sizeof(*src), GFP_KERNEL); + if (!src) + return -ENOMEM; + + dst = src + npages; + + prange->pages_addr = kvmalloc_array(npages, sizeof(*prange->pages_addr), + GFP_KERNEL | __GFP_ZERO); + if (!prange->pages_addr) { + r = -ENOMEM; + goto out_oom; + } + + for (i = 0, j = 0; i < npages; i++, j++, addr += PAGE_SIZE) { + struct page *spage; + + spage = migrate_pfn_to_page(migrate->src[i]); + if (!spage) { + pr_debug("failed get spage svms 0x%p [0x%lx 0x%lx]\n", + prange->svms, prange->it_node.start, + prange->it_node.last); + r = -ENOMEM; + goto out_oom; + } + src[i] = svm_migrate_addr(adev, spage); + if (i > 0 && src[i] != src[i - 1] + PAGE_SIZE) { + r = svm_migrate_copy_memory_gart(adev, src + i - j, + dst + i - j, j, + FROM_VRAM_TO_RAM, + mfence); + if (r) + goto out_oom; + j = 0; + } + + dpage = svm_migrate_get_sys_page(migrate->vma, addr); + if (!dpage) { + pr_debug("failed get page svms 0x%p [0x%lx 0x%lx]\n", + prange->svms, prange->it_node.start, + prange->it_node.last); + r = -ENOMEM; + goto out_oom; + } + + dst[i] = page_to_pfn(dpage) << PAGE_SHIFT; + *(prange->pages_addr + i) = dst[i]; + + migrate->dst[i] = migrate_pfn(page_to_pfn(dpage)); + migrate->dst[i] |= MIGRATE_PFN_LOCKED; + + } + + r = svm_migrate_copy_memory_gart(adev, src + i - j, dst + i - j, j, + FROM_VRAM_TO_RAM, mfence); + +out_oom: + kvfree(src); + if (r) { + pr_debug("failed %d copy to ram\n", r); + while (i--) { + 
svm_migrate_put_sys_page(dst[i]); + migrate->dst[i] = 0; + } + } + + return r; +} + +static int +svm_migrate_vma_to_ram(struct amdgpu_device *adev, struct svm_range *prange, + struct vm_area_struct *vma, uint64_t start, uint64_t end) +{ + uint64_t npages = (end - start) >> PAGE_SHIFT; + struct dma_fence *mfence = NULL; + struct migrate_vma migrate; + int r = -ENOMEM; + + memset(&migrate, 0, sizeof(migrate)); + migrate.vma = vma; + migrate.start = start; + migrate.end = end; + migrate.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE; + migrate.pgmap_owner = adev; + + migrate.src = kvmalloc_array(npages << 1, sizeof(*migrate.src), + GFP_KERNEL | __GFP_ZERO); + if (!migrate.src) + goto out; + migrate.dst = migrate.src + npages; + + r = migrate_vma_setup(&migrate); + if (r) { + pr_debug("failed %d prepare migrate svms 0x%p [0x%lx 0x%lx]\n", + r, prange->svms, prange->it_node.start, + prange->it_node.last); + goto out_free; + } + + pr_debug("cpages %ld\n", migrate.cpages); + + if (migrate.cpages) { + svm_migrate_copy_to_ram(adev, prange, &migrate, &mfence); + migrate_vma_pages(&migrate); + svm_migrate_copy_done(adev, mfence); + migrate_vma_finalize(&migrate); + } else { + pr_debug("failed collect migrate device pages [0x%lx 0x%lx]\n", + prange->it_node.start, prange->it_node.last); + } + +out_free: + kvfree(migrate.src); +out: + return r; +} + +/** + * svm_migrate_vram_to_ram - migrate svm range from device to system + * @prange: range structure + * @mm: process mm, use current->mm if NULL + * + * Context: Process context, caller hold mm->mmap_sem and prange->lock and take + * svms srcu read lock + * + * Return: + * 0 - OK, otherwise error code + */ +int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm) +{ + struct amdgpu_device *adev; + struct vm_area_struct *vma; + unsigned long addr; + unsigned long start; + unsigned long end; + int r = 0; + + if (!prange->actual_loc) { + pr_debug("[0x%lx 0x%lx] already migrated to ram\n", + prange->it_node.start, 
prange->it_node.last); + return 0; + } + + adev = svm_range_get_adev_by_id(prange, prange->actual_loc); + if (!adev) { + pr_debug("failed to get device by id 0x%x\n", + prange->actual_loc); + return -ENODEV; + } + + pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, + prange->it_node.start, prange->it_node.last); + + start = prange->it_node.start << PAGE_SHIFT; + end = (prange->it_node.last + 1) << PAGE_SHIFT; + + for (addr = start; addr < end;) { + unsigned long next; + + vma = find_vma(mm, addr); + if (!vma || addr < vma->vm_start) + break; + + next = min(vma->vm_end, end); + r = svm_migrate_vma_to_ram(adev, prange, vma, addr, next); + if (r) { + pr_debug("failed %d to migrate\n", r); + break; + } + addr = next; + } + + svm_range_vram_node_free(prange); + prange->actual_loc = 0; + + return r; }
/** * svm_migrate_to_ram - CPU page fault handler * @vmf: CPU vm fault vma, address * - * Context: vm fault handler, mm->mmap_sem is taken + * Context: vm fault handler, caller holds the mmap lock * * Return: * 0 - OK @@ -485,7 +709,53 @@ static void svm_migrate_page_free(struct page *page) */ static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf) { - return VM_FAULT_SIGBUS; + unsigned long addr = vmf->address; + struct vm_area_struct *vma; + struct svm_range *prange; + struct list_head list; + struct kfd_process *p; + int r = VM_FAULT_SIGBUS; + int srcu_idx; + + vma = vmf->vma; + + p = kfd_lookup_process_by_mm(vma->vm_mm); + if (!p) { + pr_debug("failed to find process at fault address 0x%lx\n", addr); + return VM_FAULT_SIGBUS; + } + + /* Prevent the prange from being removed */ + srcu_idx = srcu_read_lock(&p->svms.srcu); + + addr >>= PAGE_SHIFT; + pr_debug("CPU page fault svms 0x%p address 0x%lx\n", &p->svms, addr); + + r = svm_range_split_by_granularity(p, addr, &list); + if (r) { + pr_debug("failed %d to split range by granularity\n", r); + goto out_srcu; + } + + list_for_each_entry(prange, &list, update_list) { + mutex_lock(&prange->mutex); + r = svm_migrate_vram_to_ram(prange, vma->vm_mm); + mutex_unlock(&prange->mutex); + if (r) { + pr_debug("failed %d migrate [0x%lx 0x%lx] to ram\n", r, + prange->it_node.start, prange->it_node.last); + goto out_srcu; + } + } + +out_srcu: + srcu_read_unlock(&p->svms.srcu, srcu_idx); + kfd_unref_process(p); + + if (r) + return VM_FAULT_SIGBUS; + + return 0; }
static const struct dev_pagemap_ops svm_migrate_pgmap_ops = { diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h index ffae5f989909..95fd7b21791f 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h @@ -38,6 +38,9 @@ enum MIGRATION_COPY_DIR { };
int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc); +int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm); +unsigned long +svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr);
#if defined(CONFIG_DEVICE_PRIVATE) int svm_migrate_init(struct amdgpu_device *adev); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 0dbc403413a1..37f35f986930 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -819,6 +819,92 @@ svm_range_split_add_front(struct svm_range *prange, struct svm_range *new, return 0; }
+/** + * svm_range_split_by_granularity - collect ranges within granularity boundary + * + * @p: the process with svms list + * @addr: the vm fault address in pages, to search ranges + * @list: output, the range list + * + * Collects the small ranges that make up one migration granule and splits the + * first and the last range at the granularity boundary. + * + * Context: holds and releases the svms lock + * + * Return: + * 0 - OK, otherwise error code + */ +int svm_range_split_by_granularity(struct kfd_process *p, unsigned long addr, + struct list_head *list) +{ + struct svm_range *prange; + struct svm_range *tail; + struct svm_range *new; + unsigned long start; + unsigned long last; + unsigned long size; + int r = 0; + + svms_lock(&p->svms); + + prange = svm_range_from_addr(&p->svms, addr); + if (!prange) { + pr_debug("cannot find svm range at 0x%lx\n", addr); + svms_unlock(&p->svms); + return -EFAULT; + } + + /* Align the split range start and size to the granularity size, so that + * a single PTE will be used for the whole range. This reduces the + * number of PTEs updated and the L1 TLB space used for translation. 
+ */ + size = 1ULL << prange->granularity; + start = ALIGN_DOWN(addr, size); + last = ALIGN(addr + 1, size) - 1; + INIT_LIST_HEAD(list); + + pr_debug("svms 0x%p split [0x%lx 0x%lx] at 0x%lx granularity 0x%lx\n", + prange->svms, start, last, addr, size); + + if (start > prange->it_node.start) { + r = svm_range_split(prange, prange->it_node.start, start - 1, + &new); + if (r) + goto out_unlock; + + svm_range_add_to_svms(new); + } else { + new = prange; + } + + while (size > new->npages) { + struct interval_tree_node *next; + + list_add(&new->update_list, list); + + next = interval_tree_iter_next(&new->it_node, start, last); + if (!next) + goto out_unlock; + + size -= new->npages; + new = container_of(next, struct svm_range, it_node); + } + + if (last < new->it_node.last) { + r = svm_range_split(new, new->it_node.start, last, &tail); + if (r) + goto out_unlock; + svm_range_add_to_svms(tail); + } + + list_add(&new->update_list, list); + +out_unlock: + svms_unlock(&p->svms); + + return r; +} + static uint64_t svm_range_get_pte_flags(struct amdgpu_device *adev, struct svm_range *prange) { @@ -1508,6 +1594,27 @@ static const struct mmu_interval_notifier_ops svm_range_mn_ops = { .invalidate = svm_range_cpu_invalidate_pagetables, };
+/** + * svm_range_from_addr - find svm range from fault address + * @svms: svm range list header + * @addr: address to search range interval tree, in pages + * + * Context: The caller must hold svms_lock + * + * Return: the svm_range found or NULL + */ +struct svm_range * +svm_range_from_addr(struct svm_range_list *svms, unsigned long addr) +{ + struct interval_tree_node *node; + + node = interval_tree_iter_first(&svms->objects, addr, addr); + if (!node) + return NULL; + + return container_of(node, struct svm_range, it_node); +} + void svm_range_list_fini(struct kfd_process *p) { pr_debug("pasid 0x%x svms 0x%p\n", p->pasid, &p->svms); @@ -1754,11 +1861,14 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
pr_debug("migrate from ram to vram\n"); r = svm_migrate_ram_to_vram(prange, best_loc); - - if (!r) - *migrated = true; + } else { + pr_debug("migrate from vram to ram\n"); + r = svm_migrate_vram_to_ram(prange, current->mm); }
+ if (!r) + *migrated = true; + return r; }
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h index b81dfb32135b..c67e96f764fe 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h @@ -112,10 +112,14 @@ void svm_range_list_fini(struct kfd_process *p); int svm_ioctl(struct kfd_process *p, enum kfd_ioctl_svm_op op, uint64_t start, uint64_t size, uint32_t nattrs, struct kfd_ioctl_svm_attribute *attrs); +struct svm_range *svm_range_from_addr(struct svm_range_list *svms, + unsigned long addr); struct amdgpu_device *svm_range_get_adev_by_id(struct svm_range *prange, uint32_t id); int svm_range_vram_node_new(struct amdgpu_device *adev, struct svm_range *prange, bool clear); void svm_range_vram_node_free(struct svm_range *prange); +int svm_range_split_by_granularity(struct kfd_process *p, unsigned long addr, + struct list_head *list);
#endif /* KFD_SVM_H_ */
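The granularity math used by svm_range_split_by_granularity in the patch above can be sketched in isolation: the faulting page address is aligned down and up to one migration granule. This is a hedged userspace model of the ALIGN_DOWN/ALIGN arithmetic only; granule_for_addr and struct granule are illustrative names, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the boundary math in svm_range_split_by_granularity(): the
 * faulting page address is aligned down/up to one migration granule so
 * a single PTE can cover the whole range. Illustrative names only.
 */
struct granule {
	uint64_t start;	/* first page of the granule */
	uint64_t last;	/* last page of the granule, inclusive */
};

static struct granule granule_for_addr(uint64_t addr, unsigned int granularity)
{
	uint64_t size = 1ULL << granularity;	/* granule size in pages */
	struct granule g;

	g.start = addr & ~(size - 1);	/* ALIGN_DOWN(addr, size) */
	g.last = g.start + size - 1;	/* ALIGN(addr + 1, size) - 1 */
	return g;
}
```

Since addr always lies inside [start, start + size - 1], rounding addr + 1 up to the next multiple of size and subtracting one always yields start + size - 1, i.e. the inclusive end of the same granule.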
From: Alex Sierra alex.sierra@amd.com
GPU page tables are invalidated by unmapping the prange directly in the MMU notifier when page fault retry is enabled through the amdgpu_noretry global parameter. The page table restore is performed in the page fault handler.
If xnack is on, we need to update the GPU mapping after a prefetch migration to avoid a GPU vm fault, because range migration unmaps the range from the GPUs and no restore work is scheduled to update the GPU mapping.
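The rule this patch implements in svm_range_set_attr can be stated compactly: skip the immediate validate/map only when a migration happened and xnack is off (scheduled restore work will remap); in every other case the caller must remap itself. A hypothetical helper, not part of the patch, capturing that decision:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not in the patch) stating the rule from
 * svm_range_set_attr: after svm_range_trigger_migration, validate and
 * map here unless a migration happened with xnack off, in which case
 * the scheduled restore work updates the GPU mappings instead.
 */
static bool need_immediate_remap(bool migrated, bool xnack_enabled)
{
	return !migrated || xnack_enabled;
}
```

This is the inverse of the `if (migrated && !p->xnack_enabled) continue;` condition in the diff below.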
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 37f35f986930..ea27c5ed4ef3 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1279,7 +1279,9 @@ svm_range_evict(struct svm_range_list *svms, struct mm_struct *mm, int r = 0; struct interval_tree_node *node; struct svm_range *prange; + struct kfd_process *p;
+ p = container_of(svms, struct kfd_process, svms); svms_lock(svms);
pr_debug("invalidate svms 0x%p [0x%lx 0x%lx]\n", svms, start, last); @@ -1292,8 +1294,13 @@ svm_range_evict(struct svm_range_list *svms, struct mm_struct *mm, next = interval_tree_iter_next(node, start, last);
invalid = atomic_inc_return(&prange->invalid); - evicted_ranges = atomic_inc_return(&svms->evicted_ranges); - if (evicted_ranges == 1) { + + if (!p->xnack_enabled) { + evicted_ranges = + atomic_inc_return(&svms->evicted_ranges); + if (evicted_ranges != 1) + goto next_node; + pr_debug("evicting svms 0x%p range [0x%lx 0x%lx]\n", prange->svms, prange->it_node.start, prange->it_node.last); @@ -1306,7 +1313,14 @@ svm_range_evict(struct svm_range_list *svms, struct mm_struct *mm, pr_debug("schedule to restore svm %p ranges\n", svms); schedule_delayed_work(&svms->restore_work, msecs_to_jiffies(AMDGPU_SVM_RANGE_RESTORE_DELAY_MS)); + } else { + pr_debug("invalidate svms 0x%p [0x%lx 0x%lx] %d\n", + prange->svms, prange->it_node.start, + prange->it_node.last, invalid); + if (invalid == 1) + svm_range_unmap_from_gpus(prange); } +next_node: node = next; }
@@ -1944,7 +1958,7 @@ svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size, if (r) goto out_unlock;
- if (migrated) { + if (migrated && !p->xnack_enabled) { pr_debug("restore_work will update mappings of GPUs\n"); mutex_unlock(&prange->mutex); continue;
From: Alex Sierra alex.sierra@amd.com
Page table restore implementation in the SVM API. This is called from the fault handler in amdgpu_vm to update page tables through the retry page fault IH.
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 78 ++++++++++++++++++++++++++++ drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 2 + 2 files changed, 80 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index ea27c5ed4ef3..7346255f7c27 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1629,6 +1629,84 @@ svm_range_from_addr(struct svm_range_list *svms, unsigned long addr) return container_of(node, struct svm_range, it_node); }
+int +svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, + uint64_t addr) +{ + int r = 0; + int srcu_idx; + struct mm_struct *mm = NULL; + struct svm_range *prange; + struct svm_range_list *svms; + struct kfd_process *p; + + p = kfd_lookup_process_by_pasid(pasid); + if (!p) { + pr_debug("kfd process not found pasid 0x%x\n", pasid); + return -ESRCH; + } + svms = &p->svms; + srcu_idx = srcu_read_lock(&svms->srcu); + + pr_debug("restoring svms 0x%p fault address 0x%llx\n", svms, addr); + + svms_lock(svms); + prange = svm_range_from_addr(svms, addr); + svms_unlock(svms); + if (!prange) { + pr_debug("failed to find prange svms 0x%p address [0x%llx]\n", + svms, addr); + r = -EFAULT; + goto unlock_out; + } + + if (!atomic_read(&prange->invalid)) { + pr_debug("svms 0x%p [0x%lx %lx] already restored\n", + svms, prange->it_node.start, prange->it_node.last); + goto unlock_out; + } + + mm = get_task_mm(p->lead_thread); + if (!mm) { + pr_debug("svms 0x%p failed to get mm\n", svms); + r = -ESRCH; + goto unlock_out; + } + + mmap_read_lock(mm); + + /* + * If the range is migrating, wait for the migration to finish. 
+ */ + mutex_lock(&prange->mutex); + + r = svm_range_validate(mm, prange); + if (r) { + pr_debug("failed %d to validate svms 0x%p [0x%lx 0x%lx]\n", r, + svms, prange->it_node.start, prange->it_node.last); + + goto mmput_out; + } + + pr_debug("restoring svms 0x%p [0x%lx %lx] mapping\n", + svms, prange->it_node.start, prange->it_node.last); + + r = svm_range_map_to_gpus(prange, true); + if (r) + pr_debug("failed %d to map svms 0x%p [0x%lx 0x%lx] to gpu\n", r, + svms, prange->it_node.start, prange->it_node.last); + +mmput_out: + mutex_unlock(&prange->mutex); + mmap_read_unlock(mm); + mmput(mm); +unlock_out: + srcu_read_unlock(&svms->srcu, srcu_idx); + kfd_unref_process(p); + + return r; +} + void svm_range_list_fini(struct kfd_process *p) { pr_debug("pasid 0x%x svms 0x%p\n", p->pasid, &p->svms); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h index c67e96f764fe..e546f36ef709 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h @@ -121,5 +121,7 @@ int svm_range_vram_node_new(struct amdgpu_device *adev, void svm_range_vram_node_free(struct svm_range *prange); int svm_range_split_by_granularity(struct kfd_process *p, unsigned long addr, struct list_head *list); +int svm_range_restore_pages(struct amdgpu_device *adev, + unsigned int pasid, uint64_t addr);
#endif /* KFD_SVM_H_ */
From: Alex Sierra alex.sierra@amd.com
Use the SVM API to restore page tables when retry faults are enabled and the VM is a compute context.
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index 9c557e8bf0e5..abdd4e7b4c3b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -37,6 +37,7 @@ #include "amdgpu_gmc.h" #include "amdgpu_xgmi.h" #include "amdgpu_dma_buf.h" +#include "kfd_svm.h"
/** * DOC: GPUVM @@ -3301,18 +3302,29 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid, uint64_t value, flags; struct amdgpu_vm *vm; long r; + bool is_compute_context = false;
spin_lock(&adev->vm_manager.pasid_lock); vm = idr_find(&adev->vm_manager.pasid_idr, pasid); - if (vm) + if (vm) { root = amdgpu_bo_ref(vm->root.base.bo); - else + is_compute_context = vm->is_compute_context; + } else { root = NULL; + } spin_unlock(&adev->vm_manager.pasid_lock);
if (!root) return false;
+ addr /= AMDGPU_GPU_PAGE_SIZE; + + if (!amdgpu_noretry && is_compute_context && + !svm_range_restore_pages(adev, pasid, addr)) { + amdgpu_bo_unref(&root); + return true; + } + r = amdgpu_bo_reserve(root, true); if (r) goto error_unref; @@ -3326,18 +3338,16 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid, if (!vm) goto error_unlock;
- addr /= AMDGPU_GPU_PAGE_SIZE; flags = AMDGPU_PTE_VALID | AMDGPU_PTE_SNOOPED | AMDGPU_PTE_SYSTEM;
- if (vm->is_compute_context) { + if (is_compute_context) { /* Intentionally setting invalid PTE flag * combination to force a no-retry-fault */ flags = AMDGPU_PTE_EXECUTABLE | AMDGPU_PDE_PTE | AMDGPU_PTE_TF; value = 0; - } else if (amdgpu_vm_fault_stop == AMDGPU_VM_FAULT_STOP_NEVER) { /* Redirect the access to the dummy page */ value = adev->dummy_page_addr;
From: Alex Sierra alex.sierra@amd.com
[why] As part of the SVM functionality, the eviction mechanism used for SVM_BOs is different. This mechanism uses one eviction fence per prange, instead of one fence per kfd_process.
[how] Add an svm_bo reference to amdgpu_amdkfd_fence to allow differentiating between SVM_BO and regular BO evictions. This also includes modifications to set the reference at the fence creation call.
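A reduced model of the new field: with an svm_bo back-pointer in the fence, eviction code can distinguish a per-range SVM BO fence from the classic per-process KFD eviction fence, whose svm_bo stays NULL. The struct and helper below are illustrative sketches, not the kernel definitions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced model of the changed fence: the new svm_bo back-pointer is
 * NULL for the per-process eviction fence and non-NULL for a per-range
 * SVM BO fence. Struct and helper names are illustrative only.
 */
struct amdkfd_fence_model {
	void *svm_bo;	/* stands in for struct svm_range_bo * */
};

static bool fence_is_svm_bo(const struct amdkfd_fence_model *fence)
{
	return fence->svm_bo != NULL;
}
```

This mirrors how the two amdgpu_amdkfd_fence_create call sites in the diff pass NULL for the process-wide fences, leaving non-NULL for SVM BO fences created later in the series.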
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 4 +++- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 5 +++-- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 6 ++++-- 3 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h index bc9f0e42e0a2..fb8be788ac1b 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h @@ -75,6 +75,7 @@ struct amdgpu_amdkfd_fence { struct mm_struct *mm; spinlock_t lock; char timeline_name[TASK_COMM_LEN]; + struct svm_range_bo *svm_bo; };
struct amdgpu_kfd_dev { @@ -95,7 +96,8 @@ enum kgd_engine_type { };
struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context, - struct mm_struct *mm); + struct mm_struct *mm, + struct svm_range_bo *svm_bo); bool amdkfd_fence_check_mm(struct dma_fence *f, struct mm_struct *mm); struct amdgpu_amdkfd_fence *to_amdgpu_amdkfd_fence(struct dma_fence *f); int amdgpu_amdkfd_remove_fence_on_pt_pd_bos(struct amdgpu_bo *bo); diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c index 3107b9575929..9cc85efa4ed5 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c @@ -60,7 +60,8 @@ static atomic_t fence_seq = ATOMIC_INIT(0); */
struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context, - struct mm_struct *mm) + struct mm_struct *mm, + struct svm_range_bo *svm_bo) { struct amdgpu_amdkfd_fence *fence;
@@ -73,7 +74,7 @@ struct amdgpu_amdkfd_fence *amdgpu_amdkfd_fence_create(u64 context, fence->mm = mm; get_task_comm(fence->timeline_name, current); spin_lock_init(&fence->lock); - + fence->svm_bo = svm_bo; dma_fence_init(&fence->base, &amdkfd_fence_ops, &fence->lock, context, atomic_inc_return(&fence_seq));
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 99ad4e1d0896..8a43f3880022 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -928,7 +928,8 @@ static int init_kfd_vm(struct amdgpu_vm *vm, void **process_info,
info->eviction_fence = amdgpu_amdkfd_fence_create(dma_fence_context_alloc(1), - current->mm); + current->mm, + NULL); if (!info->eviction_fence) { pr_err("Failed to create eviction fence\n"); ret = -ENOMEM; @@ -2150,7 +2151,8 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void *info, struct dma_fence **ef) */ new_fence = amdgpu_amdkfd_fence_create( process_info->eviction_fence->base.context, - process_info->eviction_fence->mm); + process_info->eviction_fence->mm, + NULL); if (!new_fence) { pr_err("Failed to create eviction fence\n"); ret = -ENOMEM;
From: Alex Sierra alex.sierra@amd.com
Add a CREATE_SVM_BO define bit for SVM BOs. The existing userptr flag define was moved to concentrate these KFD BO type flags in one include file.
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 7 ++----- drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 5 +++++ 2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c index 8a43f3880022..5982d09b6c3d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c @@ -31,9 +31,6 @@ #include "amdgpu_dma_buf.h" #include <uapi/linux/kfd_ioctl.h>
-/* BO flag to indicate a KFD userptr BO */ -#define AMDGPU_AMDKFD_USERPTR_BO (1ULL << 63) - /* Userptr restore delay, just long enough to allow consecutive VM * changes to accumulate */ @@ -207,7 +204,7 @@ void amdgpu_amdkfd_unreserve_memory_limit(struct amdgpu_bo *bo) u32 domain = bo->preferred_domains; bool sg = (bo->preferred_domains == AMDGPU_GEM_DOMAIN_CPU);
- if (bo->flags & AMDGPU_AMDKFD_USERPTR_BO) { + if (bo->flags & AMDGPU_AMDKFD_CREATE_USERPTR_BO) { domain = AMDGPU_GEM_DOMAIN_CPU; sg = false; } @@ -1241,7 +1238,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu( bo->kfd_bo = *mem; (*mem)->bo = bo; if (user_addr) - bo->flags |= AMDGPU_AMDKFD_USERPTR_BO; + bo->flags |= AMDGPU_AMDKFD_CREATE_USERPTR_BO;
(*mem)->va = va; (*mem)->domain = domain; diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h index adbefd6a655d..b72772ab93fb 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h @@ -37,6 +37,11 @@ #define AMDGPU_BO_INVALID_OFFSET LONG_MAX #define AMDGPU_BO_MAX_PLACEMENTS 3
+/* BO flag to indicate a KFD userptr BO */ +#define AMDGPU_AMDKFD_CREATE_USERPTR_BO (1ULL << 63) +#define AMDGPU_AMDKFD_CREATE_SVM_BO (1ULL << 62) + + struct amdgpu_bo_param { unsigned long size; int byte_align;
From: Alex Sierra alex.sierra@amd.com
The svm_bo eviction mechanism is different from the one for regular BOs. Every SVM_BO created contains one eviction fence and one worker item for the eviction process. SVM_BOs can be attached to one or more pranges. For SVM_BO eviction, TTM starts calling the enable_signal callback for every SVM_BO until VRAM space is available. All ttm_evict calls here are synchronous, which guarantees that each eviction has completed and the fence has signaled before enable_signal returns.
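The reuse-vs-evict race this patch handles in svm_range_validate_svm_bo boils down to two checks when a range revisits its svm_bo: can a reference still be taken, and is the BO already being evicted? A hedged userspace model using C11 atomics; names and layout are illustrative, not the kernel structures:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of the reuse check in svm_range_validate_svm_bo():
 * reuse the existing svm_bo only if a reference can still be taken and
 * no eviction is in flight. Names and layout are illustrative, not the
 * kernel structures.
 */
struct svm_bo_model {
	atomic_int kref;	/* 0 means release is in progress */
	atomic_bool evicting;	/* set by the eviction worker */
};

/* kref_get_unless_zero() equivalent on the model */
static bool model_ref_unless_zero(struct svm_bo_model *bo)
{
	int old = atomic_load(&bo->kref);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&bo->kref, &old, old + 1))
			return true;
	}
	return false;
}

static bool can_reuse_svm_bo(struct svm_bo_model *bo)
{
	if (!model_ref_unless_zero(bo))
		return false;	/* being released: allocate a new BO */
	if (atomic_load(&bo->evicting)) {
		/* drop the reference; the caller waits on the eviction
		 * fence and then allocates a new BO
		 */
		atomic_fetch_sub(&bo->kref, 1);
		return false;
	}
	return true;	/* keep the reference and reuse the BO */
}
```

In the kernel code the same two outcomes map to "reuse old bo" and the dma_fence_wait path that waits out the eviction before allocating a fresh svm_bo.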
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 197 ++++++++++++++++++++------- drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 13 +- 2 files changed, 160 insertions(+), 50 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 7346255f7c27..63b745a06740 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -34,6 +34,7 @@
#define AMDGPU_SVM_RANGE_RESTORE_DELAY_MS 1
+static void svm_range_evict_svm_bo_worker(struct work_struct *work); /** * svm_range_unlink - unlink svm_range from lists and interval tree * @prange: svm range structure to be removed @@ -260,7 +261,15 @@ static void svm_range_bo_release(struct kref *kref) list_del_init(&prange->svm_bo_list); } spin_unlock(&svm_bo->list_lock); - + if (!dma_fence_is_signaled(&svm_bo->eviction_fence->base)) { + /* We're not in the eviction worker. + * Signal the fence and synchronize with any + * pending eviction work. + */ + dma_fence_signal(&svm_bo->eviction_fence->base); + cancel_work_sync(&svm_bo->eviction_work); + } + dma_fence_put(&svm_bo->eviction_fence->base); amdgpu_bo_unref(&svm_bo->bo); kfree(svm_bo); } @@ -273,6 +282,62 @@ static void svm_range_bo_unref(struct svm_range_bo *svm_bo) kref_put(&svm_bo->kref, svm_range_bo_release); }
+static bool svm_range_validate_svm_bo(struct svm_range *prange) +{ + spin_lock(&prange->svm_bo_lock); + if (!prange->svm_bo) { + spin_unlock(&prange->svm_bo_lock); + return false; + } + if (prange->mm_nodes) { + /* We still have a reference, all is well */ + spin_unlock(&prange->svm_bo_lock); + return true; + } + if (svm_bo_ref_unless_zero(prange->svm_bo)) { + if (READ_ONCE(prange->svm_bo->evicting)) { + struct dma_fence *f; + struct svm_range_bo *svm_bo; + /* The BO is getting evicted, + * we need to get a new one + */ + spin_unlock(&prange->svm_bo_lock); + svm_bo = prange->svm_bo; + f = dma_fence_get(&svm_bo->eviction_fence->base); + svm_range_bo_unref(prange->svm_bo); + /* wait for the fence to avoid long spin-loop + * at list_empty_careful + */ + dma_fence_wait(f, false); + dma_fence_put(f); + } else { + /* The BO was still around and we got + * a new reference to it + */ + spin_unlock(&prange->svm_bo_lock); + pr_debug("reuse old bo svms 0x%p [0x%lx 0x%lx]\n", + prange->svms, prange->it_node.start, + prange->it_node.last); + + prange->mm_nodes = prange->svm_bo->bo->tbo.mem.mm_node; + return true; + } + + } else { + spin_unlock(&prange->svm_bo_lock); + } + + /* We need a new svm_bo. Spin-loop to wait for concurrent + * svm_range_bo_release to finish removing this range from + * its range list. After this, it is safe to reuse the + * svm_bo pointer and svm_bo_list head. + */ + while (!list_empty_careful(&prange->svm_bo_list)) + ; + + return false; +} + static struct svm_range_bo *svm_range_bo_new(void) { struct svm_range_bo *svm_bo; @@ -292,71 +357,54 @@ int svm_range_vram_node_new(struct amdgpu_device *adev, struct svm_range *prange, bool clear) { - struct amdkfd_process_info *process_info; struct amdgpu_bo_param bp; struct svm_range_bo *svm_bo; struct amdgpu_bo *bo; struct kfd_process *p; + struct mm_struct *mm; int r;
- pr_debug("[0x%lx 0x%lx]\n", prange->it_node.start, - prange->it_node.last); - spin_lock(&prange->svm_bo_lock); - if (prange->svm_bo) { - if (prange->mm_nodes) { - /* We still have a reference, all is well */ - spin_unlock(&prange->svm_bo_lock); - return 0; - } - if (svm_bo_ref_unless_zero(prange->svm_bo)) { - /* The BO was still around and we got - * a new reference to it - */ - spin_unlock(&prange->svm_bo_lock); - pr_debug("reuse old bo [0x%lx 0x%lx]\n", - prange->it_node.start, prange->it_node.last); - - prange->mm_nodes = prange->svm_bo->bo->tbo.mem.mm_node; - return 0; - } - - spin_unlock(&prange->svm_bo_lock); - - /* We need a new svm_bo. Spin-loop to wait for concurrent - * svm_range_bo_release to finish removing this range from - * its range list. After this, it is safe to reuse the - * svm_bo pointer and svm_bo_list head. - */ - while (!list_empty_careful(&prange->svm_bo_list)) - ; + p = container_of(prange->svms, struct kfd_process, svms); + pr_debug("pasid: %x svms 0x%p [0x%lx 0x%lx]\n", p->pasid, prange->svms, + prange->it_node.start, prange->it_node.last);
- } else { - spin_unlock(&prange->svm_bo_lock); - } + if (svm_range_validate_svm_bo(prange)) + return 0;
svm_bo = svm_range_bo_new(); if (!svm_bo) { pr_debug("failed to alloc svm bo\n"); return -ENOMEM; } - + mm = get_task_mm(p->lead_thread); + if (!mm) { + pr_debug("failed to get mm\n"); + kfree(svm_bo); + return -ESRCH; + } + svm_bo->svms = prange->svms; + svm_bo->eviction_fence = + amdgpu_amdkfd_fence_create(dma_fence_context_alloc(1), + mm, + svm_bo); + mmput(mm); + INIT_WORK(&svm_bo->eviction_work, svm_range_evict_svm_bo_worker); + svm_bo->evicting = 0; memset(&bp, 0, sizeof(bp)); bp.size = prange->npages * PAGE_SIZE; bp.byte_align = PAGE_SIZE; bp.domain = AMDGPU_GEM_DOMAIN_VRAM; bp.flags = AMDGPU_GEM_CREATE_NO_CPU_ACCESS; bp.flags |= clear ? AMDGPU_GEM_CREATE_VRAM_CLEARED : 0; + bp.flags |= AMDGPU_AMDKFD_CREATE_SVM_BO; bp.type = ttm_bo_type_device; bp.resv = NULL;
r = amdgpu_bo_create(adev, &bp, &bo); if (r) { pr_debug("failed %d to create bo\n", r); - kfree(svm_bo); - return r; + goto create_bo_failed; } - - p = container_of(prange->svms, struct kfd_process, svms); r = amdgpu_bo_reserve(bo, true); if (r) { pr_debug("failed %d to reserve bo\n", r); @@ -369,8 +417,7 @@ svm_range_vram_node_new(struct amdgpu_device *adev, struct svm_range *prange, amdgpu_bo_unreserve(bo); goto reserve_bo_failed; } - process_info = p->kgd_process_info; - amdgpu_bo_fence(bo, &process_info->eviction_fence->base, true); + amdgpu_bo_fence(bo, &svm_bo->eviction_fence->base, true);
amdgpu_bo_unreserve(bo);
@@ -380,14 +427,16 @@ svm_range_vram_node_new(struct amdgpu_device *adev, struct svm_range *prange, prange->offset = 0;
spin_lock(&svm_bo->list_lock); - list_add(&prange->svm_bo_list, &svm_bo->range_list); + list_add_rcu(&prange->svm_bo_list, &svm_bo->range_list); spin_unlock(&svm_bo->list_lock);
return 0;
reserve_bo_failed: - kfree(svm_bo); amdgpu_bo_unref(&bo); +create_bo_failed: + dma_fence_put(&svm_bo->eviction_fence->base); + kfree(svm_bo); prange->mm_nodes = NULL;
return r; @@ -621,7 +670,7 @@ svm_range_split_nodes(struct svm_range *new, struct svm_range *old, new->mm_nodes = old->mm_nodes;
spin_lock(&new->svm_bo->list_lock); - list_add(&new->svm_bo_list, &new->svm_bo->range_list); + list_add_rcu(&new->svm_bo_list, &new->svm_bo->range_list); spin_unlock(&new->svm_bo->list_lock);
return 0; @@ -1353,7 +1402,7 @@ struct svm_range *svm_range_clone(struct svm_range *old) new->offset = old->offset; new->svm_bo = svm_range_bo_ref(old->svm_bo); spin_lock(&new->svm_bo->list_lock); - list_add(&new->svm_bo_list, &new->svm_bo->range_list); + list_add_rcu(&new->svm_bo_list, &new->svm_bo->range_list); spin_unlock(&new->svm_bo->list_lock); } new->flags = old->flags; @@ -1964,6 +2013,62 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange, return r; }
+int svm_range_schedule_evict_svm_bo(struct amdgpu_amdkfd_fence *fence) +{ + if (!fence) + return -EINVAL; + + if (dma_fence_is_signaled(&fence->base)) + return 0; + + if (fence->svm_bo) { + WRITE_ONCE(fence->svm_bo->evicting, 1); + schedule_work(&fence->svm_bo->eviction_work); + } + + return 0; +} + +static void svm_range_evict_svm_bo_worker(struct work_struct *work) +{ + struct svm_range_bo *svm_bo; + struct svm_range *prange; + struct kfd_process *p; + struct mm_struct *mm; + int srcu_idx; + + svm_bo = container_of(work, struct svm_range_bo, eviction_work); + if (!svm_bo_ref_unless_zero(svm_bo)) + return; /* svm_bo was freed while eviction was pending */ + + /* svm_range_bo_release destroys this worker thread. So during + * the lifetime of this thread, kfd_process and mm will be valid. + */ + p = container_of(svm_bo->svms, struct kfd_process, svms); + mm = p->mm; + if (!mm) + return; + + mmap_read_lock(mm); + srcu_idx = srcu_read_lock(&svm_bo->svms->srcu); + list_for_each_entry_rcu(prange, &svm_bo->range_list, svm_bo_list) { + pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, + prange->it_node.start, prange->it_node.last); + mutex_lock(&prange->mutex); + svm_migrate_vram_to_ram(prange, svm_bo->eviction_fence->mm); + mutex_unlock(&prange->mutex); + } + srcu_read_unlock(&svm_bo->svms->srcu, srcu_idx); + mmap_read_unlock(mm); + + dma_fence_signal(&svm_bo->eviction_fence->base); + /* This is the last reference to svm_bo, after svm_range_vram_node_free + * has been called in svm_migrate_vram_to_ram + */ + WARN_ONCE(kref_read(&svm_bo->kref) != 1, "This was not the last reference\n"); + svm_range_bo_unref(svm_bo); +} + static int svm_range_set_attr(struct kfd_process *p, uint64_t start, uint64_t size, uint32_t nattr, struct kfd_ioctl_svm_attribute *attrs) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h index e546f36ef709..143573621956 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h 
@@ -33,10 +33,14 @@ #include "kfd_priv.h"
struct svm_range_bo { - struct amdgpu_bo *bo; - struct kref kref; - struct list_head range_list; /* all svm ranges shared this bo */ - spinlock_t list_lock; + struct amdgpu_bo *bo; + struct kref kref; + struct list_head range_list; /* all svm ranges shared this bo */ + spinlock_t list_lock; + struct amdgpu_amdkfd_fence *eviction_fence; + struct work_struct eviction_work; + struct svm_range_list *svms; + uint32_t evicting; }; /** * struct svm_range - shared virtual memory range @@ -123,5 +127,6 @@ int svm_range_split_by_granularity(struct kfd_process *p, unsigned long addr, struct list_head *list); int svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, uint64_t addr); +int svm_range_schedule_evict_svm_bo(struct amdgpu_amdkfd_fence *fence);
#endif /* KFD_SVM_H_ */
From: Alex Sierra alex.sierra@amd.com
[why] To support svm bo eviction mechanism.
[how] If the created BO has the AMDGPU_AMDKFD_CREATE_SVM_BO flag set, the enable_signal callback will be called inside amdgpu_evict_flags. This also guts the BO by removing all placements, so that TTM won't actually do an eviction. Instead it will discard the memory held by the BO. This is needed for HMM migration to user mode system memory pages.
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index f423f42cb9b5..62d4da95d22d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -107,6 +107,20 @@ static void amdgpu_evict_flags(struct ttm_buffer_object *bo, }
abo = ttm_to_amdgpu_bo(bo); + if (abo->flags & AMDGPU_AMDKFD_CREATE_SVM_BO) { + struct dma_fence *fence; + struct dma_resv *resv = &bo->base._resv; + + rcu_read_lock(); + fence = rcu_dereference(resv->fence_excl); + if (fence && !fence->ops->signaled) + dma_fence_enable_sw_signaling(fence); + + placement->num_placement = 0; + placement->num_busy_placement = 0; + rcu_read_unlock(); + return; + } switch (bo->mem.mem_type) { case AMDGPU_PL_GDS: case AMDGPU_PL_GWS:
Am 07.01.21 um 04:01 schrieb Felix Kuehling:
From: Alex Sierra alex.sierra@amd.com
[why] To support svm bo eviction mechanism.
[how] If the created BO has the AMDGPU_AMDKFD_CREATE_SVM_BO flag set, the enable_signal callback will be called inside amdgpu_evict_flags. This also guts the BO by removing all placements, so that TTM won't actually do an eviction. Instead it will discard the memory held by the BO. This is needed for HMM migration to user mode system memory pages.
I don't think that this will work. What exactly are you doing here?
As Daniel pointed out HMM and dma_fences are fundamentally incompatible.
Christian.
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index f423f42cb9b5..62d4da95d22d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -107,6 +107,20 @@ static void amdgpu_evict_flags(struct ttm_buffer_object *bo, }
abo = ttm_to_amdgpu_bo(bo);
+	if (abo->flags & AMDGPU_AMDKFD_CREATE_SVM_BO) {
+		struct dma_fence *fence;
+		struct dma_resv *resv = &bo->base._resv;
+
+		rcu_read_lock();
+		fence = rcu_dereference(resv->fence_excl);
+		if (fence && !fence->ops->signaled)
+			dma_fence_enable_sw_signaling(fence);
+
+		placement->num_placement = 0;
+		placement->num_busy_placement = 0;
+		rcu_read_unlock();
+		return;
+	}
 	switch (bo->mem.mem_type) { case AMDGPU_PL_GDS: case AMDGPU_PL_GWS:
Am 2021-01-07 um 5:56 a.m. schrieb Christian König:
Am 07.01.21 um 04:01 schrieb Felix Kuehling:
From: Alex Sierra alex.sierra@amd.com
[why] To support svm bo eviction mechanism.
[how] If the created BO has the AMDGPU_AMDKFD_CREATE_SVM_BO flag set, the enable_signal callback will be called inside amdgpu_evict_flags. This also guts the BO by removing all placements, so that TTM won't actually do an eviction. Instead it will discard the memory held by the BO. This is needed for HMM migration to user mode system memory pages.
I don't think that this will work. What exactly are you doing here?
We discussed this a while ago when we talked about pipelined gutting. And you actually helped us out with a fix for that (https://patchwork.freedesktop.org/patch/379039/).
SVM BOs are BOs in VRAM containing data for HMM ranges. When such a BO is evicted by TTM, we do an HMM migration of the data to system memory (triggered by kgd2kfd_schedule_evict_and_restore_process in patch 30). That means we don't need TTM to copy the BO contents to GTT any more. Instead we want to use pipelined gutting to allow the VRAM to be freed once the fence signals that the HMM migration is done (the dma_fence_signal call near the end of svm_range_evict_svm_bo_worker in patch 28).
Regards, Felix
As Daniel pointed out HMM and dma_fences are fundamentally incompatible.
Christian.
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index f423f42cb9b5..62d4da95d22d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -107,6 +107,20 @@ static void amdgpu_evict_flags(struct ttm_buffer_object *bo, } abo = ttm_to_amdgpu_bo(bo); + if (abo->flags & AMDGPU_AMDKFD_CREATE_SVM_BO) { + struct dma_fence *fence; + struct dma_resv *resv = &bo->base._resv;
+ rcu_read_lock(); + fence = rcu_dereference(resv->fence_excl); + if (fence && !fence->ops->signaled) + dma_fence_enable_sw_signaling(fence);
+ placement->num_placement = 0; + placement->num_busy_placement = 0; + rcu_read_unlock(); + return; + } switch (bo->mem.mem_type) { case AMDGPU_PL_GDS: case AMDGPU_PL_GWS:
Am 07.01.21 um 17:16 schrieb Felix Kuehling:
Am 2021-01-07 um 5:56 a.m. schrieb Christian König:
Am 07.01.21 um 04:01 schrieb Felix Kuehling:
From: Alex Sierra alex.sierra@amd.com
[why] To support svm bo eviction mechanism.
[how] If the created BO has the AMDGPU_AMDKFD_CREATE_SVM_BO flag set, the enable_signal callback will be called inside amdgpu_evict_flags. This also guts the BO by removing all placements, so that TTM won't actually do an eviction. Instead it will discard the memory held by the BO. This is needed for HMM migration to user mode system memory pages.
I don't think that this will work. What exactly are you doing here?
We discussed this a while ago when we talked about pipelined gutting. And you actually helped us out with a fix for that (https://patchwork.freedesktop.org/patch/379039/).
That's not what I meant. The pipelined gutting is ok, but why the enable_signaling()?
Christian.
SVM BOs are BOs in VRAM containing data for HMM ranges. When such a BO is evicted by TTM, we do an HMM migration of the data to system memory (triggered by kgd2kfd_schedule_evict_and_restore_process in patch 30). That means we don't need TTM to copy the BO contents to GTT any more. Instead we want to use pipelined gutting to allow the VRAM to be freed once the fence signals that the HMM migration is done (the dma_fence_signal call near the end of svm_range_evict_svm_bo_worker in patch 28).
Regards, Felix
As Daniel pointed out HMM and dma_fences are fundamentally incompatible.
Christian.
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index f423f42cb9b5..62d4da95d22d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -107,6 +107,20 @@ static void amdgpu_evict_flags(struct ttm_buffer_object *bo, } abo = ttm_to_amdgpu_bo(bo); + if (abo->flags & AMDGPU_AMDKFD_CREATE_SVM_BO) { + struct dma_fence *fence; + struct dma_resv *resv = &bo->base._resv;
+ rcu_read_lock(); + fence = rcu_dereference(resv->fence_excl); + if (fence && !fence->ops->signaled) + dma_fence_enable_sw_signaling(fence);
+ placement->num_placement = 0; + placement->num_busy_placement = 0; + rcu_read_unlock(); + return; + } switch (bo->mem.mem_type) { case AMDGPU_PL_GDS: case AMDGPU_PL_GWS:
Am 2021-01-07 um 11:28 a.m. schrieb Christian König:
Am 07.01.21 um 17:16 schrieb Felix Kuehling:
Am 2021-01-07 um 5:56 a.m. schrieb Christian König:
Am 07.01.21 um 04:01 schrieb Felix Kuehling:
From: Alex Sierra alex.sierra@amd.com
[why] To support svm bo eviction mechanism.
[how] If the created BO has the AMDGPU_AMDKFD_CREATE_SVM_BO flag set, the enable_signal callback will be called inside amdgpu_evict_flags. This also guts the BO by removing all placements, so that TTM won't actually do an eviction. Instead it will discard the memory held by the BO. This is needed for HMM migration to user mode system memory pages.
I don't think that this will work. What exactly are you doing here?
We discussed this a while ago when we talked about pipelined gutting. And you actually helped us out with a fix for that (https://patchwork.freedesktop.org/patch/379039/).
That's not what I meant. The pipelined gutting is ok, but why the enable_signaling()?
That's what triggers our eviction fence callback amdkfd_fence_enable_signaling that schedules the worker doing the eviction. Without pipelined gutting we'd be getting that callback from the GPU scheduler when it tries to execute the job that does the migration. With pipelined gutting we have to call this somewhere ourselves.
I guess we could schedule the eviction worker directly without going through the fence callback. I think we did it this way because it's more similar to our KFD BO eviction handling where the worker gets scheduled by the fence callback.
Regards, Felix
Christian.
SVM BOs are BOs in VRAM containing data for HMM ranges. When such a BO is evicted by TTM, we do an HMM migration of the data to system memory (triggered by kgd2kfd_schedule_evict_and_restore_process in patch 30). That means we don't need TTM to copy the BO contents to GTT any more. Instead we want to use pipelined gutting to allow the VRAM to be freed once the fence signals that the HMM migration is done (the dma_fence_signal call near the end of svm_range_evict_svm_bo_worker in patch 28).
Regards, Felix
As Daniel pointed out HMM and dma_fences are fundamentally incompatible.
Christian.
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c index f423f42cb9b5..62d4da95d22d 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c @@ -107,6 +107,20 @@ static void amdgpu_evict_flags(struct ttm_buffer_object *bo, } abo = ttm_to_amdgpu_bo(bo); + if (abo->flags & AMDGPU_AMDKFD_CREATE_SVM_BO) { + struct dma_fence *fence; + struct dma_resv *resv = &bo->base._resv;
+ rcu_read_lock(); + fence = rcu_dereference(resv->fence_excl); + if (fence && !fence->ops->signaled) + dma_fence_enable_sw_signaling(fence);
+ placement->num_placement = 0; + placement->num_busy_placement = 0; + rcu_read_unlock(); + return; + } switch (bo->mem.mem_type) { case AMDGPU_PL_GDS: case AMDGPU_PL_GWS:
From: Alex Sierra alex.sierra@amd.com
Add support for svm_bo fence eviction to the amdgpu_amdkfd_fence enable_signaling callback.
Signed-off-by: Alex Sierra alex.sierra@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 11 ++++++++--- 1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c index 9cc85efa4ed5..98d6e08f22d8 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c @@ -28,6 +28,7 @@ #include <linux/slab.h> #include <linux/sched/mm.h> #include "amdgpu_amdkfd.h" +#include "kfd_svm.h"
static const struct dma_fence_ops amdkfd_fence_ops; static atomic_t fence_seq = ATOMIC_INIT(0); @@ -123,9 +124,13 @@ static bool amdkfd_fence_enable_signaling(struct dma_fence *f) if (dma_fence_is_signaled(f)) return true;
- if (!kgd2kfd_schedule_evict_and_restore_process(fence->mm, f)) - return true; - + if (!fence->svm_bo) { + if (!kgd2kfd_schedule_evict_and_restore_process(fence->mm, f)) + return true; + } else { + if (!svm_range_schedule_evict_svm_bo(fence)) + return true; + } return false; }
From: Philip Yang Philip.Yang@amd.com
A fence slot was not reserved before using SDMA to update the page table, causing the kernel BUG backtrace below when handling a VM retry fault while the application is exiting.
[ 133.048143] kernel BUG at /home/yangp/git/compute_staging/kernel/drivers/dma-buf/dma-resv.c:281! [ 133.048487] Workqueue: events amdgpu_irq_handle_ih1 [amdgpu] [ 133.048506] RIP: 0010:dma_resv_add_shared_fence+0x204/0x280 [ 133.048672] amdgpu_vm_sdma_commit+0x134/0x220 [amdgpu] [ 133.048788] amdgpu_vm_bo_update_range+0x220/0x250 [amdgpu] [ 133.048905] amdgpu_vm_handle_fault+0x202/0x370 [amdgpu] [ 133.049031] gmc_v9_0_process_interrupt+0x1ab/0x310 [amdgpu] [ 133.049165] ? kgd2kfd_interrupt+0x9a/0x180 [amdgpu] [ 133.049289] ? amdgpu_irq_dispatch+0xb6/0x240 [amdgpu] [ 133.049408] amdgpu_irq_dispatch+0xb6/0x240 [amdgpu] [ 133.049534] amdgpu_ih_process+0x9b/0x1c0 [amdgpu] [ 133.049657] amdgpu_irq_handle_ih1+0x21/0x60 [amdgpu] [ 133.049669] process_one_work+0x29f/0x640 [ 133.049678] worker_thread+0x39/0x3f0 [ 133.049685] ? process_one_work+0x640/0x640
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index abdd4e7b4c3b..bd9de870f8f1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -3301,7 +3301,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid, struct amdgpu_bo *root; uint64_t value, flags; struct amdgpu_vm *vm; - long r; + int r; bool is_compute_context = false;
spin_lock(&adev->vm_manager.pasid_lock); @@ -3359,6 +3359,12 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid, value = 0; }
+ r = dma_resv_reserve_shared(root->tbo.base.resv, 1); + if (r) { + pr_debug("failed %d to reserve fence slot\n", r); + goto error_unlock; + } + r = amdgpu_vm_bo_update_mapping(adev, adev, vm, true, false, NULL, addr, addr, flags, value, NULL, NULL, NULL); @@ -3370,7 +3376,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid, error_unlock: amdgpu_bo_unreserve(root); if (r < 0) - DRM_ERROR("Can't handle page fault (%ld)\n", r); + DRM_ERROR("Can't handle page fault (%d)\n", r);
error_unref: amdgpu_bo_unref(&root);
Am 07.01.21 um 04:01 schrieb Felix Kuehling:
From: Philip Yang Philip.Yang@amd.com
A fence slot was not reserved before using SDMA to update the page table, causing the kernel BUG backtrace below when handling a VM retry fault while the application is exiting.
[ 133.048143] kernel BUG at /home/yangp/git/compute_staging/kernel/drivers/dma-buf/dma-resv.c:281! [ 133.048487] Workqueue: events amdgpu_irq_handle_ih1 [amdgpu] [ 133.048506] RIP: 0010:dma_resv_add_shared_fence+0x204/0x280 [ 133.048672] amdgpu_vm_sdma_commit+0x134/0x220 [amdgpu] [ 133.048788] amdgpu_vm_bo_update_range+0x220/0x250 [amdgpu] [ 133.048905] amdgpu_vm_handle_fault+0x202/0x370 [amdgpu] [ 133.049031] gmc_v9_0_process_interrupt+0x1ab/0x310 [amdgpu] [ 133.049165] ? kgd2kfd_interrupt+0x9a/0x180 [amdgpu] [ 133.049289] ? amdgpu_irq_dispatch+0xb6/0x240 [amdgpu] [ 133.049408] amdgpu_irq_dispatch+0xb6/0x240 [amdgpu] [ 133.049534] amdgpu_ih_process+0x9b/0x1c0 [amdgpu] [ 133.049657] amdgpu_irq_handle_ih1+0x21/0x60 [amdgpu] [ 133.049669] process_one_work+0x29f/0x640 [ 133.049678] worker_thread+0x39/0x3f0 [ 133.049685] ? process_one_work+0x640/0x640
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com
Reviewed-by: Christian König christian.koenig@amd.com
drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index abdd4e7b4c3b..bd9de870f8f1 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -3301,7 +3301,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid, struct amdgpu_bo *root; uint64_t value, flags; struct amdgpu_vm *vm;
- long r;
+	int r;
 	bool is_compute_context = false;
spin_lock(&adev->vm_manager.pasid_lock);
@@ -3359,6 +3359,12 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid, value = 0; }
+	r = dma_resv_reserve_shared(root->tbo.base.resv, 1);
+	if (r) {
+		pr_debug("failed %d to reserve fence slot\n", r);
+		goto error_unlock;
+	}
+
 	r = amdgpu_vm_bo_update_mapping(adev, adev, vm, true, false, NULL, addr, addr, flags, value, NULL, NULL, NULL);
@@ -3370,7 +3376,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid, error_unlock: amdgpu_bo_unreserve(root); if (r < 0)
-		DRM_ERROR("Can't handle page fault (%ld)\n", r);
+		DRM_ERROR("Can't handle page fault (%d)\n", r);
error_unref: amdgpu_bo_unref(&root);
From: Philip Yang Philip.Yang@amd.com
If xnack is on, VM retry fault interrupts are sent to IH ring1, and ring1 fills up quickly. IH then cannot receive other interrupts, which causes a deadlock if we are migrating a buffer using SDMA and waiting for SDMA completion while handling a retry fault.
Remove VMC from the IH storm client list and enable ring1 write pointer overflow. IH will then drop retry fault interrupts and remain able to receive other interrupts while the driver is handling retry faults.
IH does not write the ring1 write pointer back to memory, and the ring1 write pointer recorded by the self-irq is not updated, so always read the latest ring1 write pointer from the register.
Signed-off-by: Philip Yang Philip.Yang@amd.com Signed-off-by: Felix Kuehling Felix.Kuehling@amd.com --- drivers/gpu/drm/amd/amdgpu/vega10_ih.c | 32 +++++++++----------------- drivers/gpu/drm/amd/amdgpu/vega20_ih.c | 32 +++++++++----------------- 2 files changed, 22 insertions(+), 42 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c index 88626d83e07b..ca8efa5c6978 100644 --- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c +++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c @@ -220,10 +220,8 @@ static int vega10_ih_enable_ring(struct amdgpu_device *adev, tmp = vega10_ih_rb_cntl(ih, tmp); if (ih == &adev->irq.ih) tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RPTR_REARM, !!adev->irq.msi_enabled); - if (ih == &adev->irq.ih1) { - tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_ENABLE, 0); + if (ih == &adev->irq.ih1) tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RB_FULL_DRAIN_ENABLE, 1); - } if (amdgpu_sriov_vf(adev)) { if (psp_reg_program(&adev->psp, ih_regs->psp_reg_id, tmp)) { dev_err(adev->dev, "PSP program IH_RB_CNTL failed!\n"); @@ -265,7 +263,6 @@ static int vega10_ih_irq_init(struct amdgpu_device *adev) u32 ih_chicken; int ret; int i; - u32 tmp;
/* disable irqs */ ret = vega10_ih_toggle_interrupts(adev, false); @@ -291,15 +288,6 @@ static int vega10_ih_irq_init(struct amdgpu_device *adev) } }
- tmp = RREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL); - tmp = REG_SET_FIELD(tmp, IH_STORM_CLIENT_LIST_CNTL, - CLIENT18_IS_STORM_CLIENT, 1); - WREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL, tmp); - - tmp = RREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL); - tmp = REG_SET_FIELD(tmp, IH_INT_FLOOD_CNTL, FLOOD_CNTL_ENABLE, 1); - WREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL, tmp); - pci_set_master(adev->pdev);
/* enable interrupts */ @@ -345,11 +333,17 @@ static u32 vega10_ih_get_wptr(struct amdgpu_device *adev, u32 wptr, tmp; struct amdgpu_ih_regs *ih_regs;
- wptr = le32_to_cpu(*ih->wptr_cpu); - ih_regs = &ih->ih_regs; + if (ih == &adev->irq.ih) { + /* Only ring0 supports writeback. On other rings fall back + * to register-based code with overflow checking below. + */ + wptr = le32_to_cpu(*ih->wptr_cpu);
- if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW)) - goto out; + if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW)) + goto out; + } + + ih_regs = &ih->ih_regs;
/* Double check that the overflow wasn't already cleared. */ wptr = RREG32_NO_KIQ(ih_regs->ih_rb_wptr); @@ -440,15 +434,11 @@ static int vega10_ih_self_irq(struct amdgpu_device *adev, struct amdgpu_irq_src *source, struct amdgpu_iv_entry *entry) { - uint32_t wptr = cpu_to_le32(entry->src_data[0]); - switch (entry->ring_id) { case 1: - *adev->irq.ih1.wptr_cpu = wptr; schedule_work(&adev->irq.ih1_work); break; case 2: - *adev->irq.ih2.wptr_cpu = wptr; schedule_work(&adev->irq.ih2_work); break; default: break; diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c index 42032ca380cc..60d1bd51781e 100644 --- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c +++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c @@ -220,10 +220,8 @@ static int vega20_ih_enable_ring(struct amdgpu_device *adev, tmp = vega20_ih_rb_cntl(ih, tmp); if (ih == &adev->irq.ih) tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RPTR_REARM, !!adev->irq.msi_enabled); - if (ih == &adev->irq.ih1) { - tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_ENABLE, 0); + if (ih == &adev->irq.ih1) tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RB_FULL_DRAIN_ENABLE, 1); - } if (amdgpu_sriov_vf(adev)) { if (psp_reg_program(&adev->psp, ih_regs->psp_reg_id, tmp)) { dev_err(adev->dev, "PSP program IH_RB_CNTL failed!\n"); @@ -297,7 +295,6 @@ static int vega20_ih_irq_init(struct amdgpu_device *adev) u32 ih_chicken; int ret; int i; - u32 tmp;
/* disable irqs */ ret = vega20_ih_toggle_interrupts(adev, false); @@ -326,15 +323,6 @@ static int vega20_ih_irq_init(struct amdgpu_device *adev) } }
- tmp = RREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL); - tmp = REG_SET_FIELD(tmp, IH_STORM_CLIENT_LIST_CNTL, - CLIENT18_IS_STORM_CLIENT, 1); - WREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL, tmp); - - tmp = RREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL); - tmp = REG_SET_FIELD(tmp, IH_INT_FLOOD_CNTL, FLOOD_CNTL_ENABLE, 1); - WREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL, tmp); - pci_set_master(adev->pdev);
/* enable interrupts */ @@ -379,11 +367,17 @@ static u32 vega20_ih_get_wptr(struct amdgpu_device *adev, u32 wptr, tmp; struct amdgpu_ih_regs *ih_regs;
- wptr = le32_to_cpu(*ih->wptr_cpu); - ih_regs = &ih->ih_regs; + if (ih == &adev->irq.ih) { + /* Only ring0 supports writeback. On other rings fall back + * to register-based code with overflow checking below. + */ + wptr = le32_to_cpu(*ih->wptr_cpu);
- if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW)) - goto out; + if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW)) + goto out; + } + + ih_regs = &ih->ih_regs;
/* Double check that the overflow wasn't already cleared. */ wptr = RREG32_NO_KIQ(ih_regs->ih_rb_wptr); @@ -473,15 +467,11 @@ static int vega20_ih_self_irq(struct amdgpu_device *adev, struct amdgpu_irq_src *source, struct amdgpu_iv_entry *entry) { - uint32_t wptr = cpu_to_le32(entry->src_data[0]); - switch (entry->ring_id) { case 1: - *adev->irq.ih1.wptr_cpu = wptr; schedule_work(&adev->irq.ih1_work); break; case 2: - *adev->irq.ih2.wptr_cpu = wptr; schedule_work(&adev->irq.ih2_work); break; default: break;
Am 07.01.21 um 04:01 schrieb Felix Kuehling:
From: Philip Yang Philip.Yang@amd.com
If xnack is on, VM retry fault interrupts are sent to IH ring1, and ring1 fills up quickly. IH then cannot receive other interrupts, which causes a deadlock if we are migrating a buffer using SDMA and waiting for SDMA completion while handling a retry fault.
Remove VMC from the IH storm client list and enable ring1 write pointer overflow. IH will then drop retry fault interrupts and remain able to receive other interrupts while the driver is handling retry faults.
IH does not write the ring1 write pointer back to memory, and the ring1 write pointer recorded by the self-irq is no longer updated, so always read the latest ring1 write pointer from the register.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
 drivers/gpu/drm/amd/amdgpu/vega10_ih.c | 32 +++++++++-----------------
 drivers/gpu/drm/amd/amdgpu/vega20_ih.c | 32 +++++++++-----------------
 2 files changed, 22 insertions(+), 42 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
index 88626d83e07b..ca8efa5c6978 100644
--- a/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/vega10_ih.c
@@ -220,10 +220,8 @@ static int vega10_ih_enable_ring(struct amdgpu_device *adev,
 	tmp = vega10_ih_rb_cntl(ih, tmp);
 	if (ih == &adev->irq.ih)
 		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RPTR_REARM, !!adev->irq.msi_enabled);
-	if (ih == &adev->irq.ih1) {
-		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_ENABLE, 0);
+	if (ih == &adev->irq.ih1)
 		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RB_FULL_DRAIN_ENABLE, 1);
-	}
 	if (amdgpu_sriov_vf(adev)) {
 		if (psp_reg_program(&adev->psp, ih_regs->psp_reg_id, tmp)) {
 			dev_err(adev->dev, "PSP program IH_RB_CNTL failed!\n");
@@ -265,7 +263,6 @@ static int vega10_ih_irq_init(struct amdgpu_device *adev)
 	u32 ih_chicken;
 	int ret;
 	int i;
-	u32 tmp;

 	/* disable irqs */
 	ret = vega10_ih_toggle_interrupts(adev, false);
@@ -291,15 +288,6 @@ static int vega10_ih_irq_init(struct amdgpu_device *adev)
 		}
 	}

-	tmp = RREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL);
-	tmp = REG_SET_FIELD(tmp, IH_STORM_CLIENT_LIST_CNTL,
-			    CLIENT18_IS_STORM_CLIENT, 1);
-	WREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL, tmp);
-
-	tmp = RREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL);
-	tmp = REG_SET_FIELD(tmp, IH_INT_FLOOD_CNTL, FLOOD_CNTL_ENABLE, 1);
-	WREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL, tmp);
-
 	pci_set_master(adev->pdev);

 	/* enable interrupts */
@@ -345,11 +333,17 @@ static u32 vega10_ih_get_wptr(struct amdgpu_device *adev,
 	u32 wptr, tmp;
 	struct amdgpu_ih_regs *ih_regs;

-	wptr = le32_to_cpu(*ih->wptr_cpu);
-	ih_regs = &ih->ih_regs;
+	if (ih == &adev->irq.ih) {
+		/* Only ring0 supports writeback. On other rings fall back
+		 * to register-based code with overflow checking below.
+		 */
+		wptr = le32_to_cpu(*ih->wptr_cpu);

-	if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW))
-		goto out;
+		if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW))
+			goto out;
+	}
+
+	ih_regs = &ih->ih_regs;

 	/* Double check that the overflow wasn't already cleared. */
 	wptr = RREG32_NO_KIQ(ih_regs->ih_rb_wptr);
@@ -440,15 +434,11 @@ static int vega10_ih_self_irq(struct amdgpu_device *adev,
 			      struct amdgpu_irq_src *source,
 			      struct amdgpu_iv_entry *entry)
 {
-	uint32_t wptr = cpu_to_le32(entry->src_data[0]);
-
 	switch (entry->ring_id) {
 	case 1:
-		*adev->irq.ih1.wptr_cpu = wptr;
 		schedule_work(&adev->irq.ih1_work);
 		break;
 	case 2:
-		*adev->irq.ih2.wptr_cpu = wptr;
 		schedule_work(&adev->irq.ih2_work);
 		break;
 	default:
 		break;
diff --git a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
index 42032ca380cc..60d1bd51781e 100644
--- a/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
+++ b/drivers/gpu/drm/amd/amdgpu/vega20_ih.c
@@ -220,10 +220,8 @@ static int vega20_ih_enable_ring(struct amdgpu_device *adev,
 	tmp = vega20_ih_rb_cntl(ih, tmp);
 	if (ih == &adev->irq.ih)
 		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RPTR_REARM, !!adev->irq.msi_enabled);
-	if (ih == &adev->irq.ih1) {
-		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, WPTR_OVERFLOW_ENABLE, 0);
+	if (ih == &adev->irq.ih1)
 		tmp = REG_SET_FIELD(tmp, IH_RB_CNTL, RB_FULL_DRAIN_ENABLE, 1);
-	}
 	if (amdgpu_sriov_vf(adev)) {
 		if (psp_reg_program(&adev->psp, ih_regs->psp_reg_id, tmp)) {
 			dev_err(adev->dev, "PSP program IH_RB_CNTL failed!\n");
@@ -297,7 +295,6 @@ static int vega20_ih_irq_init(struct amdgpu_device *adev)
 	u32 ih_chicken;
 	int ret;
 	int i;
-	u32 tmp;

 	/* disable irqs */
 	ret = vega20_ih_toggle_interrupts(adev, false);
@@ -326,15 +323,6 @@ static int vega20_ih_irq_init(struct amdgpu_device *adev)
 		}
 	}

-	tmp = RREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL);
-	tmp = REG_SET_FIELD(tmp, IH_STORM_CLIENT_LIST_CNTL,
-			    CLIENT18_IS_STORM_CLIENT, 1);
-	WREG32_SOC15(OSSSYS, 0, mmIH_STORM_CLIENT_LIST_CNTL, tmp);
-
-	tmp = RREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL);
-	tmp = REG_SET_FIELD(tmp, IH_INT_FLOOD_CNTL, FLOOD_CNTL_ENABLE, 1);
-	WREG32_SOC15(OSSSYS, 0, mmIH_INT_FLOOD_CNTL, tmp);
-
 	pci_set_master(adev->pdev);

 	/* enable interrupts */
@@ -379,11 +367,17 @@ static u32 vega20_ih_get_wptr(struct amdgpu_device *adev,
 	u32 wptr, tmp;
 	struct amdgpu_ih_regs *ih_regs;

-	wptr = le32_to_cpu(*ih->wptr_cpu);
-	ih_regs = &ih->ih_regs;
+	if (ih == &adev->irq.ih) {
+		/* Only ring0 supports writeback. On other rings fall back
+		 * to register-based code with overflow checking below.
+		 */
+		wptr = le32_to_cpu(*ih->wptr_cpu);

-	if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW))
-		goto out;
+		if (!REG_GET_FIELD(wptr, IH_RB_WPTR, RB_OVERFLOW))
+			goto out;
+	}
+
+	ih_regs = &ih->ih_regs;

 	/* Double check that the overflow wasn't already cleared. */
 	wptr = RREG32_NO_KIQ(ih_regs->ih_rb_wptr);
@@ -473,15 +467,11 @@ static int vega20_ih_self_irq(struct amdgpu_device *adev,
 			      struct amdgpu_irq_src *source,
 			      struct amdgpu_iv_entry *entry)
 {
-	uint32_t wptr = cpu_to_le32(entry->src_data[0]);
-
 	switch (entry->ring_id) {
 	case 1:
-		*adev->irq.ih1.wptr_cpu = wptr;
 		schedule_work(&adev->irq.ih1_work);
 		break;
 	case 2:
-		*adev->irq.ih2.wptr_cpu = wptr;
 		schedule_work(&adev->irq.ih2_work);
 		break;
 	default:
 		break;
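The effect of enabling write pointer overflow on ring1 can be illustrated with a toy ring buffer. This is a sketch for illustration only, not driver code, and all names are invented: the producer (hardware) never stalls, and the consumer detects the overflow and skips the clobbered entries. Dropping retry faults this way is what keeps IH able to service other interrupt clients during a fault storm.

```c
#include <stdint.h>
#include <assert.h>

#define RING_SIZE 8			/* must be a power of two */

struct toy_ih_ring {
	uint32_t entries[RING_SIZE];
	uint32_t wptr;			/* monotonically increasing */
	uint32_t rptr;			/* monotonically increasing */
};

/* Producer: hardware never stalls; old entries are silently overwritten. */
static inline void toy_ih_write(struct toy_ih_ring *r, uint32_t v)
{
	r->entries[r->wptr % RING_SIZE] = v;
	r->wptr++;
}

/* Consumer: on overflow, skip ahead to the oldest entry not yet clobbered,
 * dropping everything that was overwritten in the meantime. */
static inline int toy_ih_read(struct toy_ih_ring *r, uint32_t *v)
{
	if (r->wptr == r->rptr)
		return 0;			/* ring empty */
	if (r->wptr - r->rptr > RING_SIZE)	/* overflow: entries lost */
		r->rptr = r->wptr - RING_SIZE;
	*v = r->entries[r->rptr % RING_SIZE];
	r->rptr++;
	return 1;
}
```

With WPTR_OVERFLOW_ENABLE cleared (the old behavior), the hardware would instead refuse to wrap, and a full ring1 would block the whole IH, which is the deadlock described above.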
From: Philip Yang <Philip.Yang@amd.com>
With xnack on, the GPU vm fault handler decides the best restore location, migrates the range to that location, and updates the GPU mapping to recover from the GPU vm fault.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c   |   2 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c |  25 +++-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |   3 +-
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h    |   3 +
 drivers/gpu/drm/amd/amdkfd/kfd_process.c |  16 +++
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 162 +++++++++++++++++++----
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h     |   3 +-
 7 files changed, 180 insertions(+), 34 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c index bd9de870f8f1..50a8f4db22f6 100644 --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c @@ -3320,7 +3320,7 @@ bool amdgpu_vm_handle_fault(struct amdgpu_device *adev, unsigned int pasid, addr /= AMDGPU_GPU_PAGE_SIZE;
if (!amdgpu_noretry && is_compute_context && - !svm_range_restore_pages(adev, pasid, addr)) { + !svm_range_restore_pages(adev, vm, pasid, addr)) { amdgpu_bo_unref(&root); return true; } diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index d33a4cc63495..2095417c7846 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -441,6 +441,7 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange, * svm_migrate_ram_to_vram - migrate svm range from system to device * @prange: range structure * @best_loc: the device to migrate to + * @mm: the process mm structure * * Context: Process context, caller hold mm->mmap_sem and prange->lock and take * svms srcu read lock. @@ -448,12 +449,12 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange, * Return: * 0 - OK, otherwise error code */ -int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc) +int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc, + struct mm_struct *mm) { unsigned long addr, start, end; struct vm_area_struct *vma; struct amdgpu_device *adev; - struct mm_struct *mm; int r = 0;
if (prange->actual_loc == best_loc) { @@ -475,8 +476,6 @@ int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc) start = prange->it_node.start << PAGE_SHIFT; end = (prange->it_node.last + 1) << PAGE_SHIFT;
- mm = current->mm; - for (addr = start; addr < end;) { unsigned long next;
@@ -740,12 +739,26 @@ static vm_fault_t svm_migrate_to_ram(struct vm_fault *vmf) list_for_each_entry(prange, &list, update_list) { mutex_lock(&prange->mutex); r = svm_migrate_vram_to_ram(prange, vma->vm_mm); - mutex_unlock(&prange->mutex); if (r) { pr_debug("failed %d migrate [0x%lx 0x%lx] to ram\n", r, prange->it_node.start, prange->it_node.last); - goto out_srcu; + goto next; } + + /* xnack off, svm_range_restore_work will update GPU mapping */ + if (!p->xnack_enabled) + goto next; + + /* xnack on, update mapping on GPUs with ACCESS_IN_PLACE */ + r = svm_range_map_to_gpus(prange, true); + if (r) + pr_debug("failed %d to map svms 0x%p [0x%lx 0x%lx]\n", + r, prange->svms, prange->it_node.start, + prange->it_node.last); +next: + mutex_unlock(&prange->mutex); + if (r) + break; }
out_srcu: diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h index 95fd7b21791f..9949b55d3b6a 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h @@ -37,7 +37,8 @@ enum MIGRATION_COPY_DIR { FROM_VRAM_TO_RAM };
-int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc); +int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc, + struct mm_struct *mm); int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm); unsigned long svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h index d5367e770b39..db94f963eb7e 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h @@ -864,6 +864,9 @@ int kfd_process_gpuid_from_gpuidx(struct kfd_process *p, int kfd_process_gpuidx_from_gpuid(struct kfd_process *p, uint32_t gpu_id); int kfd_process_device_from_gpuidx(struct kfd_process *p, uint32_t gpu_idx, struct kfd_dev **gpu); +int kfd_process_gpuid_from_kgd(struct kfd_process *p, + struct amdgpu_device *adev, uint32_t *gpuid, + uint32_t *gpuidx); void kfd_unref_process(struct kfd_process *p); int kfd_process_evict_queues(struct kfd_process *p); int kfd_process_restore_queues(struct kfd_process *p); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_process.c b/drivers/gpu/drm/amd/amdkfd/kfd_process.c index f7a50a364d78..69970a3bc176 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_process.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_process.c @@ -1637,6 +1637,22 @@ int kfd_process_device_from_gpuidx(struct kfd_process *p, return -EINVAL; }
+int +kfd_process_gpuid_from_kgd(struct kfd_process *p, struct amdgpu_device *adev, + uint32_t *gpuid, uint32_t *gpuidx) +{ + struct kgd_dev *kgd = (struct kgd_dev *)adev; + int i; + + for (i = 0; i < p->n_pdds; i++) + if (p->pdds[i] && p->pdds[i]->dev->kgd == kgd) { + *gpuid = p->pdds[i]->dev->id; + *gpuidx = i; + return 0; + } + return -EINVAL; +} + static void evict_process_worker(struct work_struct *work) { int ret; diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 63b745a06740..8b57f5a471bd 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -1153,7 +1153,7 @@ svm_range_map_to_gpu(struct amdgpu_device *adev, struct amdgpu_vm *vm, return r; }
-static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm) +int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm) { DECLARE_BITMAP(bitmap, MAX_GPU_INSTANCE); struct kfd_process_device *pdd; @@ -1170,9 +1170,29 @@ static int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm) else bo_adev = NULL;
- bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip, - MAX_GPU_INSTANCE); p = container_of(prange->svms, struct kfd_process, svms); + if (p->xnack_enabled) { + bitmap_copy(bitmap, prange->bitmap_aip, MAX_GPU_INSTANCE); + + /* If prefetch range to GPU, or GPU retry fault migrate range to + * GPU, which has ACCESS attribute to the range, create mapping + * on that GPU. + */ + if (prange->actual_loc) { + gpuidx = kfd_process_gpuidx_from_gpuid(p, + prange->actual_loc); + if (gpuidx < 0) { + WARN_ONCE(1, "failed get device by id 0x%x\n", + prange->actual_loc); + return -EINVAL; + } + if (test_bit(gpuidx, prange->bitmap_access)) + bitmap_set(bitmap, gpuidx, 1); + } + } else { + bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip, + MAX_GPU_INSTANCE); + }
for_each_set_bit(gpuidx, bitmap, MAX_GPU_INSTANCE) { r = kfd_process_device_from_gpuidx(p, gpuidx, &dev); @@ -1678,16 +1698,77 @@ svm_range_from_addr(struct svm_range_list *svms, unsigned long addr) return container_of(node, struct svm_range, it_node); }
+/* svm_range_best_restore_location - decide the best fault restore location + * @prange: svm range structure + * @adev: the GPU on which vm fault happened + * + * This is only called when xnack is on, to decide the best location to restore + * the range mapping after GPU vm fault. Caller uses the best location to do + * migration if actual loc is not best location, then update GPU page table + * mapping to the best location. + * + * If vm fault gpu is range preferred loc, the best_loc is preferred loc. + * If vm fault gpu idx is on range ACCESSIBLE bitmap, best_loc is vm fault gpu + * If vm fault gpu idx is on range ACCESSIBLE_IN_PLACE bitmap, then + * if range actual loc is cpu, best_loc is cpu + * if vm fault gpu is on xgmi same hive of range actual loc gpu, best_loc is + * range actual loc. + * Otherwise, GPU no access, best_loc is -1. + * + * Return: + * -1 means vm fault GPU no access + * 0 for CPU or GPU id + */ +static int32_t +svm_range_best_restore_location(struct svm_range *prange, + struct amdgpu_device *adev) +{ + struct amdgpu_device *bo_adev; + struct kfd_process *p; + int32_t gpuidx; + uint32_t gpuid; + int r; + + p = container_of(prange->svms, struct kfd_process, svms); + + r = kfd_process_gpuid_from_kgd(p, adev, &gpuid, &gpuidx); + if (r < 0) { + pr_debug("failed to get gpuid from kgd\n"); + return -1; + } + + if (prange->preferred_loc == gpuid) + return prange->preferred_loc; + + if (test_bit(gpuidx, prange->bitmap_access)) + return gpuid; + + if (test_bit(gpuidx, prange->bitmap_aip)) { + if (!prange->actual_loc) + return 0; + + bo_adev = svm_range_get_adev_by_id(prange, prange->actual_loc); + if (amdgpu_xgmi_same_hive(adev, bo_adev)) + return prange->actual_loc; + else + return 0; + } + + return -1; +} + int -svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, - uint64_t addr) +svm_range_restore_pages(struct amdgpu_device *adev, struct amdgpu_vm *vm, + unsigned int pasid, uint64_t addr) { - int r = 0; - int srcu_idx; + struct 
amdgpu_device *bo_adev; struct mm_struct *mm = NULL; - struct svm_range *prange; struct svm_range_list *svms; + struct svm_range *prange; struct kfd_process *p; + int32_t best_loc; + int srcu_idx; + int r = 0;
p = kfd_lookup_process_by_pasid(pasid); if (!p) { @@ -1706,20 +1787,20 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, pr_debug("failed to find prange svms 0x%p address [0x%llx]\n", svms, addr); r = -EFAULT; - goto unlock_out; + goto out_srcu_unlock; }
if (!atomic_read(&prange->invalid)) { pr_debug("svms 0x%p [0x%lx %lx] already restored\n", svms, prange->it_node.start, prange->it_node.last); - goto unlock_out; + goto out_srcu_unlock; }
mm = get_task_mm(p->lead_thread); if (!mm) { pr_debug("svms 0x%p failed to get mm\n", svms); r = -ESRCH; - goto unlock_out; + goto out_srcu_unlock; }
mmap_read_lock(mm); @@ -1729,27 +1810,57 @@ svm_range_restore_pages(struct amdgpu_device *adev, unsigned int pasid, */ mutex_lock(&prange->mutex);
+ best_loc = svm_range_best_restore_location(prange, adev); + if (best_loc == -1) { + pr_debug("svms %p failed get best restore loc [0x%lx 0x%lx]\n", + svms, prange->it_node.start, prange->it_node.last); + r = -EACCES; + goto out_mmput; + } + + pr_debug("svms %p [0x%lx 0x%lx] best restore 0x%x, actual loc 0x%x\n", + svms, prange->it_node.start, prange->it_node.last, best_loc, + prange->actual_loc); + + if (prange->actual_loc != best_loc) { + if (best_loc) + r = svm_migrate_ram_to_vram(prange, best_loc, mm); + else + r = svm_migrate_vram_to_ram(prange, mm); + if (r) { + pr_debug("failed %d to migrate svms %p [0x%lx 0x%lx]\n", + r, svms, prange->it_node.start, + prange->it_node.last); + goto out_mmput; + } + } + r = svm_range_validate(mm, prange); if (r) { - pr_debug("failed %d to validate svms 0x%p [0x%lx 0x%lx]\n", r, + pr_debug("failed %d to validate svms %p [0x%lx 0x%lx]\n", r, svms, prange->it_node.start, prange->it_node.last); - - goto mmput_out; + goto out_mmput; }
- pr_debug("restoring svms 0x%p [0x%lx %lx] mapping\n", - svms, prange->it_node.start, prange->it_node.last); + if (prange->svm_bo && prange->mm_nodes) + bo_adev = amdgpu_ttm_adev(prange->svm_bo->bo->tbo.bdev); + else + bo_adev = NULL; + + pr_debug("restoring svms 0x%p [0x%lx %lx] mapping, bo_adev is %s\n", + svms, prange->it_node.start, prange->it_node.last, + bo_adev ? "not NULL" : "NULL");
r = svm_range_map_to_gpus(prange, true); if (r) - pr_debug("failed %d to map svms 0x%p [0x%lx 0x%lx] to gpu\n", r, - svms, prange->it_node.start, prange->it_node.last); + pr_debug("failed %d to map svms 0x%p [0x%lx 0x%lx] to gpus\n", + r, svms, prange->it_node.start, prange->it_node.last);
-mmput_out: +out_mmput: mutex_unlock(&prange->mutex); mmap_read_unlock(mm); mmput(mm); -unlock_out: +out_srcu_unlock: srcu_read_unlock(&svms->srcu, srcu_idx); kfd_unref_process(p);
@@ -1882,7 +1993,7 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size, return 0; }
-/* svm_range_best_location - decide the best actual location +/* svm_range_best_prefetch_location - decide the best prefetch location * @prange: svm range structure * * For xnack off: @@ -1904,7 +2015,8 @@ svm_range_add(struct kfd_process *p, uint64_t start, uint64_t size, * Return: * 0 for CPU or GPU id */ -static uint32_t svm_range_best_location(struct svm_range *prange) +static uint32_t +svm_range_best_prefetch_location(struct svm_range *prange) { DECLARE_BITMAP(bitmap, MAX_GPU_INSTANCE); uint32_t best_loc = prange->prefetch_loc; @@ -1980,7 +2092,7 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange, int r = 0;
*migrated = false; - best_loc = svm_range_best_location(prange); + best_loc = svm_range_best_prefetch_location(prange);
if (best_loc == KFD_IOCTL_SVM_LOCATION_UNDEFINED || best_loc == prange->actual_loc) @@ -2001,10 +2113,10 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange, }
pr_debug("migrate from ram to vram\n"); - r = svm_migrate_ram_to_vram(prange, best_loc); + r = svm_migrate_ram_to_vram(prange, best_loc, mm); } else { pr_debug("migrate from vram to ram\n"); - r = svm_migrate_vram_to_ram(prange, current->mm); + r = svm_migrate_vram_to_ram(prange, mm); }
if (!r) diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h index 143573621956..0685eb04b87c 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h @@ -125,8 +125,9 @@ int svm_range_vram_node_new(struct amdgpu_device *adev, void svm_range_vram_node_free(struct svm_range *prange); int svm_range_split_by_granularity(struct kfd_process *p, unsigned long addr, struct list_head *list); -int svm_range_restore_pages(struct amdgpu_device *adev, +int svm_range_restore_pages(struct amdgpu_device *adev, struct amdgpu_vm *vm, unsigned int pasid, uint64_t addr); int svm_range_schedule_evict_svm_bo(struct amdgpu_amdkfd_fence *fence); +int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm);
#endif /* KFD_SVM_H_ */
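The decision tree that the svm_range_best_restore_location() comment above documents can be sketched as a standalone function. This is a simplified illustration, not the driver code: plain parameters stand in for the svm_range bitmaps, the faulting GPU lookup, and the amdgpu_xgmi_same_hive() check.

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/* Restore-location policy sketch.
 * Returns -1 for no access, 0 for CPU, or a GPU id.
 * 0 stands for system memory (CPU) throughout, as in the real code. */
static int32_t best_restore_location(uint32_t preferred_loc,
				     bool fault_gpu_accessible,
				     bool fault_gpu_accessible_in_place,
				     uint32_t actual_loc,
				     bool same_xgmi_hive,
				     uint32_t fault_gpuid)
{
	/* Faulting GPU is the preferred location: restore there. */
	if (preferred_loc == fault_gpuid)
		return preferred_loc;

	/* Faulting GPU has the ACCESS attribute: migrate to it. */
	if (fault_gpu_accessible)
		return fault_gpuid;

	/* ACCESS_IN_PLACE: keep the data where it is, if reachable. */
	if (fault_gpu_accessible_in_place) {
		if (!actual_loc)
			return 0;	/* range lives in system memory */
		/* keep the VRAM location only if reachable over xgmi */
		return same_xgmi_hive ? (int32_t)actual_loc : 0;
	}

	return -1;	/* faulting GPU has no access to this range */
}
```

The caller then migrates only if the actual location differs from the returned one, and maps the range on the faulting GPU afterwards.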
From: Philip Yang <Philip.Yang@amd.com>
With xnack on, add a validate timestamp in order to handle GPU vm faults from multiple GPUs.
If a GPU retry fault requires migrating the range to the best restore location, use the range validate timestamp to record the system time after the range is restored and the GPU page table is updated.
Because multiple pages of the same range generate multiple retry faults, define AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING as a time period long enough that pending retry faults may still arrive after the page table update, and use it to skip duplicate retry faults on the same range.
If the difference between the system timestamp and the range's last validate timestamp is bigger than AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING, the retry fault is from another GPU, so continue with retry fault recovery.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 27 +++++++++++++++++++++++----
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h |  2 ++
 2 files changed, 25 insertions(+), 4 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 8b57f5a471bd..65f20a72ddcb 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -34,6 +34,11 @@
#define AMDGPU_SVM_RANGE_RESTORE_DELAY_MS 1
+/* Long enough to ensure no retry fault comes after svm range is restored and + * page table is updated. + */ +#define AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING 2000 + static void svm_range_evict_svm_bo_worker(struct work_struct *work); /** * svm_range_unlink - unlink svm_range from lists and interval tree @@ -122,6 +127,7 @@ svm_range *svm_range_new(struct svm_range_list *svms, uint64_t start, INIT_LIST_HEAD(&prange->remove_list); INIT_LIST_HEAD(&prange->svm_bo_list); atomic_set(&prange->invalid, 0); + prange->validate_timestamp = ktime_to_us(ktime_get()); mutex_init(&prange->mutex); spin_lock_init(&prange->svm_bo_lock); svm_range_set_default_attributes(&prange->preferred_loc, @@ -482,20 +488,28 @@ static int svm_range_validate_vram(struct svm_range *prange) static int svm_range_validate(struct mm_struct *mm, struct svm_range *prange) { + struct kfd_process *p; int r;
pr_debug("svms 0x%p [0x%lx 0x%lx] actual loc 0x%x\n", prange->svms, prange->it_node.start, prange->it_node.last, prange->actual_loc);
+ p = container_of(prange->svms, struct kfd_process, svms); + if (!prange->actual_loc) r = svm_range_validate_ram(mm, prange); else r = svm_range_validate_vram(prange);
- pr_debug("svms 0x%p [0x%lx 0x%lx] ret %d invalid %d\n", prange->svms, - prange->it_node.start, prange->it_node.last, - r, atomic_read(&prange->invalid)); + if (!r) { + if (p->xnack_enabled) + atomic_set(&prange->invalid, 0); + prange->validate_timestamp = ktime_to_us(ktime_get()); + } + + pr_debug("svms 0x%p [0x%lx 0x%lx] ret %d\n", prange->svms, + prange->it_node.start, prange->it_node.last, r);
return r; } @@ -1766,6 +1780,7 @@ svm_range_restore_pages(struct amdgpu_device *adev, struct amdgpu_vm *vm, struct svm_range_list *svms; struct svm_range *prange; struct kfd_process *p; + uint64_t timestamp; int32_t best_loc; int srcu_idx; int r = 0; @@ -1790,7 +1805,11 @@ svm_range_restore_pages(struct amdgpu_device *adev, struct amdgpu_vm *vm, goto out_srcu_unlock; }
- if (!atomic_read(&prange->invalid)) { + mutex_lock(&prange->mutex); + timestamp = ktime_to_us(ktime_get()) - prange->validate_timestamp; + mutex_unlock(&prange->mutex); + /* skip duplicate vm fault on different pages of same range */ + if (timestamp < AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING) { pr_debug("svms 0x%p [0x%lx %lx] already restored\n", svms, prange->it_node.start, prange->it_node.last); goto out_srcu_unlock; diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h index 0685eb04b87c..466ec5537bbb 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.h @@ -66,6 +66,7 @@ struct svm_range_bo { * @actual_loc: the actual location, 0 for CPU, or GPU id * @granularity:migration granularity, log2 num pages * @invalid: not 0 means cpu page table is invalidated + * @validate_timestamp: system timestamp when range is validated * @bitmap_access: index bitmap of GPUs which can access the range * @bitmap_aip: index bitmap of GPUs which can access the range in place * @@ -95,6 +96,7 @@ struct svm_range { uint32_t actual_loc; uint8_t granularity; atomic_t invalid; + uint64_t validate_timestamp; DECLARE_BITMAP(bitmap_access, MAX_GPU_INSTANCE); DECLARE_BITMAP(bitmap_aip, MAX_GPU_INSTANCE); };
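The duplicate-fault filter this patch adds can be sketched in isolation. This is a simplification of the check in svm_range_restore_pages(), not the driver code: the current time is passed in as a parameter instead of being read with ktime_get(), and the migrate/validate/map work is elided to a comment.

```c
#include <stdbool.h>
#include <stdint.h>
#include <assert.h>

/* Long enough that no retry fault should still be pending for a range
 * after it was restored and its page table updated (microseconds). */
#define RETRY_FAULT_PENDING_US 2000ULL

struct toy_range {
	uint64_t validate_timestamp_us;	/* set when range was last validated */
};

/* Returns true if the fault was handled (restore + revalidate), false if
 * it was skipped as a duplicate fault on another page of the same range. */
static bool toy_handle_retry_fault(struct toy_range *r, uint64_t now_us)
{
	if (now_us - r->validate_timestamp_us < RETRY_FAULT_PENDING_US)
		return false;	/* recently validated: duplicate, skip */

	/* ...migrate to best restore location, validate, map to GPUs... */

	r->validate_timestamp_us = now_us;	/* record validation time */
	return true;
}
```

A fault arriving outside the window is treated as genuine, for example a fault on the same range from a different GPU, and goes through the full recovery path again.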
From: Philip Yang <Philip.Yang@amd.com>
If a range is prefetched to a GPU while its actual location is another GPU, or a GPU retry fault restores pages of a range whose actual location is another GPU, then migrate from one GPU to the other.
Use system memory as a bridge because the SDMA engine may not be able to access another GPU's vram: use the source GPU's SDMA to migrate to system memory, then use the destination GPU's SDMA to migrate from system memory to the destination GPU.
Print out gpuid or gpuidx in debug messages.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 57 +++++++++++++++++--
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h |  4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c     | 70 +++++++++++++++++-------
 3 files changed, 103 insertions(+), 28 deletions(-)
diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c index 2095417c7846..6c644472cead 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c @@ -449,8 +449,9 @@ svm_migrate_vma_to_vram(struct amdgpu_device *adev, struct svm_range *prange, * Return: * 0 - OK, otherwise error code */ -int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc, - struct mm_struct *mm) +static int +svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc, + struct mm_struct *mm) { unsigned long addr, start, end; struct vm_area_struct *vma; @@ -470,8 +471,8 @@ int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc, return -ENODEV; }
- pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, - prange->it_node.start, prange->it_node.last); + pr_debug("svms 0x%p [0x%lx 0x%lx] to gpu 0x%x\n", prange->svms, + prange->it_node.start, prange->it_node.last, best_loc);
start = prange->it_node.start << PAGE_SHIFT; end = (prange->it_node.last + 1) << PAGE_SHIFT; @@ -668,8 +669,9 @@ int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm) return -ENODEV; }
- pr_debug("svms 0x%p [0x%lx 0x%lx]\n", prange->svms, - prange->it_node.start, prange->it_node.last); + pr_debug("svms 0x%p [0x%lx 0x%lx] from gpu 0x%x to ram\n", prange->svms, + prange->it_node.start, prange->it_node.last, + prange->actual_loc);
start = prange->it_node.start << PAGE_SHIFT; end = (prange->it_node.last + 1) << PAGE_SHIFT; @@ -696,6 +698,49 @@ int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm) return r; }
+/** + * svm_migrate_vram_to_vram - migrate svm range from device to device + * @prange: range structure + * @best_loc: the device to migrate to + * @mm: process mm, use current->mm if NULL + * + * Context: Process context, caller hold mm->mmap_sem and prange->lock and take + * svms srcu read lock + * + * Return: + * 0 - OK, otherwise error code + */ +static int +svm_migrate_vram_to_vram(struct svm_range *prange, uint32_t best_loc, + struct mm_struct *mm) +{ + int r; + + /* + * TODO: for both devices with PCIe large bar or on same xgmi hive, skip + * system memory as migration bridge + */ + + pr_debug("from gpu 0x%x to gpu 0x%x\n", prange->actual_loc, best_loc); + + r = svm_migrate_vram_to_ram(prange, mm); + if (r) + return r; + + return svm_migrate_ram_to_vram(prange, best_loc, mm); +} + +int +svm_migrate_to_vram(struct svm_range *prange, uint32_t best_loc, + struct mm_struct *mm) +{ + if (!prange->actual_loc) + return svm_migrate_ram_to_vram(prange, best_loc, mm); + else + return svm_migrate_vram_to_vram(prange, best_loc, mm); + +} + /** * svm_migrate_to_ram - CPU page fault handler * @vmf: CPU vm fault vma, address diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h index 9949b55d3b6a..bc680619d135 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.h @@ -37,8 +37,8 @@ enum MIGRATION_COPY_DIR { FROM_VRAM_TO_RAM };
-int svm_migrate_ram_to_vram(struct svm_range *prange, uint32_t best_loc, - struct mm_struct *mm); +int svm_migrate_to_vram(struct svm_range *prange, uint32_t best_loc, + struct mm_struct *mm); int svm_migrate_vram_to_ram(struct svm_range *prange, struct mm_struct *mm); unsigned long svm_migrate_addr_to_pfn(struct amdgpu_device *adev, unsigned long addr); diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c index 65f20a72ddcb..d029fce94db0 100644 --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c @@ -288,8 +288,11 @@ static void svm_range_bo_unref(struct svm_range_bo *svm_bo) kref_put(&svm_bo->kref, svm_range_bo_release); }
-static bool svm_range_validate_svm_bo(struct svm_range *prange) +static bool +svm_range_validate_svm_bo(struct amdgpu_device *adev, struct svm_range *prange) { + struct amdgpu_device *bo_adev; + spin_lock(&prange->svm_bo_lock); if (!prange->svm_bo) { spin_unlock(&prange->svm_bo_lock); @@ -301,6 +304,22 @@ static bool svm_range_validate_svm_bo(struct svm_range *prange) return true; } if (svm_bo_ref_unless_zero(prange->svm_bo)) { + /* + * Migrate from GPU to GPU, remove range from source bo_adev + * svm_bo range list, and return false to allocate svm_bo from + * destination adev. + */ + bo_adev = amdgpu_ttm_adev(prange->svm_bo->bo->tbo.bdev); + if (bo_adev != adev) { + spin_unlock(&prange->svm_bo_lock); + + spin_lock(&prange->svm_bo->list_lock); + list_del_init(&prange->svm_bo_list); + spin_unlock(&prange->svm_bo->list_lock); + + svm_range_bo_unref(prange->svm_bo); + return false; + } if (READ_ONCE(prange->svm_bo->evicting)) { struct dma_fence *f; struct svm_range_bo *svm_bo; @@ -374,7 +393,7 @@ svm_range_vram_node_new(struct amdgpu_device *adev, struct svm_range *prange, pr_debug("pasid: %x svms 0x%p [0x%lx 0x%lx]\n", p->pasid, prange->svms, prange->it_node.start, prange->it_node.last);
- if (svm_range_validate_svm_bo(prange)) + if (svm_range_validate_svm_bo(adev, prange)) return 0;
svm_bo = svm_range_bo_new(); @@ -1209,6 +1228,7 @@ int svm_range_map_to_gpus(struct svm_range *prange, bool reserve_vm) }
 	for_each_set_bit(gpuidx, bitmap, MAX_GPU_INSTANCE) {
+		pr_debug("mapping to gpu idx 0x%x\n", gpuidx);
 		r = kfd_process_device_from_gpuidx(p, gpuidx, &dev);
 		if (r) {
 			pr_debug("failed to find device idx %d\n", gpuidx);
@@ -1843,7 +1863,7 @@ svm_range_restore_pages(struct amdgpu_device *adev, struct amdgpu_vm *vm,
 	if (prange->actual_loc != best_loc) {
 		if (best_loc)
-			r = svm_migrate_ram_to_vram(prange, best_loc, mm);
+			r = svm_migrate_to_vram(prange, best_loc, mm);
 		else
 			r = svm_migrate_vram_to_ram(prange, mm);
 		if (r) {
@@ -2056,6 +2076,11 @@ svm_range_best_prefetch_location(struct svm_range *prange)
 		goto out;
 
 	bo_adev = svm_range_get_adev_by_id(prange, best_loc);
+	if (!bo_adev) {
+		WARN_ONCE(1, "failed to get device by id 0x%x\n", best_loc);
+		best_loc = 0;
+		goto out;
+	}
 	bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip,
 		  MAX_GPU_INSTANCE);
 
@@ -2076,6 +2101,7 @@ svm_range_best_prefetch_location(struct svm_range *prange)
 	pr_debug("xnack %d svms 0x%p [0x%lx 0x%lx] best loc 0x%x\n",
 		 p->xnack_enabled, &p->svms, prange->it_node.start,
 		 prange->it_node.last, best_loc);
+
 	return best_loc;
 }
 
@@ -2117,29 +2143,33 @@ svm_range_trigger_migration(struct mm_struct *mm, struct svm_range *prange,
 	    best_loc == prange->actual_loc)
 		return 0;
 
+	/*
+	 * Prefetch to GPU without host access flag, set actual_loc to gpu, then
+	 * validate on gpu and map to gpus will be handled afterwards.
+	 */
 	if (best_loc && !prange->actual_loc &&
-	    !(prange->flags & KFD_IOCTL_SVM_FLAG_HOST_ACCESS))
+	    !(prange->flags & KFD_IOCTL_SVM_FLAG_HOST_ACCESS)) {
+		prange->actual_loc = best_loc;
 		return 0;
+	}
 
-	if (best_loc) {
-		if (!prange->actual_loc && !prange->pages_addr) {
-			pr_debug("host access and prefetch to gpu\n");
-			r = svm_range_validate_ram(mm, prange);
-			if (r) {
-				pr_debug("failed %d to validate on ram\n", r);
-				return r;
-			}
-		}
-
-		pr_debug("migrate from ram to vram\n");
-		r = svm_migrate_ram_to_vram(prange, best_loc, mm);
-	} else {
-		pr_debug("migrate from vram to ram\n");
+	if (!best_loc) {
 		r = svm_migrate_vram_to_ram(prange, mm);
+		*migrated = !r;
+		return r;
+	}
+
+	if (!prange->actual_loc && !prange->pages_addr) {
+		pr_debug("host access and prefetch to gpu\n");
+		r = svm_range_validate_ram(mm, prange);
+		if (r) {
+			pr_debug("failed %d to validate on ram\n", r);
+			return r;
+		}
 	}
 
-	if (!r)
-		*migrated = true;
+	r = svm_migrate_to_vram(prange, best_loc, mm);
+	*migrated = !r;
 
 	return r;
 }
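To make the early-return restructuring in the last hunk easier to review, here is a minimal userspace mock of the new control flow. All types and helpers are hypothetical stand-ins for the real KFD code (the real functions take an mm_struct, locking, etc.); it only models when *migrated gets set and when the prefetch path short-circuits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

#define KFD_IOCTL_SVM_FLAG_HOST_ACCESS 0x1

struct prange { uint32_t actual_loc, flags; void *pages_addr; };

/* Stubbed migrate/validate helpers: 0 means success. */
static int svm_migrate_to_vram(struct prange *p, uint32_t loc)
{ p->actual_loc = loc; return 0; }
static int svm_migrate_vram_to_ram(struct prange *p)
{ p->actual_loc = 0; return 0; }
static int svm_range_validate_ram(struct prange *p)
{ (void)p; return 0; }

static int trigger_migration(struct prange *p, uint32_t best_loc, bool *migrated)
{
	int r;

	*migrated = false;
	if (best_loc == p->actual_loc)
		return 0;

	/* Prefetch to GPU without host access flag: just record the new
	 * location; validate-on-gpu and mapping happen afterwards. */
	if (best_loc && !p->actual_loc &&
	    !(p->flags & KFD_IOCTL_SVM_FLAG_HOST_ACCESS)) {
		p->actual_loc = best_loc;
		return 0;
	}

	if (!best_loc) {
		r = svm_migrate_vram_to_ram(p);
		*migrated = !r;
		return r;
	}

	if (!p->actual_loc && !p->pages_addr) {
		/* host access and prefetch to gpu: validate on ram first */
		r = svm_range_validate_ram(p);
		if (r)
			return r;
	}

	r = svm_migrate_to_vram(p, best_loc);
	*migrated = !r;
	return r;
}
```

Note how the no-host-access prefetch returns with *migrated still false: the actual migration is deferred to the retry-fault path.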
On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
This is the first version of our HMM based shared virtual memory manager for KFD. There are still a number of known issues that we're working through (see below). This will likely lead to some pretty significant changes in MMU notifier handling and locking on the migration code paths. So don't get hung up on those details yet.
But I think this is a good time to start getting feedback. We're pretty confident about the ioctl API, which is both simple and extensible for the future. (see patches 4,16) The user mode side of the API can be found here: https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wi...
I'd also like another pair of eyes on how we're interfacing with the GPU VM code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25), and some retry IRQ handling changes (32).
Known issues:
- won't work with IOMMU enabled, we need to dma_map all pages properly
- still working on some race conditions and random bugs
- performance is not great yet
Still catching up, but I think there's another one for your list:
- hmm gpu context preempt vs page fault handling. I've had a short discussion about this one with Christian before the holidays, and also some private chats with Jerome. It's nasty since there's no easy fix, much less a good idea what's the best approach here.
I'll try to look at this more in-depth when I'm catching up on mails. -Daniel
Alex Sierra (12): drm/amdgpu: replace per_device_list by array drm/amdkfd: helper to convert gpu id and idx drm/amdkfd: add xnack enabled flag to kfd_process drm/amdkfd: add ioctl to configure and query xnack retries drm/amdkfd: invalidate tables on page retry fault drm/amdkfd: page table restore through svm API drm/amdkfd: SVM API call to restore page tables drm/amdkfd: add svm_bo reference for eviction fence drm/amdgpu: add param bit flag to create SVM BOs drm/amdkfd: add svm_bo eviction mechanism support drm/amdgpu: svm bo enable_signal call condition drm/amdgpu: add svm_bo eviction to enable_signal cb
Philip Yang (23): drm/amdkfd: select kernel DEVICE_PRIVATE option drm/amdkfd: add svm ioctl API drm/amdkfd: Add SVM API support capability bits drm/amdkfd: register svm range drm/amdkfd: add svm ioctl GET_ATTR op drm/amdgpu: add common HMM get pages function drm/amdkfd: validate svm range system memory drm/amdkfd: register overlap system memory range drm/amdkfd: deregister svm range drm/amdgpu: export vm update mapping interface drm/amdkfd: map svm range to GPUs drm/amdkfd: svm range eviction and restore drm/amdkfd: register HMM device private zone drm/amdkfd: validate vram svm range from TTM drm/amdkfd: support xgmi same hive mapping drm/amdkfd: copy memory through gart table drm/amdkfd: HMM migrate ram to vram drm/amdkfd: HMM migrate vram to ram drm/amdgpu: reserve fence slot to update page table drm/amdgpu: enable retry fault wptr overflow drm/amdkfd: refine migration policy with xnack on drm/amdkfd: add svm range validate timestamp drm/amdkfd: multiple gpu migrate vram to vram
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 3 + drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 4 +- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 16 +- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 13 +- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 83 + drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h | 7 + drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 5 + drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 90 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 47 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 10 + drivers/gpu/drm/amd/amdgpu/vega10_ih.c | 32 +- drivers/gpu/drm/amd/amdgpu/vega20_ih.c | 32 +- drivers/gpu/drm/amd/amdkfd/Kconfig | 1 + drivers/gpu/drm/amd/amdkfd/Makefile | 4 +- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 170 +- drivers/gpu/drm/amd/amdkfd/kfd_iommu.c | 8 +- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 866 ++++++ drivers/gpu/drm/amd/amdkfd/kfd_migrate.h | 59 + drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 52 +- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 200 +- .../amd/amdkfd/kfd_process_queue_manager.c | 6 +- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2564 +++++++++++++++++ drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 135 + drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 1 + drivers/gpu/drm/amd/amdkfd/kfd_topology.h | 10 +- include/uapi/linux/kfd_ioctl.h | 169 +- 26 files changed, 4296 insertions(+), 291 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
-- 2.29.2
dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
On 2021-01-07 at 4:23 a.m., Daniel Vetter wrote:
Do you have a pointer to that discussion or any more details?
Thanks, Felix
On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
Do you have a pointer to that discussion or any more details?
Essentially if you're handling an hmm page fault from the gpu, you can deadlock by calling dma_fence_wait on a (possibly chained) set of other command submissions or compute contexts. Which deadlocks if you can't preempt while you have that page fault pending. Two solutions:
- your hw can (at least for compute ctx) preempt even when a page fault is pending
- lots of screaming in trying to come up with an alternate solution. They all suck.
Note that the dma_fence_wait is a hard requirement, because we need that for mmu notifiers and shrinkers; disallowing that would disable dynamic memory management. Which is the current "ttm is self-limited to 50% of system memory" limitation Christian is trying to lift. So that's really not a restriction we can lift, at least not in upstream where we need to also support old style hardware which doesn't have page fault support and really has no other option to handle memory management than dma_fence_wait.
Thread was here:
https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1N...
There's a few ways to resolve this (without having preempt-capable hardware), but they're all supremely nasty. -Daniel
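The deadlock above is at bottom a cycle in a wait-for graph. The following toy model (node names are illustrative, not kernel objects) makes that concrete: reclaim waits on a fence, the fence waits on preemption, preemption waits on the pending page fault, and the fault's servicing waits on reclaim. Hardware that can preempt with a fault pending removes one edge and breaks the cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { RECLAIM, FENCE, PREEMPT, FAULT, NNODES };

/* edge[a][b]: a can only make progress once b does. */
static bool edge[NNODES][NNODES];

static bool reachable(int from, int to, bool *seen)
{
	if (from == to)
		return true;
	seen[from] = true;
	for (int n = 0; n < NNODES; n++)
		if (edge[from][n] && !seen[n] && reachable(n, to, seen))
			return true;
	return false;
}

static bool has_cycle(void)
{
	for (int n = 0; n < NNODES; n++) {
		bool seen[NNODES] = { false };
		for (int m = 0; m < NNODES; m++)
			if (edge[n][m] && reachable(m, n, seen))
				return true;
	}
	return false;
}

static void model(bool hw_preempts_with_fault_pending)
{
	memset(edge, 0, sizeof(edge));
	edge[RECLAIM][FENCE] = true;  /* shrinker/notifier: dma_fence_wait() */
	edge[FENCE][PREEMPT] = true;  /* fence signals once the ctx preempts */
	edge[FAULT][RECLAIM] = true;  /* servicing the fault may allocate */
	if (!hw_preempts_with_fault_pending)
		edge[PREEMPT][FAULT] = true; /* preempt stuck behind fault */
}
```

Jerome's reverse-dependency-tracking idea mentioned further down amounts to maintaining this graph at runtime and refusing to add a cycle-closing edge.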
On 2021-01-08 at 9:40 a.m., Daniel Vetter wrote:
- your hw can (at least for compute ctx) preempt even when a page fault is pending
Our GFXv9 GPUs can do this. GFXv10 cannot.
- lots of screaming in trying to come up with an alternate solution. They all suck.
My idea for GFXv10 is to avoid preemption for memory management purposes and rely 100% on page faults instead. That is, if the memory manager needs to prevent GPU access to certain memory, just invalidate the GPU page table entries pointing to that memory. No waiting for fences is necessary, except for the SDMA job that invalidates the PTEs, which runs on a special high-priority queue that should never deadlock. That should prevent the CPU getting involved in deadlocks in kernel mode. But you can still deadlock the GPU in user mode if all compute units get stuck in page faults and can't switch to any useful work any more. So it's possible that we won't be able to use GPU page faults on our GFXv10 GPUs.
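Here is a toy userspace model of that eviction scheme (all names hypothetical, not the real amdgpu code): evicting a range submits a PTE-invalidate job on the reserved high-priority queue and never calls dma_fence_wait, so the CPU side cannot join a fence deadlock; the context simply retry-faults on its next access.

```c
#include <assert.h>
#include <stdbool.h>

#define NPTES 8

struct gpuvm {
	bool pte_valid[NPTES];
	int fence_waits;   /* must stay 0 on the eviction path */
	int hiprio_jobs;   /* PTE updates queued on the high-prio queue */
};

/* Submit a PTE-invalidate job on the reserved high-priority SDMA queue;
 * modeled as immediate since that queue should never deadlock. */
static void hiprio_zap(struct gpuvm *vm, int first, int last)
{
	for (int i = first; i <= last; i++)
		vm->pte_valid[i] = false;
	vm->hiprio_jobs++;
}

static void evict_range(struct gpuvm *vm, int first, int last)
{
	/* Deliberately no dma_fence_wait(): GPU access is cut off by the
	 * PTE zap alone; the ctx takes a retry fault on its next access. */
	hiprio_zap(vm, first, last);
}
```

The user-mode hazard described above is outside this model: if every compute unit is stuck in a retry fault, the GPU itself can still livelock even though the kernel never waits.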
Regards, Felix
On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling felix.kuehling@amd.com wrote:
Our GFXv9 GPUs can do this. GFXv10 cannot.
Uh, why did your hw guys drop this :-/
My idea for GFXv10 is to avoid preemption for memory management purposes and rely 100% on page faults instead. That is, if the memory manager needs to prevent GPU access to certain memory, just invalidate the GPU page table entries pointing to that memory. No waiting for fences is necessary, except for the SDMA job that invalidates the PTEs, which runs on a special high-priority queue that should never deadlock. That should prevent the CPU getting involved in deadlocks in kernel mode. But you can still deadlock the GPU in user mode if all compute units get stuck in page faults and can't switch to any useful work any more. So it's possible that we won't be able to use GPU page faults on our GFXv10 GPUs.
This only works if _everything_ in the system works like this, since you're defacto breaking the cross-driver contract. As soon as there's some legacy gl workload (userptr) or another driver involved, this approach falls apart.
I do think it can be rescued with what I call gang scheduling of engines: I.e. when a given engine is running a context (or a group of engines, depending how your hw works) that can cause a page fault, you must flush out all workloads running on the same engine which could block a dma_fence (preempt them, or for non-compute stuff, force their completion). And the other way round, i.e. before you can run a legacy gl workload with a dma_fence on these engines you need to preempt all ctxs that could cause page faults and take them at least out of the hw scheduler queue.
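The gang-scheduling rule could be sketched like this, as a toy admission policy with hypothetical names: an engine never runs fence-signalling work and page-faultable work at the same time, and switching gangs first flushes (preempts or force-completes) the other kind.

```python
# Toy model of per-engine gang scheduling: fence-based and faultable
# workloads never share an engine. Purely illustrative structure.

class Engine:
    def __init__(self):
        self.mode = None            # "fence" or "faultable"
        self.active = []

    def submit(self, job, faultable):
        mode = "faultable" if faultable else "fence"
        if self.mode not in (None, mode):
            # Flush the other gang first: preempt faultable ctxs, or
            # force completion of fence-signalling jobs.
            self.active.clear()
        self.mode = mode
        self.active.append(job)

e = Engine()
e.submit("gl_draw", faultable=False)
e.submit("hmm_compute", faultable=True)   # gl_draw flushed out first
print(e.active)                           # ['hmm_compute']
```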
Just reserving an sdma engine for copy jobs and ptes updates and that stuff is necessary, but not sufficient.
Another approach that Jerome suggested is to track the reverse dependency graph of all dma_fence somehow and make sure that direct reclaim never recurses on an engine you're serving a pagefault for. Possible in theory, but in practice I think not feasible, because it would be way too much work to implement.
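For illustration only, the dependency-tracking idea might look something like the sketch below (hypothetical names; real fence tracking would be far more involved, which is the "way too much work" point): given the engine we're serving a fault for, refuse to wait on any fence that transitively depends on it.

```python
# Toy reverse-dependency check: is it safe for reclaim to wait on a
# fence while a page fault is pending on `faulting_engine`?
from collections import defaultdict

deps = defaultdict(set)         # fence -> fences/engines it depends on

def add_dep(fence, on):
    deps[fence].add(on)

def safe_to_wait(fence, faulting_engine):
    """False if waiting on `fence` could recurse onto the faulting engine."""
    stack, seen = [fence], set()
    while stack:
        f = stack.pop()
        if f == faulting_engine:
            return False
        if f in seen:
            continue
        seen.add(f)
        stack.extend(deps[f])
    return True

add_dep("render_fence", "compute_fence")
add_dep("compute_fence", "engine0")        # compute job runs on engine0
print(safe_to_wait("render_fence", "engine0"))   # False: would deadlock
print(safe_to_wait("render_fence", "engine1"))   # True
```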
Either way it's imo really nasty to come up with a scheme here that doesn't fail in some corner, or becomes really nasty with inconsistent rules across different drivers and hw :-(
Cheers, Daniel
Regards, Felix
Note that dma_fence_wait is a hard requirement, because we need it for mmu notifiers and shrinkers; disallowing it would disable dynamic memory management. That is the current "ttm is self-limited to 50% of system memory" limitation Christian is trying to lift. So that's really not a restriction we can lift, at least not in upstream, where we also need to support old-style hardware which doesn't have page fault support and really has no other option for handling memory management than dma_fence_wait.
Thread was here:
https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1N...
There's a few ways to resolve this (without having preempt-capable hardware), but they're all supremely nasty. -Daniel
Thanks, Felix
I'll try to look at this more in-depth when I'm catching up on mails. -Daniel
drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 3 + drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 4 +- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 16 +- .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 13 +- drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 83 + drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h | 7 + drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 5 + drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 90 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 47 +- drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 10 + drivers/gpu/drm/amd/amdgpu/vega10_ih.c | 32 +- drivers/gpu/drm/amd/amdgpu/vega20_ih.c | 32 +- drivers/gpu/drm/amd/amdkfd/Kconfig | 1 + drivers/gpu/drm/amd/amdkfd/Makefile | 4 +- drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 170 +- drivers/gpu/drm/amd/amdkfd/kfd_iommu.c | 8 +- drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 866 ++++++ drivers/gpu/drm/amd/amdkfd/kfd_migrate.h | 59 + drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 52 +- drivers/gpu/drm/amd/amdkfd/kfd_process.c | 200 +- .../amd/amdkfd/kfd_process_queue_manager.c | 6 +- drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2564 +++++++++++++++++ drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 135 + drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 1 + drivers/gpu/drm/amd/amdkfd/kfd_topology.h | 10 +- include/uapi/linux/kfd_ioctl.h | 169 +- 26 files changed, 4296 insertions(+), 291 deletions(-) create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
-- 2.29.2
dri-devel mailing list dri-devel@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/dri-devel
Am 2021-01-08 um 11:06 a.m. schrieb Daniel Vetter:
On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling felix.kuehling@amd.com wrote:
Am 2021-01-08 um 9:40 a.m. schrieb Daniel Vetter:
On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
This only works if _everything_ in the system works like this, since you're defacto breaking the cross-driver contract. As soon as there's some legacy gl workload (userptr) or another driver involved, this approach falls apart.
I think the scenario you have in mind involves a dma_fence that depends on the resolution of a GPU page fault. With our user mode command submission model for compute contexts, there are no DMA fences that get signaled by compute jobs that could get stuck on page faults.
The legacy GL workload would not get GPU page faults. The only way it could get stuck is if all CUs are stuck on page faults and the command processor can't find any HW resources to execute it on. That's my user mode deadlock scenario below. So yeah, you're right, kernel mode can't avoid getting involved in that unless everything uses user mode command submissions.
If (big if) we switched to user mode command submission for all compute and graphics contexts, and no longer used DMA fences to signal their completion, I think that would solve the problem as far as the kernel is concerned.
I do think it can be rescued with what I call gang scheduling of engines: I.e. when a given engine is running a context (or a group of engines, depending how your hw works) that can cause a page fault, you must flush out all workloads running on the same engine which could block a dma_fence (preempt them, or for non-compute stuff, force their completion). And the other way round, i.e. before you can run a legacy gl workload with a dma_fence on these engines you need to preempt all ctxs that could cause page faults and take them at least out of the hw scheduler queue.
Yuck! But yeah, that would work. A less invasive alternative would be to reserve some compute units for graphics contexts so we can guarantee forward progress for graphics contexts even when all CUs working on compute stuff are stuck on page faults.
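The CU-reservation alternative can be modeled in a few lines. This is a toy sketch with made-up names, only meant to show the forward-progress guarantee: a few compute units are set aside for fence-signalling (graphics) work, so it can still run even when every other CU is stuck on a page fault.

```python
# Toy model of reserving compute units for graphics contexts so that
# fence-signalling work always makes forward progress.

class CuPool:
    def __init__(self, total, reserved_for_gfx):
        self.free_compute = total - reserved_for_gfx
        self.free_gfx = reserved_for_gfx

    def grab(self, for_gfx):
        """Claim one CU; graphics draws only from its reserved set."""
        if for_gfx and self.free_gfx > 0:
            self.free_gfx -= 1
            return True
        if not for_gfx and self.free_compute > 0:
            self.free_compute -= 1
            return True
        return False

pool = CuPool(total=8, reserved_for_gfx=2)
while pool.grab(for_gfx=False):        # compute fills (and faults on) its CUs
    pass
print(pool.grab(for_gfx=True))         # True: gfx still makes progress
```

The obvious cost, raised below, is that the reserved CUs are lost to compute even when no graphics work is running.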
Another approach that Jerome suggested is to track the reverse dependency graph of all dma_fence somehow and make sure that direct reclaim never recurses on an engine you're serving a pagefault for. Possible in theory, but in practice I think not feasible to implement because way too much work to implement.
I agree.
Either way it's imo really nasty to come up with a scheme here that doesn't fail in some corner, or becomes really nasty with inconsistent rules across different drivers and hw :-(
Yeah. The cleanest approach is to avoid DMA fences altogether for device/engines that can get stuck on page faults. A user mode command submission model would do that.
Reserving some compute units for graphics contexts that signal fences but never page fault should also work.
Regards, Felix
On Fri, Jan 8, 2021 at 5:36 PM Felix Kuehling felix.kuehling@amd.com wrote:
Am 2021-01-08 um 11:06 a.m. schrieb Daniel Vetter:
On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling felix.kuehling@amd.com wrote:
Am 2021-01-08 um 9:40 a.m. schrieb Daniel Vetter:
On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
If (big if) we switched to user mode command submission for all compute and graphics contexts, and no longer use DMA fences to signal their completion, I think that would solve the problem as far as the kernel is concerned.
We can't throw dma_fence away because it's uapi built into various compositor protocols. Otherwise we could pull a WDDM2 like Microsoft did on Windows and do what you're describing. So completely getting rid of dma_fences (even just limited to newer gpus) is also a decade-long effort at least, since that's roughly how long it'll take to sunset and convert everything over.
The other problem is that we're now building more stuff on top of dma_resv, like the dynamic dma-buf p2p stuff, now integrated into rdma. I think even internally in the kernel it would be a massive pain to untangle our fencing sufficiently to make this all happen without loops. And I'm not even sure whether we could prevent deadlocks by splitting dma_fence up into the userspace sync parts and the kernel internal sync parts, since they leak into one another.
Yuck! But yeah, that would work. A less invasive alternative would be to reserve some compute units for graphics contexts so we can guarantee forward progress for graphics contexts even when all CUs working on compute stuff are stuck on page faults.
Won't this hurt compute workloads? I think we need something where at least pure compute or pure gl/vk workloads run at full performance. And without preempt we can't take anything back when we need it, so we'd have to always upfront reserve some cores just in case.
Yeah. The cleanest approach is to avoid DMA fences altogether for device/engines that can get stuck on page faults. A user mode command submission model would do that.
Reserving some compute units for graphics contexts that signal fences but never page fault should also work.
The trouble is you don't just need engines, you need compute resources/cores behind them too (assuming I'm understanding correctly how this works on amd hw). Otherwise you end up with a gl context that should complete to resolve the deadlock, but can't, because it can't run its shaders while all the shader cores are stuck in compute page faults somewhere. Hence the gang scheduling would need to be at a level where you can guarantee full isolation of hw resources, either because you can preempt stuck compute kernels and let gl shaders run, or because of hw core partition or something else. If you can't, you need to gang schedule the entire gpu.
I think in practice that's not too ugly, since for pure compute workloads you're most likely not going to have a desktop running. And for developer machines we should be able to push the occasional gfx update through the gpu without causing too much stutter on the desktop or costing too much perf on the compute side. And pure gl/vk or pure compute workloads should keep running at full performance. -Daniel
Am 2021-01-08 um 11:53 a.m. schrieb Daniel Vetter:
On Fri, Jan 8, 2021 at 5:36 PM Felix Kuehling felix.kuehling@amd.com wrote:
Am 2021-01-08 um 11:06 a.m. schrieb Daniel Vetter:
On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling felix.kuehling@amd.com wrote:
Am 2021-01-08 um 9:40 a.m. schrieb Daniel Vetter:
On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
Uh, why did your hw guys drop this :-/
Performance. It's the same reason why the XNACK mode selection API exists (patch 16). When we enable recoverable page fault handling in the compute units on GFXv9, it costs some performance even when no page faults are happening. On GFXv10 that retry fault handling moved out of the compute units, so they don't take the performance hit. But that sacrificed the ability to preempt during page faults. We'll need to work with our hardware teams to restore that capability in a future generation.
We can't throw dma_fence away because it's uapi built into various compositor protocols. Otherwise we could pull a wddm2 like Microsoft did on Windows and do what you're describing. So completely getting rid of dma_fences (even just limited to newer gpus) is also a decade-long effort at least, since that's roughly how long it'll take to sunset and convert everything over.
OK.
The other problem is that we're now building more stuff on top of dma_resv like the dynamic dma-buf p2p stuff, now integrated into rdma. I think even internally in the kernel it would be a massive pain to untangle our fencing sufficiently to make this all happen without loops. And I'm not even sure whether we could prevent deadlocks by splitting dma_fence up into the userspace sync parts and the kernel internal sync parts, since they leak into each other.
I do think it can be rescued with what I call gang scheduling of engines: I.e. when a given engine is running a context (or a group of engines, depending how your hw works) that can cause a page fault, you must flush out all workloads running on the same engine which could block a dma_fence (preempt them, or for non-compute stuff, force their completion). And the other way round, i.e. before you can run a legacy gl workload with a dma_fence on these engines you need to preempt all ctxs that could cause page faults and take them at least out of the hw scheduler queue.
Yuck! But yeah, that would work. A less invasive alternative would be to reserve some compute units for graphics contexts so we can guarantee forward progress for graphics contexts even when all CUs working on compute stuff are stuck on page faults.
Won't this hurt compute workloads? I think we need something where at least pure compute or pure gl/vk workloads run at full performance. And without preempt we can't take anything back when we need it, so we would have to always upfront reserve some cores just in case.
Yes, it would hurt proportionally to how many CUs get reserved. On big GPUs with many CUs the impact could be quite small.
That said, I'm not sure it'll work on our hardware. Our CUs can execute multiple wavefronts from different contexts and switch between them with fine granularity. I'd need to check with our HW engineers whether this CU-internal context switching is still possible during page faults on GFXv10.
Just reserving an sdma engine for copy jobs and ptes updates and that stuff is necessary, but not sufficient.
Another approach that Jerome suggested is to track the reverse dependency graph of all dma_fences somehow and make sure that direct reclaim never recurses on an engine you're serving a pagefault for. Possible in theory, but in practice I think not feasible, because it would be way too much work to implement.
I agree.
Either way it's imo really nasty to come up with a scheme here that doesn't fail in some corner, or becomes really nasty with inconsistent rules across different drivers and hw :-(
Yeah. The cleanest approach is to avoid DMA fences altogether for device/engines that can get stuck on page faults. A user mode command submission model would do that.
Reserving some compute units for graphics contexts that signal fences but never page fault should also work.
The trouble is you don't just need engines, you need compute resources/cores behind them too (assuming I'm understanding correctly how this works on amd hw). Otherwise you end up with a gl context that should complete to resolve the deadlock, but can't, because it can't run its shaders: all the shader cores are stuck in compute page faults somewhere.
That's why I suggested reserving some CUs that would never execute compute workloads that can page fault.
Hence the gang scheduling would need to be at a level where you can guarantee full isolation of hw resources, either because you can preempt stuck compute kernels and let gl shaders run, or because of hw core partitioning or something else. If you can't, you need to gang schedule the entire gpu.
Yes.
I think in practice that's not too ugly, since for pure compute workloads you're most likely not going to have a desktop running anyway.
We still need legacy contexts for video decoding and post processing. But maybe we can find a fix for that too.
And for developer machines we should be able to push the occasional gfx update through the gpu still without causing too much stutter on the desktop or costing too much perf on the compute side. And pure gl/vk or pure compute workloads should keep running at full performance.
I think it would be acceptable for mostly-compute workloads. It would be bad for desktop workloads with some compute, e.g. games with OpenCL-based physics. We're increasingly relying on KFD for all GPU computing (including OpenCL) in desktop applications. But those could live without GPU page faults until we can build sane hardware.
Regards, Felix
-Daniel
Regards, Felix
Cheers, Daniel
Regards, Felix
Note that the dma_fence_wait is a hard requirement, because we need that for mmu notifiers and shrinkers; disallowing it would disable dynamic memory management. Which is the current "ttm is self-limited to 50% of system memory" limitation Christian is trying to lift. So that's really not a restriction we can lift, at least not upstream, where we need to also support old style hardware which doesn't have page fault support and really has no other option to handle memory management than dma_fence_wait.
Thread was here:
https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1N...
There's a few ways to resolve this (without having preempt-capable hardware), but they're all supremely nasty. -Daniel
Thanks, Felix
> I'll try to look at this more in-depth when I'm catching up on mails. > -Daniel > >> Alex Sierra (12): >> drm/amdgpu: replace per_device_list by array >> drm/amdkfd: helper to convert gpu id and idx >> drm/amdkfd: add xnack enabled flag to kfd_process >> drm/amdkfd: add ioctl to configure and query xnack retries >> drm/amdkfd: invalidate tables on page retry fault >> drm/amdkfd: page table restore through svm API >> drm/amdkfd: SVM API call to restore page tables >> drm/amdkfd: add svm_bo reference for eviction fence >> drm/amdgpu: add param bit flag to create SVM BOs >> drm/amdkfd: add svm_bo eviction mechanism support >> drm/amdgpu: svm bo enable_signal call condition >> drm/amdgpu: add svm_bo eviction to enable_signal cb >> >> Philip Yang (23): >> drm/amdkfd: select kernel DEVICE_PRIVATE option >> drm/amdkfd: add svm ioctl API >> drm/amdkfd: Add SVM API support capability bits >> drm/amdkfd: register svm range >> drm/amdkfd: add svm ioctl GET_ATTR op >> drm/amdgpu: add common HMM get pages function >> drm/amdkfd: validate svm range system memory >> drm/amdkfd: register overlap system memory range >> drm/amdkfd: deregister svm range >> drm/amdgpu: export vm update mapping interface >> drm/amdkfd: map svm range to GPUs >> drm/amdkfd: svm range eviction and restore >> drm/amdkfd: register HMM device private zone >> drm/amdkfd: validate vram svm range from TTM >> drm/amdkfd: support xgmi same hive mapping >> drm/amdkfd: copy memory through gart table >> drm/amdkfd: HMM migrate ram to vram >> drm/amdkfd: HMM migrate vram to ram >> drm/amdgpu: reserve fence slot to update page table >> drm/amdgpu: enable retry fault wptr overflow >> drm/amdkfd: refine migration policy with xnack on >> drm/amdkfd: add svm range validate timestamp >> drm/amdkfd: multiple gpu migrate vram to vram >> >> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c | 3 + >> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h | 4 +- >> .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c | 16 +- >> 
.../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 13 +- >> drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 83 + >> drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h | 7 + >> drivers/gpu/drm/amd/amdgpu/amdgpu_object.h | 5 + >> drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 90 +- >> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 47 +- >> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 10 + >> drivers/gpu/drm/amd/amdgpu/vega10_ih.c | 32 +- >> drivers/gpu/drm/amd/amdgpu/vega20_ih.c | 32 +- >> drivers/gpu/drm/amd/amdkfd/Kconfig | 1 + >> drivers/gpu/drm/amd/amdkfd/Makefile | 4 +- >> drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 170 +- >> drivers/gpu/drm/amd/amdkfd/kfd_iommu.c | 8 +- >> drivers/gpu/drm/amd/amdkfd/kfd_migrate.c | 866 ++++++ >> drivers/gpu/drm/amd/amdkfd/kfd_migrate.h | 59 + >> drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 52 +- >> drivers/gpu/drm/amd/amdkfd/kfd_process.c | 200 +- >> .../amd/amdkfd/kfd_process_queue_manager.c | 6 +- >> drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 2564 +++++++++++++++++ >> drivers/gpu/drm/amd/amdkfd/kfd_svm.h | 135 + >> drivers/gpu/drm/amd/amdkfd/kfd_topology.c | 1 + >> drivers/gpu/drm/amd/amdkfd/kfd_topology.h | 10 +- >> include/uapi/linux/kfd_ioctl.h | 169 +- >> 26 files changed, 4296 insertions(+), 291 deletions(-) >> create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c >> create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h >> create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c >> create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h >> >> -- >> 2.29.2 >> >> _______________________________________________ >> dri-devel mailing list >> dri-devel@lists.freedesktop.org >> https://lists.freedesktop.org/mailman/listinfo/dri-devel
On Fri, Jan 08, 2021 at 12:56:24PM -0500, Felix Kuehling wrote:
Am 2021-01-08 um 11:53 a.m. schrieb Daniel Vetter:
On Fri, Jan 8, 2021 at 5:36 PM Felix Kuehling felix.kuehling@amd.com wrote:
Am 2021-01-08 um 11:06 a.m. schrieb Daniel Vetter:
On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling felix.kuehling@amd.com wrote:
Am 2021-01-08 um 9:40 a.m. schrieb Daniel Vetter:
On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote: > Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter: >> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote: >>> This is the first version of our HMM based shared virtual memory manager >>> for KFD. There are still a number of known issues that we're working through >>> (see below). This will likely lead to some pretty significant changes in >>> MMU notifier handling and locking on the migration code paths. So don't >>> get hung up on those details yet. >>> >>> But I think this is a good time to start getting feedback. We're pretty >>> confident about the ioctl API, which is both simple and extensible for the >>> future. (see patches 4,16) The user mode side of the API can be found here: >>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wi... >>> >>> I'd also like another pair of eyes on how we're interfacing with the GPU VM >>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25), >>> and some retry IRQ handling changes (32). >>> >>> >>> Known issues: >>> * won't work with IOMMU enabled, we need to dma_map all pages properly >>> * still working on some race conditions and random bugs >>> * performance is not great yet >> Still catching up, but I think there's another one for your list: >> >> * hmm gpu context preempt vs page fault handling. I've had a short >> discussion about this one with Christian before the holidays, and also >> some private chats with Jerome. It's nasty since no easy fix, much less >> a good idea what's the best approach here. > Do you have a pointer to that discussion or any more details? Essentially if you're handling an hmm page fault from the gpu, you can deadlock by calling dma_fence_wait on a (chain of, possibly) other command submissions or compute contexts with dma_fence_wait. Which deadlocks if you can't preempt while you have that page fault pending. Two solutions:
- your hw can (at least for compute ctx) preempt even when a page fault is pending
Our GFXv9 GPUs can do this. GFXv10 cannot.
Uh, why did your hw guys drop this :-/
Performance. It's the same reason why the XNACK mode selection API exists (patch 16). When we enable recoverable page fault handling in the compute units on GFXv9, it costs some performance even when no page faults are happening. On GFXv10 that retry fault handling moved out of the compute units, so they don't take the performance hit. But that sacrificed the ability to preempt during page faults. We'll need to work with our hardware teams to restore that capability in a future generation.
Ah yes, you need to stall in more points in the compute cores to make sure you can recover if the page fault gets interrupted.
Maybe my knowledge is outdated, but my understanding is that nvidia can also preempt (but only for compute jobs, since oh dear the pain this would be for all the fixed function stuff). Since gfx10 moved page fault handling further away from compute cores, do you know whether this now means you can do page faults for (some?) fixed function stuff too? Or still only for compute?
Supporting page fault for 3d would be real pain with the corner we're stuck in right now, but better we know about this early than later :-/
- lots of screaming in trying to come up with an alternate solution. They all suck.
My idea for GFXv10 is to avoid preemption for memory management purposes and rely 100% on page faults instead. That is, if the memory manager needs to prevent GPU access to certain memory, just invalidate the GPU page table entries pointing to that memory. No waiting for fences is necessary, except for the SDMA job that invalidates the PTEs, which runs on a special high-priority queue that should never deadlock. That should prevent the CPU getting involved in deadlocks in kernel mode. But you can still deadlock the GPU in user mode if all compute units get stuck in page faults and can't switch to any useful work any more. So it's possible that we won't be able to use GPU page faults on our GFXv10 GPUs.
This only works if _everything_ in the system works like this, since you're de facto breaking the cross-driver contract. As soon as there's some legacy gl workload (userptr) or another driver involved, this approach falls apart.
I think the scenario you have in mind involves a dma_fence that depends on the resolution of a GPU page fault. With our user mode command submission model for compute contexts, there are no DMA fences that get signaled by compute jobs that could get stuck on page faults.
The legacy GL workload would not get GPU page faults. The only way it could get stuck is, if all CUs are stuck on page faults and the command processor can't find any HW resources to execute it on. That's my user mode deadlock scenario below. So yeah, you're right, kernel mode can't avoid getting involved in that unless everything uses user mode command submissions.
If (big if) we switched to user mode command submission for all compute and graphics contexts, and no longer use DMA fences to signal their completion, I think that would solve the problem as far as the kernel is concerned.
We can't throw dma_fence away because it's uapi built into various compositor protocols. Otherwise we could pull a wddm2 like Microsoft did on Windows and do what you're describing. So completely getting rid of dma_fences (even just limited to newer gpus) is also a decade-long effort at least, since that's roughly how long it'll take to sunset and convert everything over.
OK.
The other problem is that we're now building more stuff on top of dma_resv like the dynamic dma-buf p2p stuff, now integrated into rdma. I think even internally in the kernel it would be a massive pain to untangle our fencing sufficiently to make this all happen without loops. And I'm not even sure whether we could prevent deadlocks by splitting dma_fence up into the userspace sync parts and the kernel internal sync parts, since they leak into each other.
I do think it can be rescued with what I call gang scheduling of engines: I.e. when a given engine is running a context (or a group of engines, depending how your hw works) that can cause a page fault, you must flush out all workloads running on the same engine which could block a dma_fence (preempt them, or for non-compute stuff, force their completion). And the other way round, i.e. before you can run a legacy gl workload with a dma_fence on these engines you need to preempt all ctxs that could cause page faults and take them at least out of the hw scheduler queue.
Yuck! But yeah, that would work. A less invasive alternative would be to reserve some compute units for graphics contexts so we can guarantee forward progress for graphics contexts even when all CUs working on compute stuff are stuck on page faults.
Won't this hurt compute workloads? I think we need something where at least pure compute or pure gl/vk workloads run at full performance. And without preempt we can't take anything back when we need it, so we would have to always upfront reserve some cores just in case.
Yes, it would hurt proportionally to how many CUs get reserved. On big GPUs with many CUs the impact could be quite small.
Also, we could do the reservation only for the time when there's actually a legacy context with a normal dma_fence in the scheduler queue, assuming that reserving/unreserving CUs isn't too expensive an operation. If it's as expensive as a full stall, it's probably not worth the complexity here; just go with a full stall and only run one or the other at a time.
Wrt desktops I'm also somewhat worried that we might end up killing desktop workloads if there aren't enough CUs reserved for them and they end up taking too long and anger either tdr or, worse, the user, because the desktop is unusable when you start a compute job and get a big pile of faults. Probably needs some testing to see how bad it is.
That said, I'm not sure it'll work on our hardware. Our CUs can execute multiple wavefronts from different contexts and switch between them with fine granularity. I'd need to check with our HW engineers whether this CU-internal context switching is still possible during page faults on GFXv10.
You'd need to do the reservation for all contexts/engines which can cause page faults, otherwise it'd leak.
Just reserving an sdma engine for copy jobs and ptes updates and that stuff is necessary, but not sufficient.
Another approach that Jerome suggested is to track the reverse dependency graph of all dma_fences somehow and make sure that direct reclaim never recurses on an engine you're serving a pagefault for. Possible in theory, but in practice I think not feasible, because it would be way too much work to implement.
I agree.
Either way it's imo really nasty to come up with a scheme here that doesn't fail in some corner, or becomes really nasty with inconsistent rules across different drivers and hw :-(
Yeah. The cleanest approach is to avoid DMA fences altogether for device/engines that can get stuck on page faults. A user mode command submission model would do that.
Reserving some compute units for graphics contexts that signal fences but never page fault should also work.
The trouble is you don't just need engines, you need compute resources/cores behind them too (assuming I'm understanding correctly how this works on amd hw). Otherwise you end up with a gl context that should complete to resolve the deadlock, but can't, because it can't run its shaders: all the shader cores are stuck in compute page faults somewhere.
That's why I suggested reserving some CUs that would never execute compute workloads that can page fault.
Hence the gang scheduling would need to be at a level where you can guarantee full isolation of hw resources, either because you can preempt stuck compute kernels and let gl shaders run, or because of hw core partitioning or something else. If you can't, you need to gang schedule the entire gpu.
Yes.
I think in practice that's not too ugly, since for pure compute workloads you're most likely not going to have a desktop running anyway.
We still need legacy contexts for video decoding and post processing. But maybe we can find a fix for that too.
Hm I'd expect video workloads to not use page faults (even if they use compute for post processing). Same way that compute in vk/gl would still use all the legacy fencing (which excludes page fault support).
So a pure "compute always has to use page fault mode and user sync" rule I don't think is feasible. And then all the mixed-workload usage should be fine too.
And for developer machines we should be able to push the occasional gfx update through the gpu still without causing too much stutter on the desktop or costing too much perf on the compute side. And pure gl/vk or pure compute workloads should keep running at full performance.
I think it would be acceptable for mostly-compute workloads. It would be bad for desktop workloads with some compute, e.g. games with OpenCL-based physics. We're increasingly relying on KFD for all GPU computing (including OpenCL) in desktop applications. But those could live without GPU page faults until we can build sane hardware.
Uh ... I guess the challenge here is noticing when your opencl should be run in old-style mode. I guess you could link them together through some backchannel, so when a gl or vk context is set up you run opencl in the legacy mode without page faults for full perf together with vk. Still doesn't work if the app sets up ocl before vk/gl :-/ -Daniel
Regards, Felix
-Daniel
Regards, Felix
Cheers, Daniel
Regards, Felix
Note that the dma_fence_wait is a hard requirement, because we need that for mmu notifiers and shrinkers; disallowing it would disable dynamic memory management. Which is the current "ttm is self-limited to 50% of system memory" limitation Christian is trying to lift. So that's really not a restriction we can lift, at least not upstream, where we need to also support old style hardware which doesn't have page fault support and really has no other option to handle memory management than dma_fence_wait.
Thread was here:
https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1N...
There's a few ways to resolve this (without having preempt-capable hardware), but they're all supremely nasty. -Daniel
Am 2021-01-11 um 11:29 a.m. schrieb Daniel Vetter:
On Fri, Jan 08, 2021 at 12:56:24PM -0500, Felix Kuehling wrote:
Am 2021-01-08 um 11:53 a.m. schrieb Daniel Vetter:
On Fri, Jan 8, 2021 at 5:36 PM Felix Kuehling felix.kuehling@amd.com wrote:
Am 2021-01-08 um 11:06 a.m. schrieb Daniel Vetter:
On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling felix.kuehling@amd.com wrote:
Am 2021-01-08 um 9:40 a.m. schrieb Daniel Vetter: > On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote: >> Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter: >>> On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote: >>>> This is the first version of our HMM based shared virtual memory manager >>>> for KFD. There are still a number of known issues that we're working through >>>> (see below). This will likely lead to some pretty significant changes in >>>> MMU notifier handling and locking on the migration code paths. So don't >>>> get hung up on those details yet. >>>> >>>> But I think this is a good time to start getting feedback. We're pretty >>>> confident about the ioctl API, which is both simple and extensible for the >>>> future. (see patches 4,16) The user mode side of the API can be found here: >>>> https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wi... >>>> >>>> I'd also like another pair of eyes on how we're interfacing with the GPU VM >>>> code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25), >>>> and some retry IRQ handling changes (32). >>>> >>>> >>>> Known issues: >>>> * won't work with IOMMU enabled, we need to dma_map all pages properly >>>> * still working on some race conditions and random bugs >>>> * performance is not great yet >>> Still catching up, but I think there's another one for your list: >>> >>> * hmm gpu context preempt vs page fault handling. I've had a short >>> discussion about this one with Christian before the holidays, and also >>> some private chats with Jerome. It's nasty since no easy fix, much less >>> a good idea what's the best approach here. >> Do you have a pointer to that discussion or any more details? > Essentially if you're handling an hmm page fault from the gpu, you can > deadlock by calling dma_fence_wait on a (chain of, possibly) other command > submissions or compute contexts with dma_fence_wait. 
Which deadlocks if > you can't preempt while you have that page fault pending. Two solutions: > > - your hw can (at least for compute ctx) preempt even when a page fault is > pending Our GFXv9 GPUs can do this. GFXv10 cannot.
Uh, why did your hw guys drop this :-/
Performance. It's the same reason why the XNACK mode selection API exists (patch 16). When we enable recoverable page fault handling in the compute units on GFXv9, it costs some performance even when no page faults are happening. On GFXv10 that retry fault handling moved out of the compute units, so they don't take the performance hit. But that sacrificed the ability to preempt during page faults. We'll need to work with our hardware teams to restore that capability in a future generation.
Ah yes, you need to stall in more points in the compute cores to make sure you can recover if the page fault gets interrupted.
Maybe my knowledge is outdated, but my understanding is that nvidia can also preempt (but only for compute jobs, since oh dear the pain this would be for all the fixed function stuff). Since gfx10 moved page fault handling further away from compute cores, do you know whether this now means you can do page faults for (some?) fixed function stuff too? Or still only for compute?
I'm not sure.
Supporting page fault for 3d would be real pain with the corner we're stuck in right now, but better we know about this early than later :-/
I know Christian hates the idea. We know that page faults on GPUs can be a huge performance drain, because you're potentially stalling so many threads and the CPU can become a bottleneck dealing with all the page faults from many GPU threads. On the compute side, applications will be optimized to avoid them as much as possible, e.g. by pre-faulting or pre-fetching data before it's needed.
But I think you need page faults to make overcommitted memory with user mode command submission not suck.
I do think it can be rescued with what I call gang scheduling of engines: I.e. when a given engine is running a context (or a group of engines, depending how your hw works) that can cause a page fault, you must flush out all workloads running on the same engine which could block a dma_fence (preempt them, or for non-compute stuff, force their completion). And the other way round, i.e. before you can run a legacy gl workload with a dma_fence on these engines you need to preempt all ctxs that could cause page faults and take them at least out of the hw scheduler queue.
Yuck! But yeah, that would work. A less invasive alternative would be to reserve some compute units for graphics contexts so we can guarantee forward progress for graphics contexts even when all CUs working on compute stuff are stuck on page faults.
Won't this hurt compute workloads? I think we need something where at least pure compute or pure gl/vk workloads run at full performance. And without preempt we can't take anything back when we need it, so we would have to always upfront reserve some cores just in case.
Yes, it would hurt proportionally to how many CUs get reserved. On big GPUs with many CUs the impact could be quite small.
Also, we could do the reservation only for the time when there's actually a legacy context with normal dma_fence in the scheduler queue. Assuming that reserving/unreserving of CUs isn't too expensive operation. If it's as expensive as a full stall probably not worth the complexity here and just go with a full stall and only run one or the other at a time.
Wrt desktops I'm also somewhat worried that we might end up killing desktop workloads if there aren't enough CUs reserved for them and they end up taking too long and anger either tdr or, worse, the user, because the desktop is unusable when you start a compute job and get a big pile of faults. Probably needs some testing to see how bad it is.
That said, I'm not sure it'll work on our hardware. Our CUs can execute multiple wavefronts from different contexts and switch between them with fine granularity. I'd need to check with our HW engineers whether this CU-internal context switching is still possible during page faults on GFXv10.
You'd need to do the reservation for all contexts/engines which can cause page faults, otherwise it'd leak.
All engines that can page fault and cannot be preempted during faults.
Regards, Felix
Just reserving an sdma engine for copy jobs and pte updates and that stuff is necessary, but not sufficient.
Another approach that Jerome suggested is to track the reverse dependency graph of all dma_fences somehow and make sure that direct reclaim never recurses on an engine you're serving a page fault for. Possible in theory, but in practice I think not feasible, because it would be way too much work to implement.
I agree.
Either way it's imo really nasty to come up with a scheme here that doesn't fail in some corner case or become really nasty with inconsistent rules across different drivers and hw :-(
Yeah. The cleanest approach is to avoid DMA fences altogether for device/engines that can get stuck on page faults. A user mode command submission model would do that.
Reserving some compute units for graphics contexts that signal fences but never page fault should also work.
The trouble is you don't just need engines, you need the compute resources/cores behind them too (assuming I'm understanding correctly how this works on amd hw). Otherwise you end up with a gl context that should complete to resolve the deadlock, but can't, because it can't run its shaders: all the shader cores are stuck in compute page faults somewhere.
That's why I suggested reserving some CUs that would never execute compute workloads that can page fault.
Hence the gang scheduling would need to be at a level where you can guarantee full isolation of hw resources, either because you can preempt stuck compute kernels and let gl shaders run, or because of hw core partitioning or something else. If you can't, you need to gang schedule the entire gpu.
Yes.
I think in practice that's not too ugly since for pure compute workloads you're not going to have a desktop running most likely.
We still need legacy contexts for video decoding and post processing. But maybe we can find a fix for that too.
Hm I'd expect video workloads to not use page faults (even if they use compute for post processing). Same way that compute in vk/gl would still use all the legacy fencing (which excludes page fault support).
So pure "compute always has to use page fault mode and user sync" I don't think is feasible. And then all the mixed workloads useage should be fine too.
And for developer machines we should be able to push the occasional gfx update through the gpu still without causing too much stutter on the desktop or costing too much perf on the compute side. And pure gl/vk or pure compute workloads should keep running at full performance.
I think it would be acceptable for mostly-compute workloads. It would be bad for desktop workloads with some compute, e.g. games with OpenCL-based physics. We're increasingly relying on KFD for all GPU computing (including OpenCL) in desktop applications. But those could live without GPU page faults until we can build sane hardware.
Uh ... I guess the challenge here is noticing when your opencl should run in old-style mode. I guess you could link them together through some backchannel, so when a gl or vk context is set up you run opencl in the legacy mode without page faults, for full perf together with vk. Still doesn't work if the app sets up ocl before vk/gl :-/ -Daniel
Regards, Felix
Thanks, Felix

I'll try to look at this more in-depth when I'm catching up on mails. -Daniel

 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c    |    3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.h    |    4 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_fence.c  |   16 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   13 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        |   83 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |    7 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |    5 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |   90 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c        |   47 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h        |   10 +
 drivers/gpu/drm/amd/amdgpu/vega10_ih.c        |   32 +-
 drivers/gpu/drm/amd/amdgpu/vega20_ih.c        |   32 +-
 drivers/gpu/drm/amd/amdkfd/Kconfig            |    1 +
 drivers/gpu/drm/amd/amdkfd/Makefile           |    4 +-
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c      |  170 +-
 drivers/gpu/drm/amd/amdkfd/kfd_iommu.c        |    8 +-
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c      |  866 ++++++
 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h      |   59 +
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h         |   52 +-
 drivers/gpu/drm/amd/amdkfd/kfd_process.c      |  200 +-
 .../amd/amdkfd/kfd_process_queue_manager.c    |    6 +-
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c          | 2564 +++++++++++++++++
 drivers/gpu/drm/amd/amdkfd/kfd_svm.h          |  135 +
 drivers/gpu/drm/amd/amdkfd/kfd_topology.c     |    1 +
 drivers/gpu/drm/amd/amdkfd/kfd_topology.h     |   10 +-
 include/uapi/linux/kfd_ioctl.h                |  169 +-
 26 files changed, 4296 insertions(+), 291 deletions(-)
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_migrate.h
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.c
 create mode 100644 drivers/gpu/drm/amd/amdkfd/kfd_svm.h
On 2021-01-14 at 06:34, Felix Kuehling wrote:
On 2021-01-11 at 11:29 a.m., Daniel Vetter wrote:
On Fri, Jan 08, 2021 at 12:56:24PM -0500, Felix Kuehling wrote:
On 2021-01-08 at 11:53 a.m., Daniel Vetter wrote:
On Fri, Jan 8, 2021 at 5:36 PM Felix Kuehling felix.kuehling@amd.com wrote:
On 2021-01-08 at 11:06 a.m., Daniel Vetter wrote:
On Fri, Jan 8, 2021 at 4:58 PM Felix Kuehling felix.kuehling@amd.com wrote:
On 2021-01-08 at 9:40 a.m., Daniel Vetter wrote:

Still catching up, but I think there's another one for your list:

* hmm gpu context preempt vs page fault handling. I've had a short discussion about this one with Christian before the holidays, and also some private chats with Jerome. It's nasty since no easy fix, much less a good idea what's the best approach here.

Do you have a pointer to that discussion or any more details?

Essentially if you're handling an hmm page fault from the gpu, you can deadlock by calling dma_fence_wait on a (chain of, possibly) other command submissions or compute contexts with dma_fence_wait. Which deadlocks if you can't preempt while you have that page fault pending. Two solutions:

- your hw can (at least for compute ctx) preempt even when a page fault is pending

Our GFXv9 GPUs can do this. GFXv10 cannot.

Uh, why did your hw guys drop this :-/
Performance. It's the same reason why the XNACK mode selection API exists (patch 16). When we enable recoverable page fault handling in the compute units on GFXv9, it costs some performance even when no page faults are happening. On GFXv10 that retry fault handling moved out of the compute units, so they don't take the performance hit. But that sacrificed the ability to preempt during page faults. We'll need to work with our hardware teams to restore that capability in a future generation.
Ah yes, you need to stall at more points in the compute cores to make sure you can recover if the page fault gets interrupted.
Maybe my knowledge is outdated, but my understanding is that nvidia can also preempt (but only for compute jobs, since oh dear the pain this would be for all the fixed function stuff). Since gfx10 moved page fault handling further away from compute cores, do you know whether this now means you can do page faults for (some?) fixed function stuff too? Or still only for compute?
I'm not sure.
Supporting page fault for 3d would be real pain with the corner we're stuck in right now, but better we know about this early than later :-/
I know Christian hates the idea.
Well I don't hate the idea. I just don't think that this will ever work both correctly and with good performance.
A big part of the additional fun is that we currently have a mix of HMM capable engines (3D, compute, DMA) and non HMM capable engines (display, multimedia, etc.).
We know that page faults on GPUs can be a huge performance drain because you're stalling potentially so many threads, and the CPU can become a bottleneck dealing with all the page faults from many GPU threads. On the compute side, applications will be optimized to avoid them as much as possible, e.g. by pre-faulting or pre-fetching data before it's needed.
But I think you need page faults to make overcommitted memory with user mode command submission not suck.
Yeah, completely agree.
The only short term alternative I see is to have an IOCTL telling the kernel which memory is currently in use. And that is complete nonsense because it kills the advantage of why we want user mode command submission in the first place.
Regards, Christian.
On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
Still catching up, but I think there's another one for your list:
- hmm gpu context preempt vs page fault handling. I've had a short discussion about this one with Christian before the holidays, and also some private chats with Jerome. It's nasty since no easy fix, much less a good idea what's the best approach here.
Do you have a pointer to that discussion or any more details?
Essentially if you're handling an hmm page fault from the gpu, you can deadlock by calling dma_fence_wait on a (chain of, possibly) other command submissions or compute contexts with dma_fence_wait. Which deadlocks if you can't preempt while you have that page fault pending. Two solutions:
your hw can (at least for compute ctx) preempt even when a page fault is pending
lots of screaming in trying to come up with an alternate solution. They all suck.
Note that the dma_fence_wait is hard requirement, because we need that for mmu notifiers and shrinkers, disallowing that would disable dynamic memory management. Which is the current "ttm is self-limited to 50% of system memory" limitation Christian is trying to lift. So that's really not a restriction we can lift, at least not in upstream where we need to also support old style hardware which doesn't have page fault support and really has no other option to handle memory management than dma_fence_wait.
Thread was here:
https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1N...
There's a few ways to resolve this (without having preempt-capable hardware), but they're all supremely nasty. -Daniel
I had a new idea, I wanted to think more about it but have not yet, anyway here it is: adding a new callback to dma fence which asks the question "can it deadlock?". Any time a GPU driver has a pending page fault (ie something calling into the mm) it answers yes, otherwise no. The GPU shrinker would ask the question before waiting on any dma-fence and back off if it gets a yes. The shrinker can still try many dma-buf objects for which it does not get a yes on the associated fence.
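A minimal sketch of that proposed callback, using hypothetical toy types rather than the real struct dma_fence_ops (which has no such hook today). The conservative default here is that a fence without the callback is assumed able to deadlock, so the shrinker backs off:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for dma_fence / dma_fence_ops with the
 * proposed "can it deadlock?" hook.  Not real kernel API. */

struct toy_fence;

struct toy_fence_ops {
	/* true if the driver currently has a pending page fault,
	 * i.e. something that may call back into the mm */
	bool (*may_deadlock)(struct toy_fence *f);
};

struct toy_fence {
	const struct toy_fence_ops *ops;
	bool pending_gpu_fault;
};

static bool toy_may_deadlock(struct toy_fence *f)
{
	return f->pending_gpu_fault;
}

static const struct toy_fence_ops toy_ops = {
	.may_deadlock = toy_may_deadlock,
};

/* Shrinker-side check: only wait when the driver positively says
 * waiting cannot deadlock; unknown fences are treated as unsafe. */
static bool shrinker_may_wait_on(struct toy_fence *f)
{
	if (!f->ops || !f->ops->may_deadlock)
		return false;	/* conservative default: back off */
	return !f->ops->may_deadlock(f);
}
```

When the answer is no, the shrinker would skip that object and try another dma-buf instead of waiting.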
This does not solve the mmu notifier case; for that you would just invalidate the gem userptr object (with a flag, but not releasing the page refcount) and you would not wait for the GPU (ie no dma fence wait in that code path anymore). The userptr API never really made the contract that it will always be in sync with the mm view of the world, so if different pages get remapped to the same virtual address while the GPU is still working with the old pages it should not be an issue (it would not be in our usage of userptr for compositors and whatnot).
Maybe I overlook something there.
Cheers, Jérôme
On Wed, Jan 13, 2021 at 5:56 PM Jerome Glisse jglisse@redhat.com wrote:
I had a new idea, I wanted to think more about it but have not yet, anyway here it is: adding a new callback to dma fence which asks the question "can it deadlock?". Any time a GPU driver has a pending page fault (ie something calling into the mm) it answers yes, otherwise no. The GPU shrinker would ask the question before waiting on any dma-fence and back off if it gets a yes. The shrinker can still try many dma-buf objects for which it does not get a yes on the associated fence.
Having that answer on a given fence isn't enough, you still need to forward that information through the entire dependency graph, across drivers. That's the hard part, since that dependency graph is very implicit in the code, and we'd need to first roll it out across all drivers.
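The point about forwarding the answer through the dependency graph can be illustrated with a toy reachability walk. These are hypothetical structures; in reality the edges are implicit in the code and cross driver boundaries, which is exactly the hard part being described:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEPS 4

/* Toy fence with explicit dependency edges.  A fence that cannot fault
 * itself may still transitively depend on one that can, so a per-fence
 * answer alone is not enough. */
struct dep_fence {
	bool faults_itself;
	struct dep_fence *deps[MAX_DEPS];	/* fences this one waits on */
	size_t n_deps;
};

/* Walk the (assumed acyclic) dependency graph; true if any reachable
 * fence can page-fault. */
static bool chain_may_fault(const struct dep_fence *f)
{
	size_t i;

	if (f->faults_itself)
		return true;
	for (i = 0; i < f->n_deps; i++)
		if (chain_may_fault(f->deps[i]))
			return true;
	return false;
}
```

The sketch makes the edges explicit data; the real problem is that no such explicit cross-driver graph exists today, so every driver would have to start exporting it.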
This does not solve the mmu notifier case; for that you would just invalidate the gem userptr object (with a flag, but not releasing the page refcount) and you would not wait for the GPU (ie no dma fence wait in that code path anymore). The userptr API never really made the contract that it will always be in sync with the mm view of the world, so if different pages get remapped to the same virtual address while the GPU is still working with the old pages it should not be an issue (it would not be in our usage of userptr for compositors and whatnot).
Maybe I overlook something there.
tbh I'm never really clear on how much exactly we need, and whether maybe the new pin/unpin api should fix it all. -Daniel
On Wed, Jan 13, 2021 at 09:31:11PM +0100, Daniel Vetter wrote:
Having that answer on a given fence isn't enough, you still need to forward that information through the entire dependency graph, across drivers. That's the hard part, since that dependency graph is very implicit in the code, and we'd need to first roll it out across all drivers.
Here I am saying: do not wait on fences you are not sure about. Only wait on fences for which you are 100% certain you cannot deadlock. So if you can never be sure about a dma fence, then never wait on a dma-fence in the shrinker. However, most drivers should have enough information in their shrinker to know if it is safe to wait on fences internal to their device driver (and also to know if any of those fences has an implicit outside dependency). So a first implementation would be to always say "deadlock" and then have each driver build confidence in what it can ascertain.
tbh I'm never really clear on how much exactly we need, and whether maybe the new pin/unpin api should fix it all.
pin/unpin is not a solution, it is to fix something with GUP (where we need to know if a page is GUPed or not). GUP should die long-term, so anything using GUP (pin/unpin falls into that) should die long-term. Pinning memory is bad, period (it just breaks too much mm and it is unsolvable for things like mremap, splice, ...).
Cheers, Jérôme
On Thu, Jan 14, 2021 at 4:27 AM Jerome Glisse jglisse@redhat.com wrote:
On Wed, Jan 13, 2021 at 09:31:11PM +0100, Daniel Vetter wrote:
On Wed, Jan 13, 2021 at 5:56 PM Jerome Glisse jglisse@redhat.com wrote:
On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
Am 2021-01-07 um 4:23 a.m. schrieb Daniel Vetter:
On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote: > This is the first version of our HMM based shared virtual memory manager > for KFD. There are still a number of known issues that we're working through > (see below). This will likely lead to some pretty significant changes in > MMU notifier handling and locking on the migration code paths. So don't > get hung up on those details yet. > > But I think this is a good time to start getting feedback. We're pretty > confident about the ioctl API, which is both simple and extensible for the > future. (see patches 4,16) The user mode side of the API can be found here: > https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/blob/fxkamd/hmm-wi... > > I'd also like another pair of eyes on how we're interfacing with the GPU VM > code in amdgpu_vm.c (see patches 12,13), retry page fault handling (24,25), > and some retry IRQ handling changes (32). > > > Known issues: > * won't work with IOMMU enabled, we need to dma_map all pages properly > * still working on some race conditions and random bugs > * performance is not great yet Still catching up, but I think there's another one for your list:
- hmm gpu context preempt vs page fault handling. I've had a short discussion about this one with Christian before the holidays, and also some private chats with Jerome. It's nasty since no easy fix, much less a good idea what's the best approach here.
Do you have a pointer to that discussion or any more details?
Essentially if you're handling an hmm page fault from the gpu, you can deadlock by calling dma_fence_wait on a (chain of, possibly) other command submissions or compute contexts with dma_fence_wait. Which deadlocks if you can't preempt while you have that page fault pending. Two solutions:
your hw can (at least for compute ctx) preempt even when a page fault is pending
lots of screaming in trying to come up with an alternate solution. They all suck.
Note that the dma_fence_wait is hard requirement, because we need that for mmu notifiers and shrinkers, disallowing that would disable dynamic memory management. Which is the current "ttm is self-limited to 50% of system memory" limitation Christian is trying to lift. So that's really not a restriction we can lift, at least not in upstream where we need to also support old style hardware which doesn't have page fault support and really has no other option to handle memory management than dma_fence_wait.
Thread was here:
https://lore.kernel.org/dri-devel/CAKMK7uGgoeF8LmFBwWh5mW1k4xWjuUh3hdSFpVH1N...
There are a few ways to resolve this (without having preempt-capable hardware), but they're all supremely nasty. -Daniel
I had a new idea; I wanted to think more about it but have not yet, so anyway, here it is: add a new callback to dma_fence which asks the question "can it deadlock?". Any time a GPU driver has a pending page fault (i.e. something calling into the mm) it answers yes, otherwise no. The GPU shrinker would ask the question before waiting on any dma_fence and back off if it gets a yes. The shrinker can still try the many dma-buf objects for which it does not get a yes on the associated fence.
Having that answer on a given fence isn't enough, you still need to forward that information through the entire dependency graph, across drivers. That's the hard part, since that dependency graph is very implicit in the code, and we'd need to first roll it out across all drivers.
Here I am saying: do not wait on a fence you are not sure about. Only wait on fences for which you are 100% certain you cannot deadlock. So if you can never be sure about a dma_fence, then never wait on dma_fences in the shrinker. However, most drivers should have enough information in their shrinker to know whether it is safe to wait on fences internal to their device driver (and also whether any of those fences have implicit outside dependencies). So a first implementation would be to always assume a deadlock, and then have each driver build confidence in what it can ascertain.
I just don't think that actually works in practice:
- on a single gpu you can't wait for vk/gl due to shared CUs, so only sdma and uvd are left (or whatever else pure fixed function)
- for multi-gpu you get the guessing game of what leaks across gpus and what doesn't. With p2p dma-buf we're now leaking dma_fence across gpus even when there's no implicit syncing by userspace (although for amdgpu this is tricky since iirc it still lacks the flag to let userspace decide this, so this is more for other drivers).
- you don't just need to guarantee that there's no dma_fence dependency going back to you, you also need to make sure there's no other dependency chain through locks or whatever that closes the loop. And since your proposal here is against the dma_fence lockdep annotations we have now, lockdep won't help you (and let's be honest, review doesn't catch this stuff either, so it's up to hangs in production to catch this stuff)
- you still need the full dependency graph within the driver, and only i915 scheduler has that afaik. And I'm not sure implementing that was a bright idea
- assuming it's a deadlock by default means all gl/vk memory is pinned. That's not nice, plus in addition you need hacks like ttm's "max 50% of system memory" to paper over the worst fallout, which Christian is trying to lift. I really do think we need to be able to move towards more dynamic memory management, not less.
So in the end you're essentially disabling shrinking/eviction of other gpu tasks, and I don't think that works. I really think the only two realistic options are:
- guarantee forward progress of other dma_fence (hw preemption, reserved CUs, or whatever else you have)
- guarantee there's not a single offending dma_fence active in the system that could cause problems
Hand-waving that in theory we could track the dependencies and that in theory we could do some deadlock avoidance of some sorts about that just doesn't look like a pragmatic&practical solution to me here. It feels about as realistic as just creating a completely new memory management model that sidesteps the entire dma_fence issues we have due to mixing up kernel memory management and userspace sync fences in one thing.
Cheers, Daniel
This does not solve the mmu notifier case; for that you would just invalidate the gem userptr object (with a flag, but not releasing the page refcount), but you would not wait for the GPU (i.e. no dma fence wait in that code path anymore). The userptr API never really made the contract that it will always be in sync with the mm view of the world, so if different pages get remapped to the same virtual address while the GPU is still working with the old pages, it should not be an issue (it would not be in our usage of userptr for compositors and whatnot).
Maybe i overlook something there.
tbh I'm never really clear on how much exactly we need, and whether maybe the new pin/unpin api should fix it all.
pin/unpin is not a solution; it is there to fix something with GUP (where we need to know if a page is GUPed or not). GUP should die long term, so anything using GUP (pin/unpin falls into that) should die long term. Pinning memory is bad, period (it just breaks too much of the mm and is unsolvable for things like mremap, splice, ...).
Cheers, Jérôme
On Thu, Jan 14, 2021 at 10:26 AM Daniel Vetter daniel@ffwll.ch wrote:
On Thu, Jan 14, 2021 at 4:27 AM Jerome Glisse jglisse@redhat.com wrote:
On Wed, Jan 13, 2021 at 09:31:11PM +0100, Daniel Vetter wrote:
On Wed, Jan 13, 2021 at 5:56 PM Jerome Glisse jglisse@redhat.com wrote:
On Fri, Jan 08, 2021 at 03:40:07PM +0100, Daniel Vetter wrote:
On Thu, Jan 07, 2021 at 11:25:41AM -0500, Felix Kuehling wrote:
[SNIP]
Forgot one issue:
- somehow you need to transport the knowledge that you're in the gpu fault repair path of a specific engine down to shrinkers/mmu notifiers and all that. And it needs to be fairly specific, otherwise it just amounts again to "no more dma_fence_wait allowed".
-Daniel
-- Daniel Vetter Software Engineer, Intel Corporation http://blog.ffwll.ch
On 13.01.21 at 17:56, Jerome Glisse wrote:
[SNIP]
I had a new idea, i wanted to think more about it but have not yet, anyway here it is. Adding a new callback to dma fence which ask the question can it dead lock ? Any time a GPU driver has pending page fault (ie something calling into the mm) it answer yes, otherwise no. The GPU shrinker would ask the question before waiting on any dma-fence and back of if it gets yes. Shrinker can still try many dma buf object for which it does not get a yes on associated fence.
This does not solve the mmu notifier case, for this you would just invalidate the gem userptr object (with a flag but not releasing the page refcount) but you would not wait for the GPU (ie no dma fence wait in that code path anymore). The userptr API never really made the contract that it will always be in sync with the mm view of the world so if different page get remapped to same virtual address while GPU is still working with the old pages it should not be an issue (it would not be in our usage of userptr for compositor and what not).
The current working idea in my mind goes into a similar direction.
But instead of a callback I'm adding a complete new class of HMM fences.
Waiting in the MMU notifier, scheduler, TTM etc. is only allowed for dma_fences, and HMM fences are ignored in container objects.
When you handle an implicit or explicit synchronization request from userspace you need to block for HMM fences to complete before taking any resource locks.
Regards, Christian.
amd-gfx mailing list amd-gfx@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/amd-gfx
On Thu, Jan 14, 2021 at 11:49 AM Christian König ckoenig.leichtzumerken@gmail.com wrote:
[SNIP]
The current working idea in my mind goes into a similar direction.
But instead of a callback I'm adding a complete new class of HMM fences.
Waiting in the MMU notifier, scheduler, TTM etc. is only allowed for dma_fences, and HMM fences are ignored in container objects.
When you handle an implicit or explicit synchronization request from userspace you need to block for HMM fences to complete before taking any resource locks.
Isn't that what I call gang scheduling? I.e. you either run in HMM mode or in legacy fencing mode (whether implicit or explicit doesn't really matter, I think). By forcing that split we avoid the problem, but it means occasional full stalls on mixed workloads.
But that's not what Jerome wants (afaiui at least), I think his idea is to track the reverse dependencies of all the fences floating around, and then skip evicting an object if you have to wait for any fence that is problematic for the current calling context. And I don't think that's very feasible in practice.
So what kind of hmm fences do you have in mind here? -Daniel
On 14.01.21 at 12:52, Daniel Vetter wrote:
[SNIP]
Isn't that what I call gang scheduling? I.e. you either run in HMM mode or in legacy fencing mode (whether implicit or explicit doesn't really matter, I think). By forcing that split we avoid the problem, but it means occasional full stalls on mixed workloads.
But that's not what Jerome wants (afaiui at least), I think his idea is to track the reverse dependencies of all the fences floating around, and then skip evicting an object if you have to wait for any fence that is problematic for the current calling context. And I don't think that's very feasible in practice.
So what kind of hmm fences do you have in mind here?
It's a bit more relaxed than your gang schedule.
See the requirements are as follow:
1. dma_fences never depend on hmm_fences.
2. hmm_fences can never preempt dma_fences.
3. dma_fences must be able to preempt hmm_fences, or we always reserve enough hardware resources (CUs) to guarantee forward progress of dma_fences.
Critical sections are MMU notifiers, page faults, GPU schedulers and dma_reservation object locks.
4. It is valid to wait for dma_fences in critical sections.
5. It is not valid to wait for hmm_fences in critical sections.
Fence creation either happens during command submission or by adding something like a barrier or signal command to your userspace queue.
6. If we have an hmm_fence as an implicit or explicit dependency for creating a dma_fence, we must wait for it before taking any locks or reserving resources.
7. If we have a dma_fence as an implicit or explicit dependency for creating an hmm_fence, we can wait later on. So busy waiting or special WAIT hardware commands are valid.
This prevents hard cuts, e.g. we can mix hmm_fences and dma_fences at the same time on the hardware.
In other words we can have a high priority gfx queue running jobs based on dma_fences and a low priority compute queue running jobs based on hmm_fences.
Only when we switch from hmm_fence to dma_fence we need to block the submission until all the necessary resources (both memory as well as CUs) are available.
This is somewhat an extension to your gang submit idea.
Regards, Christian.
On Thu, Jan 14, 2021 at 2:37 PM Christian König christian.koenig@amd.com wrote:
On 14.01.21 at 12:52, Daniel Vetter wrote:
[SNIP]
Either I'm missing something, or this is just exactly what we documented already with userspace fences in general: you can't have a dma_fence depend upon a userspace fence (or hmm_fence).
My gang scheduling idea is really just an alternative for what you have listed as item 3 above. Instead of requiring preemption or guaranteed forward progress of some other sort, we flush out any pending dma_fence requests. But _only_ those which would get stalled by the job we're running, so high-priority sdma requests we need in the kernel to shuffle buffers around are still all ok. This would be needed if your hw can't preempt and you also have shared engines between compute and gfx, so reserving CUs won't solve the problem either.
What I don't mean with my gang scheduling is a completely exclusive mode between hmm_fence and dma_fence, since that would prevent us from using copy engines and dma_fence in the kernel to shuffle memory around for hmm jobs. And that would suck, even on compute-only workloads. Maybe I should rename "gang scheduling" to "engine flush" or something like that.
I think the basics of userspace or hmm_fence or whatever we'll call it we've documented already here:
https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_f...
I think the only thing missing is clarifying a bit what you have under item 3, i.e. how do we make sure there's no accidental hidden dependency between hmm_fence and dma_fence. Maybe a subsection about gpu page fault handling?
Or are we still talking past each other a bit here? -Daniel
On 14.01.21 at 14:57, Daniel Vetter wrote:
On Thu, Jan 14, 2021 at 2:37 PM Christian König christian.koenig@amd.com wrote:
Am 14.01.21 um 12:52 schrieb Daniel Vetter:
[SNIP]
I had a new idea, i wanted to think more about it but have not yet, anyway here it is. Adding a new callback to dma fence which ask the question can it dead lock ? Any time a GPU driver has pending page fault (ie something calling into the mm) it answer yes, otherwise no. The GPU shrinker would ask the question before waiting on any dma-fence and back of if it gets yes. Shrinker can still try many dma buf object for which it does not get a yes on associated fence.
This does not solve the mmu notifier case, for this you would just invalidate the gem userptr object (with a flag but not releasing the page refcount) but you would not wait for the GPU (ie no dma fence wait in that code path anymore). The userptr API never really made the contract that it will always be in sync with the mm view of the world so if different page get remapped to same virtual address while GPU is still working with the old pages it should not be an issue (it would not be in our usage of userptr for compositor and what not).
The current working idea in my mind goes into a similar direction.
But instead of a callback I'm adding a complete new class of HMM fences.
Waiting in the MMU notifier, scheduler, TTM etc. is only allowed for dma_fences, and HMM fences are ignored in container objects.
When you handle an implicit or explicit synchronization request from userspace you need to block for HMM fences to complete before taking any resource locks.
Isn't that what I call gang scheduling? I.e. you either run in HMM mode, or in legacy fencing mode (whether implicit or explicit doesn't really matter, I think). By forcing that split we avoid the problem, but it means occasionally full stalls on mixed workloads.
But that's not what Jerome wants (afaiui at least), I think his idea is to track the reverse dependencies of all the fences floating around, and then skip evicting an object if you have to wait for any fence that is problematic for the current calling context. And I don't think that's very feasible in practice.
So what kind of hmm fences do you have in mind here?
It's a bit more relaxed than your gang schedule.
The requirements are as follows:

1. dma_fences never depend on hmm_fences.
2. hmm_fences can never preempt dma_fences.
3. dma_fences must be able to preempt hmm_fences, or we always reserve enough hardware resources (CUs) to guarantee forward progress of dma_fences.

Critical sections are MMU notifiers, page faults, GPU schedulers and dma_reservation object locks.

4. It is valid to wait for a dma_fence in critical sections.
5. It is not valid to wait for hmm_fences in critical sections.

Fence creation either happens during command submission or by adding something like a barrier or signal command to your userspace queue.

6. If we have an hmm_fence as implicit or explicit dependency for creating a dma_fence, we must wait for that before taking any locks or reserving resources.
7. If we have a dma_fence as implicit or explicit dependency for creating an hmm_fence, we can wait later on. So busy waiting or special WAIT hardware commands are valid.

This prevents hard cuts; e.g. we can mix hmm_fences and dma_fences at the same time on the hardware.

In other words we can have a high priority gfx queue running jobs based on dma_fences and a low priority compute queue running jobs based on hmm_fences.

Only when we switch from hmm_fence to dma_fence do we need to block the submission until all the necessary resources (both memory as well as CUs) are available.

This is somewhat an extension to your gang submit idea.
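The dependency rules above can be illustrated with a small toy model (Python, not kernel code; all names are invented). It checks the one direction that must never happen -- a dma_fence depending, even transitively, on an hmm_fence -- and resolves it by waiting up front, as in item 6:

```python
# Toy model of two fence classes with the rule "dma_fences never depend
# on hmm_fences". Hypothetical names; not real dma_fence/HMM API.

class Fence:
    def __init__(self, name, kind, deps=()):
        assert kind in ("dma", "hmm")
        self.name, self.kind, self.deps = name, kind, list(deps)

def depends_on_hmm(fence):
    """True if any (transitive) dependency is an hmm_fence."""
    return any(d.kind == "hmm" or depends_on_hmm(d) for d in fence.deps)

def create_dma_fence(name, deps):
    # If an hmm_fence shows up as an implicit/explicit dependency, we must
    # wait for it *before* creating the dma_fence (and before taking any
    # locks), so the finished dma_fence never carries an hmm dependency.
    waited = [d for d in deps if d.kind == "hmm" or depends_on_hmm(d)]
    remaining = [d for d in deps if d not in waited]
    fence = Fence(name, "dma", remaining)
    assert not depends_on_hmm(fence)  # the invariant we must preserve
    return fence, [w.name for w in waited]

hmm = Fence("compute-job", "hmm")
dma = Fence("sdma-copy", "dma")
new, waited_for = create_dma_fence("gfx-job", [hmm, dma])
print(waited_for)  # -> ['compute-job']
```

The reverse direction (an hmm_fence depending on a dma_fence, item 7) needs no such up-front wait, since busy waiting on a dma_fence from the hmm side is always safe.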
Either I'm missing something, or this is just exactly what we documented already with userspace fences in general, and how you can't have a dma_fence depend upon a userspace fence (or hmm_fence).
My gang scheduling idea is really just an alternative for what you have listed as item 3 above. Instead of requiring preemption or requiring guaranteed forward progress of some other sort, we flush out any pending dma_fence requests. But _only_ those which would get stalled by the job we're running, so high-priority sdma requests we need in the kernel to shuffle buffers around are still all ok. This would be needed if your hw can't preempt, and you also have shared engines between compute and gfx, so reserving CUs won't solve the problem either.
What I don't mean with my gang scheduling is a completely exclusive mode between hmm_fence and dma_fence, since that would prevent us from using copy engines and dma_fence in the kernel to shuffle memory around for hmm jobs. And that would suck, even on compute-only workloads. Maybe I should rename "gang scheduling" to "engine flush" or something like that.
Yeah, "engine flush" makes it much clearer.
What I wanted to emphasize is that we have to mix dma_fences and hmm_fences running at the same time on the same hardware, fighting over the same resources.
E.g. even on the newest hardware multimedia engines can't handle page faults, so video decoding/encoding will still produce dma_fences.
I think we've already documented the basics of userspace fences (or hmm_fence, or whatever we'll call it) here:
https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_f...
This talks about the restrictions we have for dma_fences and why infinite fences (even as hmm_fence) will never work.
But it doesn't talk about how to handle implicit or explicit dependencies with something like hmm_fences.
In other words my proposal above allows for hmm_fences to show up in dma_reservation objects and be used together with all this explicit synchronization we still have, with only a medium amount of work :)
I think the only thing missing is clarifying a bit what you have under item 3, i.e. how do we make sure there's no accidental hidden dependency between hmm_fence and dma_fence. Maybe a subsection about gpu page fault handling?
The real improvement is item 6. The problem with it is that it requires auditing all occasions when we create dma_fences so that we don't accidentally depend on an HMM fence.
Regards, Christian.
Or are we still talking past each other a bit here? -Daniel
Regards, Christian.
-Daniel
On Thu, Jan 14, 2021 at 3:13 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 14.01.21 um 14:57 schrieb Daniel Vetter:
On Thu, Jan 14, 2021 at 2:37 PM Christian König christian.koenig@amd.com wrote:
Am 14.01.21 um 12:52 schrieb Daniel Vetter:
[SNIP]
I had a new idea; I wanted to think more about it but haven't yet, so anyway here it is: add a new callback to dma_fence which asks the question "can it deadlock?". Any time a GPU driver has a pending page fault (i.e. something calling into the mm) it answers yes, otherwise no. The GPU shrinker would ask the question before waiting on any dma-fence and back off if it gets a yes. The shrinker can still try many dma-buf objects for which it does not get a yes on the associated fence.
This does not solve the mmu notifier case; for this you would just invalidate the gem userptr object (with a flag, but not releasing the page refcount), but you would not wait for the GPU (i.e. no dma fence wait in that code path anymore). The userptr API never really made the contract that it will always be in sync with the mm view of the world, so if a different page gets remapped to the same virtual address while the GPU is still working with the old pages it should not be an issue (it would not be in our usage of userptr for compositors and whatnot).
The current working idea in my mind goes into a similar direction.
But instead of a callback I'm adding a complete new class of HMM fences.
Waiting in the MMU notifier, scheduler, TTM etc. is only allowed for dma_fences, and HMM fences are ignored in container objects.
When you handle an implicit or explicit synchronization request from userspace you need to block for HMM fences to complete before taking any resource locks.
Isn't that what I call gang scheduling? I.e. you either run in HMM mode, or in legacy fencing mode (whether implicit or explicit doesn't really matter, I think). By forcing that split we avoid the problem, but it means occasionally full stalls on mixed workloads.
But that's not what Jerome wants (afaiui at least), I think his idea is to track the reverse dependencies of all the fences floating around, and then skip evicting an object if you have to wait for any fence that is problematic for the current calling context. And I don't think that's very feasible in practice.
So what kind of hmm fences do you have in mind here?
It's a bit more relaxed than your gang schedule.
The requirements are as follows:

1. dma_fences never depend on hmm_fences.
2. hmm_fences can never preempt dma_fences.
3. dma_fences must be able to preempt hmm_fences, or we always reserve enough hardware resources (CUs) to guarantee forward progress of dma_fences.

Critical sections are MMU notifiers, page faults, GPU schedulers and dma_reservation object locks.

4. It is valid to wait for a dma_fence in critical sections.
5. It is not valid to wait for hmm_fences in critical sections.

Fence creation either happens during command submission or by adding something like a barrier or signal command to your userspace queue.

6. If we have an hmm_fence as implicit or explicit dependency for creating a dma_fence, we must wait for that before taking any locks or reserving resources.
7. If we have a dma_fence as implicit or explicit dependency for creating an hmm_fence, we can wait later on. So busy waiting or special WAIT hardware commands are valid.

This prevents hard cuts; e.g. we can mix hmm_fences and dma_fences at the same time on the hardware.

In other words we can have a high priority gfx queue running jobs based on dma_fences and a low priority compute queue running jobs based on hmm_fences.

Only when we switch from hmm_fence to dma_fence do we need to block the submission until all the necessary resources (both memory as well as CUs) are available.

This is somewhat an extension to your gang submit idea.
Either I'm missing something, or this is just exactly what we documented already with userspace fences in general, and how you can't have a dma_fence depend upon a userspace fence (or hmm_fence).
My gang scheduling idea is really just an alternative for what you have listed as item 3 above. Instead of requiring preemption or requiring guaranteed forward progress of some other sort, we flush out any pending dma_fence requests. But _only_ those which would get stalled by the job we're running, so high-priority sdma requests we need in the kernel to shuffle buffers around are still all ok. This would be needed if your hw can't preempt, and you also have shared engines between compute and gfx, so reserving CUs won't solve the problem either.
What I don't mean with my gang scheduling is a completely exclusive mode between hmm_fence and dma_fence, since that would prevent us from using copy engines and dma_fence in the kernel to shuffle memory around for hmm jobs. And that would suck, even on compute-only workloads. Maybe I should rename "gang scheduling" to "engine flush" or something like that.
Yeah, "engine flush" makes it much clearer.
What I wanted to emphasize is that we have to mix dma_fences and hmm_fences running at the same time on the same hardware, fighting over the same resources.
E.g. even on the newest hardware multimedia engines can't handle page faults, so video decoding/encoding will still produce dma_fences.
Well we also have to mix them so the kernel can shovel data around using copy engines. Plus we have to mix them at the overall subsystem level because I'm not sure SoC-class gpus will ever get here; they definitely aren't there yet.
I think we've already documented the basics of userspace fences (or hmm_fence, or whatever we'll call it) here:
https://dri.freedesktop.org/docs/drm/driver-api/dma-buf.html?highlight=dma_f...
This talks about the restrictions we have for dma_fences and why infinite fences (even as hmm_fence) will never work.
But it doesn't talk about how to handle implicit or explicit dependencies with something like hmm_fences.
In other words my proposal above allows for hmm_fences to show up in dma_reservation objects and be used together with all this explicit synchronization we still have, with only a medium amount of work :)
Oh. I don't think we should put any hmm_fence or other infinite fence into a dma_resv object. At least not into the current dma_resv object, because then we have that infinite fences problem everywhere, and very hard to audit.
What we could do is add new hmm_fence only slots for implicit sync, but I think consensus is that implicit sync is bad, never do it again. Last time around (for timeline syncobj) we've also pushed the waiting on cross-over to userspace, and I think that's the right option, so we need userspace to understand the hmm fence anyway. At that point we might as well bite the bullet and do another round of wayland/dri protocols.
So from that pov I think the kernel should at most deal with an hmm_fence for cross-process communication and maybe some standard wait primitives (for userspace to use, not for the kernel).
The only use case this would forbid is using page faults for legacy implicit/explicit dma_fence synced workloads, and I think that's perfectly ok to not allow. Especially since the motivation here for all this is compute, and compute doesn't pass around dma_fences anyway.
I think the only thing missing is clarifying a bit what you have under item 3, i.e. how do we make sure there's no accidental hidden dependency between hmm_fence and dma_fence. Maybe a subsection about gpu page fault handling?
The real improvement is item 6. The problem with it is that it requires auditing all occasions when we create dma_fences so that we don't accidentally depend on an HMM fence.
We have that rule already, it's the "dma_fence must not depend upon an infinite fence anywhere" rule we documented last summer. So that doesn't feel new. -Daniel
Regards, Christian.
Or are we still talking past each other a bit here? -Daniel
Regards, Christian.
-Daniel
Am 14.01.21 um 15:23 schrieb Daniel Vetter:
On Thu, Jan 14, 2021 at 3:13 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 14.01.21 um 14:57 schrieb Daniel Vetter:
On Thu, Jan 14, 2021 at 2:37 PM Christian König christian.koenig@amd.com wrote:
Am 14.01.21 um 12:52 schrieb Daniel Vetter:
[SNIP]
I had a new idea; I wanted to think more about it but haven't yet, so anyway here it is: add a new callback to dma_fence which asks the question "can it deadlock?". Any time a GPU driver has a pending page fault (i.e. something calling into the mm) it answers yes, otherwise no. The GPU shrinker would ask the question before waiting on any dma-fence and back off if it gets a yes. The shrinker can still try many dma-buf objects for which it does not get a yes on the associated fence.

This does not solve the mmu notifier case; for this you would just invalidate the gem userptr object (with a flag, but not releasing the page refcount), but you would not wait for the GPU (i.e. no dma fence wait in that code path anymore). The userptr API never really made the contract that it will always be in sync with the mm view of the world, so if a different page gets remapped to the same virtual address while the GPU is still working with the old pages it should not be an issue (it would not be in our usage of userptr for compositors and whatnot).

The current working idea in my mind goes into a similar direction.
But instead of a callback I'm adding a complete new class of HMM fences.
Waiting in the MMU notifier, scheduler, TTM etc. is only allowed for dma_fences, and HMM fences are ignored in container objects.
When you handle an implicit or explicit synchronization request from userspace you need to block for HMM fences to complete before taking any resource locks.
Isn't that what I call gang scheduling? I.e. you either run in HMM mode, or in legacy fencing mode (whether implicit or explicit doesn't really matter, I think). By forcing that split we avoid the problem, but it means occasionally full stalls on mixed workloads.
But that's not what Jerome wants (afaiui at least), I think his idea is to track the reverse dependencies of all the fences floating around, and then skip evicting an object if you have to wait for any fence that is problematic for the current calling context. And I don't think that's very feasible in practice.
So what kind of hmm fences do you have in mind here?
It's a bit more relaxed than your gang schedule.
The requirements are as follows:

1. dma_fences never depend on hmm_fences.
2. hmm_fences can never preempt dma_fences.
3. dma_fences must be able to preempt hmm_fences, or we always reserve enough hardware resources (CUs) to guarantee forward progress of dma_fences.

Critical sections are MMU notifiers, page faults, GPU schedulers and dma_reservation object locks.

4. It is valid to wait for a dma_fence in critical sections.
5. It is not valid to wait for hmm_fences in critical sections.

Fence creation either happens during command submission or by adding something like a barrier or signal command to your userspace queue.

6. If we have an hmm_fence as implicit or explicit dependency for creating a dma_fence, we must wait for that before taking any locks or reserving resources.
7. If we have a dma_fence as implicit or explicit dependency for creating an hmm_fence, we can wait later on. So busy waiting or special WAIT hardware commands are valid.

This prevents hard cuts; e.g. we can mix hmm_fences and dma_fences at the same time on the hardware.

In other words we can have a high priority gfx queue running jobs based on dma_fences and a low priority compute queue running jobs based on hmm_fences.

Only when we switch from hmm_fence to dma_fence do we need to block the submission until all the necessary resources (both memory as well as CUs) are available.

This is somewhat an extension to your gang submit idea.
Either I'm missing something, or this is just exactly what we documented already with userspace fences in general, and how you can't have a dma_fence depend upon a userspace fence (or hmm_fence).
My gang scheduling idea is really just an alternative for what you have listed as item 3 above. Instead of requiring preemption or requiring guaranteed forward progress of some other sort, we flush out any pending dma_fence requests. But _only_ those which would get stalled by the job we're running, so high-priority sdma requests we need in the kernel to shuffle buffers around are still all ok. This would be needed if your hw can't preempt, and you also have shared engines between compute and gfx, so reserving CUs won't solve the problem either.
What I don't mean with my gang scheduling is a completely exclusive mode between hmm_fence and dma_fence, since that would prevent us from using copy engines and dma_fence in the kernel to shuffle memory around for hmm jobs. And that would suck, even on compute-only workloads. Maybe I should rename "gang scheduling" to "engine flush" or something like that.
Yeah, "engine flush" makes it much clearer.
What I wanted to emphasize is that we have to mix dma_fences and hmm_fences running at the same time on the same hardware, fighting over the same resources.
E.g. even on the newest hardware multimedia engines can't handle page faults, so video decoding/encoding will still produce dma_fences.
Well we also have to mix them so the kernel can shovel data around using copy engines. Plus we have to mix them at the overall subsystem level because I'm not sure SoC-class gpus will ever get here; they definitely aren't there yet.
I think we've already documented the basics of userspace fences (or hmm_fence, or whatever we'll call it) here:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdri.freede...
This talks about the restrictions we have for dma_fences and why infinite fences (even as hmm_fence) will never work.
But it doesn't talk about how to handle implicit or explicit dependencies with something like hmm_fences.
In other words my proposal above allows for hmm_fences to show up in dma_reservation objects and be used together with all this explicit synchronization we still have, with only a medium amount of work :)
Oh. I don't think we should put any hmm_fence or other infinite fence into a dma_resv object. At least not into the current dma_resv object, because then we have that infinite fences problem everywhere, and very hard to audit.
Yes, exactly. That's why these rules describe how to mix them, or rather not mix them.
What we could do is add new hmm_fence only slots for implicit sync,
Yeah, we would have them separated from the dma_fence slots.
but I think consensus is that implicit sync is bad, never do it again. Last time around (for timeline syncobj) we've also pushed the waiting on cross-over to userspace, and I think that's the right option, so we need userspace to understand the hmm fence anyway. At that point we might as well bite the bullet and do another round of wayland/dri protocols.
As you said I don't see this happening in the next 5 years either.
So I think we have to somehow solve this in the kernel or we will go in circles all the time.
So from that pov I think the kernel should at most deal with an hmm_fence for cross-process communication and maybe some standard wait primitives (for userspace to use, not for the kernel).
The only use case this would forbid is using page faults for legacy implicit/explicit dma_fence synced workloads, and I think that's perfectly ok to not allow. Especially since the motivation here for all this is compute, and compute doesn't pass around dma_fences anyway.
As Alex said we will rather soon see this for gfx as well and we most likely will see combinations of old dma_fence based integrated graphics with new dedicated GPUs.
So I don't think we can say we reduce the problem to compute and don't support anything else.
Regards, Christian.
I think the only thing missing is clarifying a bit what you have under item 3, i.e. how do we make sure there's no accidental hidden dependency between hmm_fence and dma_fence. Maybe a subsection about gpu page fault handling?
The real improvement is item 6. The problem with it is that it requires auditing all occasions when we create dma_fences so that we don't accidentally depend on an HMM fence.
We have that rule already, it's the "dma_fence must not depend upon an infinite fence anywhere" rule we documented last summer. So that doesn't feel new. -Daniel
Regards, Christian.
Or are we still talking past each other a bit here? -Daniel
Regards, Christian.
-Daniel
On Thu, Jan 14, 2021 at 4:08 PM Christian König christian.koenig@amd.com wrote:
Am 14.01.21 um 15:23 schrieb Daniel Vetter:
On Thu, Jan 14, 2021 at 3:13 PM Christian König ckoenig.leichtzumerken@gmail.com wrote:
Am 14.01.21 um 14:57 schrieb Daniel Vetter:
On Thu, Jan 14, 2021 at 2:37 PM Christian König christian.koenig@amd.com wrote:
Am 14.01.21 um 12:52 schrieb Daniel Vetter:
[SNIP]

I had a new idea; I wanted to think more about it but haven't yet, so anyway here it is: add a new callback to dma_fence which asks the question "can it deadlock?". Any time a GPU driver has a pending page fault (i.e. something calling into the mm) it answers yes, otherwise no. The GPU shrinker would ask the question before waiting on any dma-fence and back off if it gets a yes. The shrinker can still try many dma-buf objects for which it does not get a yes on the associated fence.

This does not solve the mmu notifier case; for this you would just invalidate the gem userptr object (with a flag, but not releasing the page refcount), but you would not wait for the GPU (i.e. no dma fence wait in that code path anymore). The userptr API never really made the contract that it will always be in sync with the mm view of the world, so if a different page gets remapped to the same virtual address while the GPU is still working with the old pages it should not be an issue (it would not be in our usage of userptr for compositors and whatnot).

The current working idea in my mind goes into a similar direction.

But instead of a callback I'm adding a complete new class of HMM fences.

Waiting in the MMU notifier, scheduler, TTM etc. is only allowed for dma_fences, and HMM fences are ignored in container objects.

When you handle an implicit or explicit synchronization request from userspace you need to block for HMM fences to complete before taking any resource locks.

Isn't that what I call gang scheduling? I.e. you either run in HMM mode, or in legacy fencing mode (whether implicit or explicit doesn't really matter, I think). By forcing that split we avoid the problem, but it means occasionally full stalls on mixed workloads.
But that's not what Jerome wants (afaiui at least), I think his idea is to track the reverse dependencies of all the fences floating around, and then skip evicting an object if you have to wait for any fence that is problematic for the current calling context. And I don't think that's very feasible in practice.
So what kind of hmm fences do you have in mind here?
It's a bit more relaxed than your gang schedule.
The requirements are as follows:

1. dma_fences never depend on hmm_fences.
2. hmm_fences can never preempt dma_fences.
3. dma_fences must be able to preempt hmm_fences, or we always reserve enough hardware resources (CUs) to guarantee forward progress of dma_fences.

Critical sections are MMU notifiers, page faults, GPU schedulers and dma_reservation object locks.

4. It is valid to wait for a dma_fence in critical sections.
5. It is not valid to wait for hmm_fences in critical sections.

Fence creation either happens during command submission or by adding something like a barrier or signal command to your userspace queue.

6. If we have an hmm_fence as implicit or explicit dependency for creating a dma_fence, we must wait for that before taking any locks or reserving resources.
7. If we have a dma_fence as implicit or explicit dependency for creating an hmm_fence, we can wait later on. So busy waiting or special WAIT hardware commands are valid.

This prevents hard cuts; e.g. we can mix hmm_fences and dma_fences at the same time on the hardware.

In other words we can have a high priority gfx queue running jobs based on dma_fences and a low priority compute queue running jobs based on hmm_fences.

Only when we switch from hmm_fence to dma_fence do we need to block the submission until all the necessary resources (both memory as well as CUs) are available.

This is somewhat an extension to your gang submit idea.
Either I'm missing something, or this is just exactly what we documented already with userspace fences in general, and how you can't have a dma_fence depend upon a userspace fence (or hmm_fence).
My gang scheduling idea is really just an alternative for what you have listed as item 3 above. Instead of requiring preemption or requiring guaranteed forward progress of some other sort, we flush out any pending dma_fence requests. But _only_ those which would get stalled by the job we're running, so high-priority sdma requests we need in the kernel to shuffle buffers around are still all ok. This would be needed if your hw can't preempt, and you also have shared engines between compute and gfx, so reserving CUs won't solve the problem either.
What I don't mean with my gang scheduling is a completely exclusive mode between hmm_fence and dma_fence, since that would prevent us from using copy engines and dma_fence in the kernel to shuffle memory around for hmm jobs. And that would suck, even on compute-only workloads. Maybe I should rename "gang scheduling" to "engine flush" or something like that.
Yeah, "engine flush" makes it much clearer.
What I wanted to emphasize is that we have to mix dma_fences and hmm_fences running at the same time on the same hardware, fighting over the same resources.
E.g. even on the newest hardware multimedia engines can't handle page faults, so video decoding/encoding will still produce dma_fences.
Well we also have to mix them so the kernel can shovel data around using copy engines. Plus we have to mix them at the overall subsystem level because I'm not sure SoC-class gpus will ever get here; they definitely aren't there yet.
I think we've already documented the basics of userspace fences (or hmm_fence, or whatever we'll call it) here:
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdri.freede...
This talks about the restrictions we have for dma_fences and why infinite fences (even as hmm_fence) will never work.
But it doesn't talk about how to handle implicit or explicit dependencies with something like hmm_fences.
In other words my proposal above allows for hmm_fences to show up in dma_reservation objects and be used together with all this explicit synchronization we still have, with only a medium amount of work :)
Oh. I don't think we should put any hmm_fence or other infinite fence into a dma_resv object. At least not into the current dma_resv object, because then we have that infinite fences problem everywhere, and very hard to audit.
Yes, exactly. That's why these rules describe how to mix them, or rather not mix them.
What we could do is add new hmm_fence only slots for implicit sync,
Yeah, we would have them separated from the dma_fence slots.
but I think consensus is that implicit sync is bad, never do it again. Last time around (for timeline syncobj) we've also pushed the waiting on cross-over to userspace, and I think that's the right option, so we need userspace to understand the hmm fence anyway. At that point we might as well bite the bullet and do another round of wayland/dri protocols.
As you said I don't see this happening in the next 5 years either.
Well I guess we'll need to get started with that then, when you guys need it.
So I think we have to somehow solve this in the kernel or we will go in circles all the time.
So from that pov I think the kernel should at most deal with an hmm_fence for cross-process communication and maybe some standard wait primitives (for userspace to use, not for the kernel).
The only use case this would forbid is using page faults for legacy implicit/explicit dma_fence synced workloads, and I think that's perfectly ok to not allow. Especially since the motivation here for all this is compute, and compute doesn't pass around dma_fences anyway.
As Alex said we will rather soon see this for gfx as well and we most likely will see combinations of old dma_fence based integrated graphics with new dedicated GPUs.
So I don't think we can say we reduce the problem to compute and don't support anything else.
I'm not against pagefaults for gfx, just against pushing the magic into the kernel. I don't think that works, because it means we add stall points where userspace, especially vk userspace, really doesn't want them. So the same way as with timeline syncobj, we need to push the compat work into userspace.
There's going to be a few stall points:
- fully new stack: we wait for the userspace fence in the atomic commit path (which we can, if we're really careful, since we pin all buffers upfront and so there's no risk)
- userspace fencing gpu in the client, compositor protocol can pass around userspace fences, but the compositor still uses dma_fence for itself. There's some stalling in the compositor, which it does already anyway when it's collecting new frames from clients
- userspace fencing gpu in the client, but no compositor protocol: we wait in the swapchain, but in a separate thread so that nothing blocks that shouldn't block
If we instead go with "magic waits in the kernel behind userspace's back", like what your item 6 would imply, then we're not really solving anything.
For actual implementation I think the best would be an extension of drm_syncobj. Those already have at least conceptually future/infinite fences, and we already have fd passing, so "just" need some protocol to pass them around. Plus we could use the same uapi for timeline syncobj using dma_fence as for hmm_fence, so also easier to transition for userspace to the new world since don't need the new hw capability to roll out the new uapi and protocols.
That's not that hard to roll out, and technically a lot better than hacking up dma_resv and hoping we don't end up stalling in wrong places, which sounds very "eeeek" to me :-)
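The future-fence behavior that makes drm_syncobj a good fit can be sketched in userspace terms (a toy Python model, not the real drm_syncobj uAPI; fence materialization and signaling are collapsed into one event here): the container exists and can be passed around before its fence has signaled, and waiters simply block until it does.

```python
# Toy userspace-side model of a syncobj-style "future fence" container.
# Hypothetical names; the real drm_syncobj uAPI looks nothing like this.

import threading

class Syncobj:
    def __init__(self):
        self._cv = threading.Condition()
        self._signaled = False

    def signal(self):
        # Called when the underlying work completes.
        with self._cv:
            self._signaled = True
            self._cv.notify_all()

    def wait(self, timeout=None):
        # Blocks until the (possibly not-yet-materialized) fence signals.
        with self._cv:
            return self._cv.wait_for(lambda: self._signaled, timeout)

obj = Syncobj()                            # handed out before any work is done
threading.Timer(0.05, obj.signal).start()  # "hardware" signals later
print(obj.wait(timeout=1.0))               # -> True
```

The point is that the wait happens in userspace (or in a well-defined wait primitive), not hidden behind dma_resv bookkeeping in the kernel.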
Cheers, Daniel
Regards, Christian.
I think the only thing missing is clarifying a bit what you have under item 3, i.e. how do we make sure there's no accidental hidden dependency between hmm_fence and dma_fence. Maybe a subsection about gpu page fault handling?
The real improvement is item 6. The problem with it is that it requires auditing all occasions when we create dma_fences so that we don't accidentally depend on an HMM fence.
We have that rule already, it's the "dma_fence must not depend upon an infinite fence anywhere" rule we documented last summer. So that doesn't feel new. -Daniel
Regards, Christian.
Or are we still talking past each other a bit here? -Daniel
Regards, Christian.
-Daniel
Am 14.01.21 um 16:40 schrieb Daniel Vetter:
[SNIP]
So I think we have to somehow solve this in the kernel or we will go in circles all the time.
So from that pov I think the kernel should at most deal with an hmm_fence for cross-process communication and maybe some standard wait primitives (for userspace to use, not for the kernel).
The only use case this would forbid is using page faults for legacy implicit/explicit dma_fence synced workloads, and I think that's perfectly ok to not allow. Especially since the motivation here for all this is compute, and compute doesn't pass around dma_fences anyway.
As Alex said we will rather soon see this for gfx as well and we most likely will see combinations of old dma_fence based integrated graphics with new dedicated GPUs.
So I don't think we can say we reduce the problem to compute and don't support anything else.
I'm not against pagefaults for gfx, just against pushing the magic into the kernel. I don't think that works, because it means we add stall points where userspace, especially vk userspace, really doesn't want them. So, the same way as with timeline syncobj, we need to push the compat work into userspace.
There's going to be a few stall points:
- fully new stack, we wait for the userspace fence in the atomic commit path (which we can, if we're really careful, since we pin all buffers upfront and so there's no risk)
- userspace fencing gpu in the client, compositor protocol can pass around userspace fences, but the compositor still uses dma_fence for itself. There's some stalling in the compositor, which it does already anyway when it's collecting new frames from clients
- userspace fencing gpu in the client, but no compositor protocol: we wait in the swapchain, but in a separate thread so that nothing blocks that shouldn't block
If we instead go with "magic waits in the kernel behind userspace's back", like what your item 6 would imply, then we're not really solving anything.
For the actual implementation I think the best would be an extension of drm_syncobj. Those already have, at least conceptually, future/infinite fences, and we already have fd passing, so we "just" need some protocol to pass them around. Plus we could use the same uapi for timeline syncobj using dma_fence as for hmm_fence, so it's also easier for userspace to transition to the new world, since it doesn't need the new hw capability to roll out the new uapi and protocols.
That's not that hard to roll out, and technically a lot better than hacking up dma_resv and hoping we don't end up stalling in wrong places, which sounds very "eeeek" to me :-)
Yeah, that's what I totally agree upon :)
My idea was just the last resort, since we are mixing userspace sync and memory management so creatively here.
Stalling in userspace will probably get some push back as well, but maybe not as much as stalling in the kernel.
Ok if we can at least remove implicit sync from the picture then the question remains how do we integrate HMM into drm_syncobj then?
Regards, Christian.
Cheers, Daniel
On Thu, Jan 14, 2021 at 5:01 PM Christian König christian.koenig@amd.com wrote:
Am 14.01.21 um 16:40 schrieb Daniel Vetter:
[SNIP]
Yeah, that's what I totally agree upon :)
My idea was just the last resort, since we are mixing userspace sync and memory management so creatively here.
Stalling in userspace will probably get some push back as well, but maybe not as much as stalling in the kernel.
I guess we need to have last-resort stalling in the kernel, but no more than what we do with drm_syncobj future fences right now. Like when anything asks for a dma_fence out of an hmm_fence drm_syncobj, we just stall until the hmm_fence is signalled, and then create a dma_fence that's already signalled and return that to the caller. Obviously this shouldn't happen, since anyone who's timeline aware will check whether the fence has at least materialized first and stall somewhere more useful for that first.
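[Editor's note: the fallback Daniel describes can be sketched as a small userspace model. `hmm_fence`, `dma_fence_model` and `syncobj_materialize` are illustrative stand-ins, not the kernel API.]

```c
/* Userspace model (not kernel code) of the "last resort" conversion:
 * asking for a dma_fence out of an hmm_fence-backed drm_syncobj stalls
 * until the hmm_fence has signalled, then hands back a dma_fence that
 * is already signalled, so no dma_fence waiter can ever end up
 * depending on future page-fault servicing. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct hmm_fence {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            signalled;
};

struct dma_fence_model {
    bool signalled;
};

static void hmm_fence_signal(struct hmm_fence *f)
{
    pthread_mutex_lock(&f->lock);
    f->signalled = true;
    pthread_cond_broadcast(&f->cond);
    pthread_mutex_unlock(&f->lock);
}

/* Stall until the hmm_fence signals, then return an already-signalled
 * dma_fence. */
static struct dma_fence_model syncobj_materialize(struct hmm_fence *f)
{
    pthread_mutex_lock(&f->lock);
    while (!f->signalled)
        pthread_cond_wait(&f->cond, &f->lock);
    pthread_mutex_unlock(&f->lock);
    return (struct dma_fence_model){ .signalled = true };
}
```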
Ok if we can at least remove implicit sync from the picture then the question remains how do we integrate HMM into drm_syncobj then?
From an uapi pov, probably just an ioctl to create an hmm drm_syncobj, and a syncobj ioctl to query whether it's an hmm_fence or dma_fence syncobj, so that userspace can be a bit more clever about where it should stall - for an hmm_fence the stall will most likely be directly on the gpu in many cases (so the ioctl should also give us all the details about that if it's an hmm fence).
I think the real work is going through all the hardware and trying to figure out what the common ground for userspace fences is. Stuff like can they be in system memory, or do they need something special (wc maybe, but I hope system memory should be fine for everyone), and how you count, wrap and compare. I also have no idea how/if we can optimize cpu waits across different drivers.
Plus ideally we get some actual wayland protocol going for passing drm_syncobj around, so we can test it. -Daniel
Am 14.01.21 um 17:36 schrieb Daniel Vetter:
On Thu, Jan 14, 2021 at 5:01 PM Christian König christian.koenig@amd.com wrote:
Am 14.01.21 um 16:40 schrieb Daniel Vetter:
[SNIP]
I guess we need to have last-resort stalling in the kernel, but no more than what we do with drm_syncobj future fences right now. Like when anything asks for a dma_fence out of an hmm_fence drm_syncobj, we just stall until the hmm_fence is signalled, and then create a dma_fence that's already signalled and return that to the caller.
Good idea. BTW: We should somehow teach lockdep that this materialization of any future fence should not happen while holding a reservation lock?
Obviously this shouldn't happen, since anyone who's timeline aware will check whether the fence has at least materialized first and stall somewhere more useful for that first.
Well if I'm not completely mistaken it should help with existing stuff like an implicit fence for atomic modeset etc...
Ok if we can at least remove implicit sync from the picture then the question remains how do we integrate HMM into drm_syncobj then?
From an uapi pov probably just an ioctl to create an hmm drm_syncobj, and a syncobj ioctl to query whether it's a hmm_fence or dma_fence syncobj, so that userspace can be a bit more clever with where it should stall - for an hmm_fence the stall will most likely be directly on the gpu in many cases (so the ioctl should also give us all the details about that if it's an hmm fence).
I think the real work is going through all the hardware and trying to figure out what the common ground for userspace fences is. Stuff like can they be in system memory, or do they need something special (wc maybe, but I hope system memory should be fine for everyone), and how you count, wrap and compare. I also have no idea how/if we can optimize cpu waits across different drivers.
I think that this is absolutely hardware dependent. E.g. AMD will probably have handles, so that the hardware scheduler can counter problems like priority inversion.
What we should probably do is handle this similarly to how DMA-buf is handled - if it's the same driver and device, the drm_syncobj can use the same handle for both sides.
If it's a different driver or device, we go through some CPU round trip for the signaling.
Plus ideally we get some actual wayland protocol going for passing drm_syncobj around, so we can test it.
And DRI3 :)
Christian.
-Daniel
On Thu, Jan 14, 2021 at 08:08:06PM +0100, Christian König wrote:
Am 14.01.21 um 17:36 schrieb Daniel Vetter:
On Thu, Jan 14, 2021 at 5:01 PM Christian König christian.koenig@amd.com wrote:
Am 14.01.21 um 16:40 schrieb Daniel Vetter:
[SNIP]
I guess we need to have last-resort stalling in the kernel, but no more than what we do with drm_syncobj future fences right now. Like when anything asks for a dma_fence out of an hmm_fence drm_syncobj, we just stall until the hmm_fence is signalled, and then create a dma_fence that's already signalled and return that to the caller.
Good idea. BTW: We should somehow teach lockdep that this materialization of any future fence should not happen while holding a reservation lock?
Good idea, should be easy to add (although the explanation why it works needs a comment).
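[Editor's note: the invariant Christian and Daniel are agreeing on - no future-fence materialization while a reservation lock is held - can be modeled in a few lines. This is a toy model, not lockdep itself; a per-thread counter stands in for lockdep's held-lock tracking and all names are illustrative.]

```c
/* Toy model of the rule under discussion: materializing a future fence
 * (turning an hmm_fence into a dma_fence) must never happen while a
 * dma_resv reservation lock is held. */
#include <assert.h>
#include <stdbool.h>

static _Thread_local int resv_locks_held;

static void resv_lock(void)   { resv_locks_held++; }
static void resv_unlock(void) { resv_locks_held--; }

/* Where real code would trigger a lockdep splat, this simply reports
 * whether materialization is currently allowed. */
static bool may_materialize_future_fence(void)
{
    return resv_locks_held == 0;
}
```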
Obviously this shouldn't happen, since anyone who's timeline aware will check whether the fence has at least materialized first and stall somewhere more useful for that first.
Well if I'm not completely mistaken it should help with existing stuff like an implicit fence for atomic modeset etc...
Modeset is special:
- we fully pin buffers before we even start waiting. That means the loop can't close, since no one can try to evict our pinned buffer and would hence end up waiting on our hmm fence. We also only unpin them after everything is done.
- there's out-fences, but as long as we require that the in and out fences are of the same type, that should be all fine. Also, since the explicit in/out fence stuff is there already, it shouldn't be too hard to add support for syncobj fences without touching a lot of drivers - all the ones that use the atomic commit helpers should Just Work.
Ok if we can at least remove implicit sync from the picture then the question remains how do we integrate HMM into drm_syncobj then?
From an uapi pov probably just an ioctl to create an hmm drm_syncobj, and a syncobj ioctl to query whether it's a hmm_fence or dma_fence syncobj, so that userspace can be a bit more clever with where it should stall - for an hmm_fence the stall will most likely be directly on the gpu in many cases (so the ioctl should also give us all the details about that if it's an hmm fence).
I think the real work is going through all the hardware and trying to figure out what the common ground for userspace fences is. Stuff like can they be in system memory, or do they need something special (wc maybe, but I hope system memory should be fine for everyone), and how you count, wrap and compare. I also have no idea how/if we can optimize cpu waits across different drivers.
I think that this is absolutely hardware dependent. E.g. AMD will probably have handles, so that the hardware scheduler can counter problems like priority inversion.
What we should probably do is handle this similarly to how DMA-buf is handled - if it's the same driver and device, the drm_syncobj can use the same handle for both sides.
If it's a different driver or device, we go through some CPU round trip for the signaling.
I think we should try to be slightly more standardized; dma-buf was a bit too much of a free-for-all. But maybe that's not really possible, since we tried this with dma-fence and ended up with exactly the situation you're describing for hmm fences.
Plus ideally we get some actual wayland protocol going for passing drm_syncobj around, so we can test it.
And DRI3 :)
Yeah. Well probably Present extension, since that's the thing that's doing the flipping. At least we only have to really care about XWayland for that, with this time horizon at least. -Daniel
On Thu, Jan 14, 2021 at 02:37:36PM +0100, Christian König wrote:
Am 14.01.21 um 12:52 schrieb Daniel Vetter:
[SNIP]
I had a new idea; I wanted to think more about it but have not yet, so anyway, here it is. Add a new callback to dma_fence which asks the question: can it deadlock? Any time a GPU driver has a pending page fault (i.e. something calling into the mm) it answers yes, otherwise no. The GPU shrinker would ask the question before waiting on any dma-fence and back off if it gets a yes. The shrinker can still try many dma-buf objects for which it does not get a yes on the associated fence.
This does not solve the mmu notifier case; for that you would just invalidate the gem userptr object (with a flag, but not releasing the page refcount), but you would not wait for the GPU (i.e. no dma-fence wait in that code path anymore). The userptr API never really made the contract that it will always be in sync with the mm view of the world, so if different pages get remapped to the same virtual address while the GPU is still working with the old pages, it should not be an issue (it would not be in our usage of userptr for compositors and whatnot).
The current working idea in my mind goes into a similar direction.
But instead of a callback I'm adding a complete new class of HMM fences.
Waiting in the MMU notifier, scheduler, TTM etc. is only allowed for dma_fences, and HMM fences are ignored in container objects.
When you handle an implicit or explicit synchronization request from userspace you need to block for HMM fences to complete before taking any resource locks.
Isn't that what I call gang scheduling? I.e. you either run in HMM mode, or in legacy fencing mode (whether implicit or explicit doesn't really matter I think). By forcing that split we avoid the problem, but it means occasionally full stalls on mixed workloads.
But that's not what Jerome wants (afaiui at least), I think his idea is to track the reverse dependencies of all the fences floating around, and then skip evicting an object if you have to wait for any fence that is problematic for the current calling context. And I don't think that's very feasible in practice.
So what kind of hmm fences do you have in mind here?
It's a bit more relaxed than your gang schedule.
The requirements are as follows:

1. dma_fences never depend on hmm_fences.
2. hmm_fences can never preempt dma_fences.
3. dma_fences must be able to preempt hmm_fences, or we always reserve enough hardware resources (CUs) to guarantee forward progress of dma_fences.

Critical sections are MMU notifiers, page faults, GPU schedulers and dma_reservation object locks.

4. It is valid to wait for a dma_fence in critical sections.
5. It is not valid to wait for hmm_fences in critical sections.

Fence creation either happens during command submission or by adding something like a barrier or signal command to your userspace queue.

6. If we have an hmm_fence as implicit or explicit dependency for creating a dma_fence, we must wait for that before taking any locks or reserving resources.
7. If we have a dma_fence as implicit or explicit dependency for creating an hmm_fence, we can wait later on, so busy waiting or special WAIT hardware commands are valid.
This avoids hard cuts; e.g. we can mix hmm_fences and dma_fences at the same time on the hardware.
In other words we can have a high priority gfx queue running jobs based on dma_fences and a low priority compute queue running jobs based on hmm_fences.
Only when we switch from hmm_fence to dma_fence do we need to block the submission until all the necessary resources (both memory as well as CUs) are available.
This is somewhat an extension to your gang submit idea.
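[Editor's note: the dependency matrix implied by rules 1, 4 and 5 above can be made explicit with a minimal userspace encoding. This is a sketch of the stated invariants, not code from the patch series.]

```c
/* Minimal encoding of rules 1, 4 and 5 from the list above. */
#include <assert.h>
#include <stdbool.h>

enum fence_type { DMA_FENCE, HMM_FENCE };

/* Rule 1: a dma_fence may never depend on an hmm_fence; every other
 * combination is allowed. */
static bool dependency_allowed(enum fence_type waiter, enum fence_type dep)
{
    return !(waiter == DMA_FENCE && dep == HMM_FENCE);
}

/* Rules 4 and 5: critical sections (MMU notifiers, page faults, GPU
 * schedulers, dma_reservation object locks) may wait on dma_fences but
 * never on hmm_fences. */
static bool wait_allowed_in_critical_section(enum fence_type dep)
{
    return dep == DMA_FENCE;
}
```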
What is an hmm_fence? You should not have fences with HMM at all, so I am kind of scared now.
Cheers, Jérôme
Am 2021-01-14 um 11:51 a.m. schrieb Jerome Glisse:
On Thu, Jan 14, 2021 at 02:37:36PM +0100, Christian König wrote:
Am 14.01.21 um 12:52 schrieb Daniel Vetter:
[SNIP]
What is an hmm_fence? You should not have fences with HMM at all, so I am kind of scared now.
I kind of had the same question trying to follow Christian and Daniel's discussion. I think an HMM fence would be any fence resulting from the completion of a user mode operation in a context with HMM-based memory management that may stall indefinitely due to page faults.
But on a hardware engine that cannot preempt page-faulting work and has not reserved resources to guarantee forward progress for kernel jobs, I think all fences will need to be HMM fences, because any work submitted to such an engine can stall by getting stuck behind a stalled user mode operation.
So for example, you have a DMA engine that can preempt during page faults, but a graphics engine that cannot. Then work submitted to the DMA engine can use dma_fence. But work submitted to the graphics engine must use hmm_fence. To avoid deadlocks, dma_fences must never depend on hmm_fences and resolution of page faults must never depend on hmm_fences.
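[Editor's note: Felix's example reads as a simple decision rule, sketched below with illustrative types rather than driver code.]

```c
/* Decision rule from the example above: an engine's completion fences
 * may be dma_fences only if the engine can preempt page-faulting work
 * or has reserved resources that guarantee forward progress for kernel
 * jobs; otherwise its fences must be hmm_fences. */
#include <assert.h>
#include <stdbool.h>

enum fence_type { DMA_FENCE, HMM_FENCE };

struct engine_caps {
    bool can_preempt_during_fault;
    bool has_reserved_resources;
};

static enum fence_type engine_fence_type(const struct engine_caps *e)
{
    if (e->can_preempt_during_fault || e->has_reserved_resources)
        return DMA_FENCE;
    return HMM_FENCE;
}
```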
Regards, Felix
Cheers, Jérôme
Am 14.01.21 um 22:13 schrieb Felix Kuehling:
Am 2021-01-14 um 11:51 a.m. schrieb Jerome Glisse:
On Thu, Jan 14, 2021 at 02:37:36PM +0100, Christian König wrote:
Am 14.01.21 um 12:52 schrieb Daniel Vetter:
[SNIP]
I kind of had the same question trying to follow Christian and Daniel's discussion. I think an HMM fence would be any fence resulting from the completion of a user mode operation in a context with HMM-based memory management that may stall indefinitely due to page faults.
It was more of a placeholder for something which can be used for inter process synchronization.
But on a hardware engine that cannot preempt page-faulting work and has not reserved resources to guarantee forward progress for kernel jobs, I think all fences will need to be HMM fences, because any work submitted to such an engine can stall by getting stuck behind a stalled user mode operation.
So for example, you have a DMA engine that can preempt during page faults, but a graphics engine that cannot. Then work submitted to the DMA engine can use dma_fence. But work submitted to the graphics engine must use hmm_fence. To avoid deadlocks, dma_fences must never depend on hmm_fences and resolution of page faults must never depend on hmm_fences.
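The per-engine distinction in that example could be sketched like this (a hypothetical helper, not an existing amdgpu API): an engine's completions may only be exposed as dma_fences if it can preempt faulting work or has resources reserved to guarantee forward progress.

```c
#include <assert.h>
#include <stdbool.h>

enum fence_kind { DMA_FENCE, HMM_FENCE };

struct engine_caps {
	bool can_preempt_on_fault;   /* e.g. the DMA engine in the example */
	bool has_reserved_resources; /* e.g. CUs set aside for kernel jobs */
};

/* Work on an engine that can neither preempt faulting work nor
 * guarantee forward progress can stall behind a stalled user mode
 * operation, so its completion must be an hmm_fence. */
enum fence_kind engine_fence_kind(const struct engine_caps *e)
{
	if (e->can_preempt_on_fault || e->has_reserved_resources)
		return DMA_FENCE;
	return HMM_FENCE;
}
```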
Yeah, it's a bit more complicated but in general that fits.
Regards, Christian.
Regards, Felix
Cheers, Jérôme
On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
This is the first version of our HMM based shared virtual memory manager for KFD. There are still a number of known issues that we're working through (see below). This will likely lead to some pretty significant changes in MMU notifier handling and locking on the migration code paths. So don't get hung up on those details yet.
[...]
Known issues:
- won't work with IOMMU enabled, we need to dma_map all pages properly
- still working on some race conditions and random bugs
- performance is not great yet
What would those changes look like? Seeing the issues below, I do not see how they interplay with MMU notifiers. Can you elaborate?
Cheers, Jérôme
On 2021-01-13 at 11:47 a.m., Jerome Glisse wrote:
On Wed, Jan 06, 2021 at 10:00:52PM -0500, Felix Kuehling wrote:
This is the first version of our HMM based shared virtual memory manager for KFD. There are still a number of known issues that we're working through (see below). This will likely lead to some pretty significant changes in MMU notifier handling and locking on the migration code paths. So don't get hung up on those details yet.
[...]
Known issues:
- won't work with IOMMU enabled, we need to dma_map all pages properly
- still working on some race conditions and random bugs
- performance is not great yet
What would those changes look like? Seeing the issues below, I do not see how they interplay with MMU notifiers. Can you elaborate?
We currently have some race conditions when multiple threads are causing migrations concurrently (e.g. CPU page faults, GPU page faults, memory evictions, and explicit prefetch by the application).
In the current patch series we set up one MMU range notifier for the entire address space because we had trouble setting up MMU notifiers for specific address ranges. There are situations where we want to free, resize, or reallocate MMU range notifiers, but we can't due to the locking context we're in:
* MMU release notifier when a virtual address range is unmapped
* CPU page fault handler
In both of these situations we may need to split virtual address ranges because we only want to free or migrate part of them. With per-address-range notifiers we would also need to free or create notifiers, which is not possible in those contexts. On the other hand, using a single range notifier for everything causes unnecessary serialization.
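For illustration, splitting a range so that only part of it is freed or migrated might look like this (a simplified, hypothetical `svm_range`; the real KFD structure differs). Note that with per-range notifiers this is exactly the operation that would also require freeing and creating notifiers, which cannot be done in an MMU release notifier or CPU fault handler context:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for a KFD SVM range: covers [start, last]. */
struct svm_range {
	unsigned long start;
	unsigned long last;
};

/* Split *range at addr: *range keeps [start, addr - 1], the returned
 * new range covers [addr, last]. Returns NULL if addr is not strictly
 * inside the range or on allocation failure. */
struct svm_range *svm_range_split(struct svm_range *range, unsigned long addr)
{
	struct svm_range *tail;

	if (addr <= range->start || addr > range->last)
		return NULL;

	tail = malloc(sizeof(*tail));
	if (!tail)
		return NULL;
	tail->start = addr;
	tail->last = range->last;
	range->last = addr - 1;
	return tail;
}
```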
We're reworking all of this to have per-address range notifiers that are updated with a deferred mechanism in workers. I finally figured out how to do that in a clean way, hopefully without races or deadlocks, which should also address the other race conditions we had with concurrent migration triggers. Philip is working on the implementation.
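One way to picture that deferred mechanism (a userspace model with assumed names, not the actual implementation): contexts that cannot free or create notifiers only queue the operation, and a worker running in a context where sleeping and allocation are allowed applies everything later.

```c
#include <assert.h>
#include <stdlib.h>

enum deferred_op { DEFER_CREATE_NOTIFIER, DEFER_FREE_NOTIFIER };

struct deferred_work {
	enum deferred_op op;
	struct deferred_work *next;
};

static struct deferred_work *pending; /* filled from restricted contexts */
static int live_notifiers;

/* Callable from contexts (CPU fault handler, MMU release notifier)
 * where a notifier cannot be created or freed: just queue the request. */
static void defer(enum deferred_op op)
{
	struct deferred_work *w = malloc(sizeof(*w));
	w->op = op;
	w->next = pending;
	pending = w;
}

/* Worker running later, in a context where the notifier bookkeeping is
 * safe: apply everything that was deferred. */
static void deferred_worker(void)
{
	while (pending) {
		struct deferred_work *w = pending;
		pending = w->next;
		if (w->op == DEFER_CREATE_NOTIFIER)
			live_notifiers++;
		else
			live_notifiers--;
		free(w);
	}
}
```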
Regards, Felix
Cheers, Jérôme
dri-devel@lists.freedesktop.org