[PATCH RFC 0/3] MAP_POPULATE for device memory

List overview All Threads
Download

newer

older

RE: a null derefrence at [v7, 9/9]...

[PATCH RFC v2] mm: Add...

Jarkko Sakkinen

6 Mar 2022 6 Mar '22

5:32 a.m.

For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow to use that for initializing the device memory by providing a new callback f_ops->populate() for the purpose.

SGX patches are provided to show the callback in context.

An obvious alternative is a ioctl but it is less elegant and requires two syscalls (mmap + ioctl) per memory range, instead of just one (mmap).

Jarkko Sakkinen (3): mm: Add f_ops->populate() x86/sgx: Export sgx_encl_page_alloc() x86/sgx: Implement EAUG population with MAP_POPULATE

-- 2.35.1

Show replies by date

Jarkko Sakkinen

6 Mar 6 Mar

5:32 a.m.

New subject: [PATCH RFC 1/3] mm: Add f_ops->populate()

Sometimes you might want to use MAP_POPULATE to ask a device driver to initialize the device memory in some specific manner. SGX driver can use this to request more memory by issuing ENCLS[EAUG] x86 opcode for each page in the address range.

Add f_ops->populate() with the same parameters as f_ops->mmap() and make it conditionally called inside call_mmap(). Update call sites accodingly. --- Signed-off-by: Jarkko Sakkinen jarkko@kernel.org v3: - if (!ret && do_populate && file->f_op->populate) + if (!ret && do_populate && file->f_op->populate && + !!(vma->vm_flags & (VM_IO | VM_PFNMAP))) (reported by Matthew Wilcox) v2: - if (!ret && do_populate) + if (!ret && do_populate && file->f_op->populate) (reported by Jan Harkes) --- arch/mips/kernel/vdso.c | 2 +- drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c | 2 +- fs/coda/file.c | 2 +- fs/overlayfs/file.c | 2 +- include/linux/fs.h | 12 ++++++++++-- include/linux/mm.h | 2 +- ipc/shm.c | 2 +- mm/mmap.c | 10 +++++----- mm/nommu.c | 4 ++-- 9 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c index 3d0cf471f2fe..89f3f3da9abd 100644 --- a/arch/mips/kernel/vdso.c +++ b/arch/mips/kernel/vdso.c @@ -102,7 +102,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) base = mmap_region(NULL, STACK_TOP, PAGE_SIZE, VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC, - 0, NULL); + 0, NULL, false); if (IS_ERR_VALUE(base)) { ret = base; goto out; diff --git a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c index 1b526039a60d..4c71f64d6a79 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c @@ -107,7 +107,7 @@ static int i915_gem_dmabuf_mmap(struct dma_buf *dma_buf, struct vm_area_struct * if (!obj->base.filp) return -ENODEV;

- ret = call_mmap(obj->base.filp, vma); + ret = call_mmap(obj->base.filp, vma, false); if (ret) return ret;

diff --git a/fs/coda/file.c b/fs/coda/file.c index 29dd87be2fb8..e14f312fdbf8 100644 --- a/fs/coda/file.c +++ b/fs/coda/file.c @@ -173,7 +173,7 @@ coda_file_mmap(struct file *coda_file, struct vm_area_struct *vma) spin_unlock(&cii->c_lock);

vma->vm_file = get_file(host_file); - ret = call_mmap(vma->vm_file, vma); + ret = call_mmap(vma->vm_file, vma, false);

if (ret) { /* if call_mmap fails, our caller will put host_file so we diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index fa125feed0ff..b963a9397e80 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -503,7 +503,7 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma) vma_set_file(vma, realfile);

old_cred = ovl_override_creds(file_inode(file)->i_sb); - ret = call_mmap(vma->vm_file, vma); + ret = call_mmap(vma->vm_file, vma, false); revert_creds(old_cred); ovl_file_accessed(file);

diff --git a/include/linux/fs.h b/include/linux/fs.h index e2d892b201b0..2909e2d14af8 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -42,6 +42,7 @@ #include <linux/mount.h> #include <linux/cred.h> #include <linux/mnt_idmapping.h> +#include <linux/mm.h>

#include <asm/byteorder.h> #include <uapi/linux/fs.h> @@ -1993,6 +1994,7 @@ struct file_operations { long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); + int (*populate)(struct file *, struct vm_area_struct *); unsigned long mmap_supported_flags; int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id); @@ -2074,9 +2076,15 @@ static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio, return file->f_op->write_iter(kio, iter); }

-static inline int call_mmap(struct file *file, struct vm_area_struct *vma) +static inline int call_mmap(struct file *file, struct vm_area_struct *vma, bool do_populate) { - return file->f_op->mmap(file, vma); + int ret = file->f_op->mmap(file, vma); + + if (!ret && do_populate && file->f_op->populate && + !!(vma->vm_flags & (VM_IO | VM_PFNMAP))) + ret = file->f_op->populate(file, vma); + + return ret; }

extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *); diff --git a/include/linux/mm.h b/include/linux/mm.h index 213cc569b192..6c8c036f423b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2683,7 +2683,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo

extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, - struct list_head *uf); + struct list_head *uf, bool do_populate); extern unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, unsigned long pgoff, unsigned long *populate, struct list_head *uf); diff --git a/ipc/shm.c b/ipc/shm.c index b3048ebd5c31..89b28f32acf0 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -587,7 +587,7 @@ static int shm_mmap(struct file *file, struct vm_area_struct *vma) if (ret) return ret;

- ret = call_mmap(sfd->file, vma); + ret = call_mmap(sfd->file, vma, do_populate); if (ret) { shm_close(vma); return ret; diff --git a/mm/mmap.c b/mm/mmap.c index 1e8fdb0b51ed..5eca79957d4c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1413,6 +1413,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long flags, unsigned long pgoff, unsigned long *populate, struct list_head *uf) { + bool do_populate = (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE; struct mm_struct *mm = current->mm; vm_flags_t vm_flags; int pkey = 0; @@ -1579,10 +1580,9 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags |= VM_NORESERVE; }

- addr = mmap_region(file, addr, len, vm_flags, pgoff, uf); + addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, do_populate); if (!IS_ERR_VALUE(addr) && - ((vm_flags & VM_LOCKED) || - (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE)) + ((vm_flags & VM_LOCKED) || do_populate)) *populate = len; return addr; } @@ -1721,7 +1721,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)

unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff, - struct list_head *uf) + struct list_head *uf, bool do_populate) { struct mm_struct *mm = current->mm; struct vm_area_struct *vma, *prev, *merge; @@ -1790,7 +1790,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr, }

vma->vm_file = get_file(file); - error = call_mmap(file, vma); + error = call_mmap(file, vma, do_populate); if (error) goto unmap_and_free_vma;

diff --git a/mm/nommu.c b/mm/nommu.c index 55a9e48a7a02..a3c20b803c27 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -941,7 +941,7 @@ static int do_mmap_shared_file(struct vm_area_struct *vma) { int ret;

- ret = call_mmap(vma->vm_file, vma); + ret = call_mmap(vma->vm_file, vma, false); if (ret == 0) { vma->vm_region->vm_top = vma->vm_region->vm_end; return 0; @@ -972,7 +972,7 @@ static int do_mmap_private(struct vm_area_struct *vma, * - VM_MAYSHARE will be set if it may attempt to share */ if (capabilities & NOMMU_MAP_DIRECT) { - ret = call_mmap(vma->vm_file, vma); + ret = call_mmap(vma->vm_file, vma, false); if (ret == 0) { /* shouldn't return success if we're not sharing */ BUG_ON(!(vma->vm_flags & VM_MAYSHARE));

-- 2.35.1

Greg Kroah-Hartman

10:01 a.m.

New subject: [PATCH RFC 1/3] mm: Add f_ops->populate()

On Sun, Mar 06, 2022 at 07:32:05AM +0200, Jarkko Sakkinen wrote:

...

Sometimes you might want to use MAP_POPULATE to ask a device driver to initialize the device memory in some specific manner. SGX driver can use this to request more memory by issuing ENCLS[EAUG] x86 opcode for each page in the address range.

Add f_ops->populate() with the same parameters as f_ops->mmap() and make it conditionally called inside call_mmap(). Update call sites accodingly.

Signed-off-by: Jarkko Sakkinen jarkko@kernel.org v3:
  if (!ret && do_populate && file->f_op->populate)
  if (!ret && do_populate && file->f_op->populate &&
      !!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
(reported by Matthew Wilcox) v2:
  if (!ret && do_populate)
  if (!ret && do_populate && file->f_op->populate)
(reported by Jan Harkes)

arch/mips/kernel/vdso.c | 2 +- drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c | 2 +- fs/coda/file.c | 2 +- fs/overlayfs/file.c | 2 +- include/linux/fs.h | 12 ++++++++++-- include/linux/mm.h | 2 +- ipc/shm.c | 2 +- mm/mmap.c | 10 +++++----- mm/nommu.c | 4 ++-- 9 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c index 3d0cf471f2fe..89f3f3da9abd 100644 --- a/arch/mips/kernel/vdso.c +++ b/arch/mips/kernel/vdso.c @@ -102,7 +102,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) base = mmap_region(NULL, STACK_TOP, PAGE_SIZE, VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
		0, NULL);
		0, NULL, false);
if (IS_ERR_VALUE(base)) { ret = base; goto out;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c index 1b526039a60d..4c71f64d6a79 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c @@ -107,7 +107,7 @@ static int i915_gem_dmabuf_mmap(struct dma_buf *dma_buf, struct vm_area_struct * if (!obj->base.filp) return -ENODEV;

ret = call_mmap(obj->base.filp, vma);

ret = call_mmap(obj->base.filp, vma, false); if (ret) return ret;

diff --git a/fs/coda/file.c b/fs/coda/file.c index 29dd87be2fb8..e14f312fdbf8 100644 --- a/fs/coda/file.c +++ b/fs/coda/file.c @@ -173,7 +173,7 @@ coda_file_mmap(struct file *coda_file, struct vm_area_struct *vma) spin_unlock(&cii->c_lock);

vma->vm_file = get_file(host_file);

ret = call_mmap(vma->vm_file, vma);

ret = call_mmap(vma->vm_file, vma, false);

if (ret) { /* if call_mmap fails, our caller will put host_file so we

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index fa125feed0ff..b963a9397e80 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -503,7 +503,7 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma) vma_set_file(vma, realfile);

old_cred = ovl_override_creds(file_inode(file)->i_sb);

ret = call_mmap(vma->vm_file, vma);

ret = call_mmap(vma->vm_file, vma, false); revert_creds(old_cred); ovl_file_accessed(file);

diff --git a/include/linux/fs.h b/include/linux/fs.h index e2d892b201b0..2909e2d14af8 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -42,6 +42,7 @@ #include <linux/mount.h> #include <linux/cred.h> #include <linux/mnt_idmapping.h> +#include <linux/mm.h>

#include <asm/byteorder.h> #include <uapi/linux/fs.h> @@ -1993,6 +1994,7 @@ struct file_operations { long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *);

int (*populate)(struct file *, struct vm_area_struct *); unsigned long mmap_supported_flags; int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id);

@@ -2074,9 +2076,15 @@ static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio, return file->f_op->write_iter(kio, iter); }

-static inline int call_mmap(struct file *file, struct vm_area_struct *vma) +static inline int call_mmap(struct file *file, struct vm_area_struct *vma, bool do_populate) {

return file->f_op->mmap(file, vma);
int ret = file->f_op->mmap(file, vma);

if (!ret && do_populate && file->f_op->populate &&
   !!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
ret = file->f_op->populate(file, vma);
return ret;
}

extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *); diff --git a/include/linux/mm.h b/include/linux/mm.h index 213cc569b192..6c8c036f423b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2683,7 +2683,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo

extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,

struct list_head *uf);

struct list_head *uf, bool do_populate);

As I have said many times before, don't add random boolean flags to function arguments, as they provide no hint as to what they do at all. When you read the code, you then have to go back and look up the function definition here and see what exactly it means and the flow is broken.

Make function names mean something obvious, for this, if it really is a good idea to have this new flag (and I doubt it, but that's not my call), then make this a mmap_region_populate() call to make it obvious it is something different than the notmal mmap_region() call.

But as is, this is pretty horrid, don't you agree?

thanks,

greg k-h

Jarkko Sakkinen

5:02 p.m.

New subject: [PATCH RFC 1/3] mm: Add f_ops->populate()

On Sun, Mar 06, 2022 at 11:01:36AM +0100, Greg Kroah-Hartman wrote:

...

On Sun, Mar 06, 2022 at 07:32:05AM +0200, Jarkko Sakkinen wrote:

...
Sometimes you might want to use MAP_POPULATE to ask a device driver to initialize the device memory in some specific manner. SGX driver can use this to request more memory by issuing ENCLS[EAUG] x86 opcode for each page in the address range.

Add f_ops->populate() with the same parameters as f_ops->mmap() and make it conditionally called inside call_mmap(). Update call sites accodingly.

Signed-off-by: Jarkko Sakkinen jarkko@kernel.org v3:
  if (!ret && do_populate && file->f_op->populate)
  if (!ret && do_populate && file->f_op->populate &&
      !!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
(reported by Matthew Wilcox) v2:
  if (!ret && do_populate)
  if (!ret && do_populate && file->f_op->populate)
(reported by Jan Harkes)

arch/mips/kernel/vdso.c | 2 +- drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c | 2 +- fs/coda/file.c | 2 +- fs/overlayfs/file.c | 2 +- include/linux/fs.h | 12 ++++++++++-- include/linux/mm.h | 2 +- ipc/shm.c | 2 +- mm/mmap.c | 10 +++++----- mm/nommu.c | 4 ++-- 9 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c index 3d0cf471f2fe..89f3f3da9abd 100644 --- a/arch/mips/kernel/vdso.c +++ b/arch/mips/kernel/vdso.c @@ -102,7 +102,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) base = mmap_region(NULL, STACK_TOP, PAGE_SIZE, VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
		0, NULL);
		0, NULL, false);
if (IS_ERR_VALUE(base)) { ret = base; goto out;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c index 1b526039a60d..4c71f64d6a79 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c @@ -107,7 +107,7 @@ static int i915_gem_dmabuf_mmap(struct dma_buf *dma_buf, struct vm_area_struct * if (!obj->base.filp) return -ENODEV;

ret = call_mmap(obj->base.filp, vma);

ret = call_mmap(obj->base.filp, vma, false); if (ret) return ret;

diff --git a/fs/coda/file.c b/fs/coda/file.c index 29dd87be2fb8..e14f312fdbf8 100644 --- a/fs/coda/file.c +++ b/fs/coda/file.c @@ -173,7 +173,7 @@ coda_file_mmap(struct file *coda_file, struct vm_area_struct *vma) spin_unlock(&cii->c_lock);

vma->vm_file = get_file(host_file);

ret = call_mmap(vma->vm_file, vma);

ret = call_mmap(vma->vm_file, vma, false);

if (ret) { /* if call_mmap fails, our caller will put host_file so we

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index fa125feed0ff..b963a9397e80 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -503,7 +503,7 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma) vma_set_file(vma, realfile);

old_cred = ovl_override_creds(file_inode(file)->i_sb);

ret = call_mmap(vma->vm_file, vma);

ret = call_mmap(vma->vm_file, vma, false); revert_creds(old_cred); ovl_file_accessed(file);

diff --git a/include/linux/fs.h b/include/linux/fs.h index e2d892b201b0..2909e2d14af8 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -42,6 +42,7 @@ #include <linux/mount.h> #include <linux/cred.h> #include <linux/mnt_idmapping.h> +#include <linux/mm.h>

#include <asm/byteorder.h> #include <uapi/linux/fs.h> @@ -1993,6 +1994,7 @@ struct file_operations { long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *);

int (*populate)(struct file *, struct vm_area_struct *); unsigned long mmap_supported_flags; int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id);

@@ -2074,9 +2076,15 @@ static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio, return file->f_op->write_iter(kio, iter); }

-static inline int call_mmap(struct file *file, struct vm_area_struct *vma) +static inline int call_mmap(struct file *file, struct vm_area_struct *vma, bool do_populate) {

return file->f_op->mmap(file, vma);
int ret = file->f_op->mmap(file, vma);

if (!ret && do_populate && file->f_op->populate &&
   !!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
ret = file->f_op->populate(file, vma);
return ret;
}

extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *); diff --git a/include/linux/mm.h b/include/linux/mm.h index 213cc569b192..6c8c036f423b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2683,7 +2683,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo

extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,

struct list_head *uf);

struct list_head *uf, bool do_populate);
As I have said many times before, don't add random boolean flags to function arguments, as they provide no hint as to what they do at all. When you read the code, you then have to go back and look up the function definition here and see what exactly it means and the flow is broken.

Make function names mean something obvious, for this, if it really is a good idea to have this new flag (and I doubt it, but that's not my call), then make this a mmap_region_populate() call to make it obvious it is something different than the notmal mmap_region() call.

I can create:

* mmap_region_populate() * call_mmap_populate()

This would localize the changes and leave out those boolean parameters.

...

But as is, this is pretty horrid, don't you agree?

So can I conclude from this that in general having populate available for device memory is something horrid, or just the implementation path?

That's the main reason why I made this RFC patch set, to get clear answer to that question. I.e. if it is in general sense a bad idea, I'll just create ioctl. If it is the implementation, I'll try to improve it.

Otherwise, I don't know whether or not it is good idea to include such patch into the main SGX2 patch set. No means enforcibl tryy to push support IO memory populate.

...

thanks,

greg k-h

BR, Jarkko

Jarkko Sakkinen

5:03 p.m.

New subject: [PATCH RFC 1/3] mm: Add f_ops->populate()

On Sun, Mar 06, 2022 at 07:03:00PM +0200, Jarkko Sakkinen wrote:

...

On Sun, Mar 06, 2022 at 11:01:36AM +0100, Greg Kroah-Hartman wrote:

...
On Sun, Mar 06, 2022 at 07:32:05AM +0200, Jarkko Sakkinen wrote:

...
Sometimes you might want to use MAP_POPULATE to ask a device driver to initialize the device memory in some specific manner. SGX driver can use this to request more memory by issuing ENCLS[EAUG] x86 opcode for each page in the address range.

Add f_ops->populate() with the same parameters as f_ops->mmap() and make it conditionally called inside call_mmap(). Update call sites accodingly.

Signed-off-by: Jarkko Sakkinen jarkko@kernel.org v3:
  if (!ret && do_populate && file->f_op->populate)
  if (!ret && do_populate && file->f_op->populate &&
      !!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
(reported by Matthew Wilcox) v2:
  if (!ret && do_populate)
  if (!ret && do_populate && file->f_op->populate)
(reported by Jan Harkes)

arch/mips/kernel/vdso.c | 2 +- drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c | 2 +- fs/coda/file.c | 2 +- fs/overlayfs/file.c | 2 +- include/linux/fs.h | 12 ++++++++++-- include/linux/mm.h | 2 +- ipc/shm.c | 2 +- mm/mmap.c | 10 +++++----- mm/nommu.c | 4 ++-- 9 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c index 3d0cf471f2fe..89f3f3da9abd 100644 --- a/arch/mips/kernel/vdso.c +++ b/arch/mips/kernel/vdso.c @@ -102,7 +102,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp) base = mmap_region(NULL, STACK_TOP, PAGE_SIZE, VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC,
		0, NULL);
		0, NULL, false);
if (IS_ERR_VALUE(base)) { ret = base; goto out;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c index 1b526039a60d..4c71f64d6a79 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c @@ -107,7 +107,7 @@ static int i915_gem_dmabuf_mmap(struct dma_buf *dma_buf, struct vm_area_struct * if (!obj->base.filp) return -ENODEV;

ret = call_mmap(obj->base.filp, vma);

ret = call_mmap(obj->base.filp, vma, false); if (ret) return ret;

diff --git a/fs/coda/file.c b/fs/coda/file.c index 29dd87be2fb8..e14f312fdbf8 100644 --- a/fs/coda/file.c +++ b/fs/coda/file.c @@ -173,7 +173,7 @@ coda_file_mmap(struct file *coda_file, struct vm_area_struct *vma) spin_unlock(&cii->c_lock);

vma->vm_file = get_file(host_file);

ret = call_mmap(vma->vm_file, vma);

ret = call_mmap(vma->vm_file, vma, false);

if (ret) { /* if call_mmap fails, our caller will put host_file so we

diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index fa125feed0ff..b963a9397e80 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -503,7 +503,7 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma) vma_set_file(vma, realfile);

old_cred = ovl_override_creds(file_inode(file)->i_sb);

ret = call_mmap(vma->vm_file, vma);

ret = call_mmap(vma->vm_file, vma, false); revert_creds(old_cred); ovl_file_accessed(file);

diff --git a/include/linux/fs.h b/include/linux/fs.h index e2d892b201b0..2909e2d14af8 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -42,6 +42,7 @@ #include <linux/mount.h> #include <linux/cred.h> #include <linux/mnt_idmapping.h> +#include <linux/mm.h>

#include <asm/byteorder.h> #include <uapi/linux/fs.h> @@ -1993,6 +1994,7 @@ struct file_operations { long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *);

int (*populate)(struct file *, struct vm_area_struct *); unsigned long mmap_supported_flags; int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id);

@@ -2074,9 +2076,15 @@ static inline ssize_t call_write_iter(struct file *file, struct kiocb *kio, return file->f_op->write_iter(kio, iter); }

-static inline int call_mmap(struct file *file, struct vm_area_struct *vma) +static inline int call_mmap(struct file *file, struct vm_area_struct *vma, bool do_populate) {

return file->f_op->mmap(file, vma);
int ret = file->f_op->mmap(file, vma);

if (!ret && do_populate && file->f_op->populate &&
   !!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
ret = file->f_op->populate(file, vma);
return ret;
}

extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *); diff --git a/include/linux/mm.h b/include/linux/mm.h index 213cc569b192..6c8c036f423b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2683,7 +2683,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo

extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,

struct list_head *uf);

struct list_head *uf, bool do_populate);
As I have said many times before, don't add random boolean flags to function arguments, as they provide no hint as to what they do at all. When you read the code, you then have to go back and look up the function definition here and see what exactly it means and the flow is broken.

Make function names mean something obvious, for this, if it really is a good idea to have this new flag (and I doubt it, but that's not my call), then make this a mmap_region_populate() call to make it obvious it is something different than the notmal mmap_region() call.
I can create:

mmap_region_populate()

call_mmap_populate()

This would localize the changes and leave out those boolean parameters.

...
But as is, this is pretty horrid, don't you agree?

So can I conclude from this that in general having populate available for device memory is something horrid, or just the implementation path?

That's the main reason why I made this RFC patch set, to get clear answer to that question. I.e. if it is in general sense a bad idea, I'll just create ioctl. If it is the implementation, I'll try to improve it.

Otherwise, I don't know whether or not it is good idea to include such patch into the main SGX2 patch set. No means enforcibl tryy to push support

~~~~~ intention

BR, Jarkko

Matthew Wilcox

10:43 p.m.

New subject: [PATCH RFC 1/3] mm: Add f_ops->populate()

On Sun, Mar 06, 2022 at 07:02:57PM +0200, Jarkko Sakkinen wrote:

...

So can I conclude from this that in general having populate available for device memory is something horrid, or just the implementation path?

You haven't even attempted to explain what the problem is you're trying to solve. You've shown up with some terrible code and said "Hey, is this a good idea". No, no, it's not.

Jarkko Sakkinen

7 Mar 7 Mar

1:16 p.m.

New subject: [PATCH RFC 1/3] mm: Add f_ops->populate()

On Sun, Mar 06, 2022 at 10:43:31PM +0000, Matthew Wilcox wrote:

...

On Sun, Mar 06, 2022 at 07:02:57PM +0200, Jarkko Sakkinen wrote:

...
So can I conclude from this that in general having populate available for device memory is something horrid, or just the implementation path?

You haven't even attempted to explain what the problem is you're trying to solve. You've shown up with some terrible code and said "Hey, is this a good idea". No, no, it's not.

The problem is that in order to include memory to enclave, which is essentially a reserved address range processes virtual address space there's two steps into it:

1. Host side (kernel) does ENCLS[EAUG] to request a new page to be added to the enclave. 2. Enclave accepts request with ENCLU[EACCEPT] or ENCLU[EACCEPTCOPY].

In the current SGX2 patch set this taken care by the page fault handler. I.e. the enclave calls ENCLU[EACCEPT] for an empty address and the #PF handler then does EAUG for a single page.

So if you want to process a batch of pages this generates O(n) round-trips.

So if there was a way pre-do a batch of EAUG's, that would allow to load data to the enclave without causing page faults happening constantly.

One solution for this simply add ioctl:

https://lore.kernel.org/linux-sgx/YiLRBglTEbu8cHP9@iki.fi/T/#m195ec84bf85614...

But in practice when you wanted to use it, you would setup the parameters so that they match the mmap() range. So for pratical user space API having mmap() take care of this would be much more lean option.

BR, Jarkko

Jarkko Sakkinen

1:26 p.m.

New subject: [PATCH RFC 1/3] mm: Add f_ops->populate()

On Mon, Mar 07, 2022 at 03:16:57PM +0200, Jarkko Sakkinen wrote:

...

On Sun, Mar 06, 2022 at 10:43:31PM +0000, Matthew Wilcox wrote:

...
On Sun, Mar 06, 2022 at 07:02:57PM +0200, Jarkko Sakkinen wrote:

...
So can I conclude from this that in general having populate available for device memory is something horrid, or just the implementation path?

You haven't even attempted to explain what the problem is you're trying to solve. You've shown up with some terrible code and said "Hey, is this a good idea". No, no, it's not.

The problem is that in order to include memory to enclave, which is essentially a reserved address range processes virtual address space there's two steps into it:

Host side (kernel) does ENCLS[EAUG] to request a new page to be added to the enclave.

Enclave accepts request with ENCLU[EACCEPT] or ENCLU[EACCEPTCOPY].

In the current SGX2 patch set this taken care by the page fault handler. I.e. the enclave calls ENCLU[EACCEPT] for an empty address and the #PF handler then does EAUG for a single page.

So if you want to process a batch of pages this generates O(n) round-trips.

So if there was a way pre-do a batch of EAUG's, that would allow to load data to the enclave without causing page faults happening constantly.

One solution for this simply add ioctl:

https://lore.kernel.org/linux-sgx/YiLRBglTEbu8cHP9@iki.fi/T/#m195ec84bf85614...

But in practice when you wanted to use it, you would setup the parameters so that they match the mmap() range. So for pratical user space API having mmap() take care of this would be much more lean option.

For something like Graphene [1] the lazy #PF based option is probably a way to go. For wasm runtime that we're doing in Enarx [2] we get better performance by having something like this. I.e. we most of the time take as much as we use.

[1] https://github.com/gramineproject/graphene [2] https://enarx.dev/

BR, Jarkko

Jarkko Sakkinen

6 Mar 6 Mar

5:32 a.m.

New subject: [PATCH RFC 2/3] x86/sgx: Export sgx_encl_page_alloc()

Move sgx_encl_page_alloc() to encl.c and export it so that it can be used in the implementation for MAP_POPULATE, which requires to allocate new enclave pages.

Signed-off-by: Jarkko Sakkinen jarkko@kernel.org --- arch/x86/kernel/cpu/sgx/encl.c | 38 +++++++++++++++++++++++++++++++++ arch/x86/kernel/cpu/sgx/encl.h | 3 +++ arch/x86/kernel/cpu/sgx/ioctl.c | 38 --------------------------------- 3 files changed, 41 insertions(+), 38 deletions(-)

diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c index 89aeed798ffb..79e39bd99c09 100644 --- a/arch/x86/kernel/cpu/sgx/encl.c +++ b/arch/x86/kernel/cpu/sgx/encl.c @@ -914,6 +914,44 @@ int sgx_encl_test_and_clear_young(struct mm_struct *mm, return ret; }

+struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl, + unsigned long offset, + u64 secinfo_flags) +{ + struct sgx_encl_page *encl_page; + unsigned long prot; + + encl_page = kzalloc(sizeof(*encl_page), GFP_KERNEL); + if (!encl_page) + return ERR_PTR(-ENOMEM); + + encl_page->desc = encl->base + offset; + encl_page->encl = encl; + + prot = _calc_vm_trans(secinfo_flags, SGX_SECINFO_R, PROT_READ) | + _calc_vm_trans(secinfo_flags, SGX_SECINFO_W, PROT_WRITE) | + _calc_vm_trans(secinfo_flags, SGX_SECINFO_X, PROT_EXEC); + + /* + * TCS pages must always RW set for CPU access while the SECINFO + * permissions are *always* zero - the CPU ignores the user provided + * values and silently overwrites them with zero permissions. + */ + if ((secinfo_flags & SGX_SECINFO_PAGE_TYPE_MASK) == SGX_SECINFO_TCS) + prot |= PROT_READ | PROT_WRITE; + + /* Calculate maximum of the VM flags for the page. */ + encl_page->vm_max_prot_bits = calc_vm_prot_bits(prot, 0); + + /* + * At time of allocation, the runtime protection bits are the same + * as the maximum protection bits. + */ + encl_page->vm_run_prot_bits = encl_page->vm_max_prot_bits; + + return encl_page; +} + /** * sgx_zap_enclave_ptes() - remove PTEs mapping the address from enclave * @encl: the enclave diff --git a/arch/x86/kernel/cpu/sgx/encl.h b/arch/x86/kernel/cpu/sgx/encl.h index 1b6ce1da7c92..3df0d3faf3a1 100644 --- a/arch/x86/kernel/cpu/sgx/encl.h +++ b/arch/x86/kernel/cpu/sgx/encl.h @@ -113,6 +113,9 @@ int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index, void sgx_encl_put_backing(struct sgx_backing *backing, bool do_write); int sgx_encl_test_and_clear_young(struct mm_struct *mm, struct sgx_encl_page *page); +struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl, + unsigned long offset, + u64 secinfo_flags); void sgx_zap_enclave_ptes(struct sgx_encl *encl, unsigned long addr); struct sgx_epc_page *sgx_alloc_va_page(void); unsigned int sgx_alloc_va_slot(struct sgx_va_page *va_page); diff --git a/arch/x86/kernel/cpu/sgx/ioctl.c b/arch/x86/kernel/cpu/sgx/ioctl.c index d8c3c07badb3..3e3ca27a6f72 100644 --- a/arch/x86/kernel/cpu/sgx/ioctl.c +++ b/arch/x86/kernel/cpu/sgx/ioctl.c @@ -169,44 +169,6 @@ static long sgx_ioc_enclave_create(struct sgx_encl *encl, void __user *arg) return ret; }

-static struct sgx_encl_page *sgx_encl_page_alloc(struct sgx_encl *encl, - unsigned long offset, - u64 secinfo_flags) -{ - struct sgx_encl_page *encl_page; - unsigned long prot; - - encl_page = kzalloc(sizeof(*encl_page), GFP_KERNEL); - if (!encl_page) - return ERR_PTR(-ENOMEM); - - encl_page->desc = encl->base + offset; - encl_page->encl = encl; - - prot = _calc_vm_trans(secinfo_flags, SGX_SECINFO_R, PROT_READ) | - _calc_vm_trans(secinfo_flags, SGX_SECINFO_W, PROT_WRITE) | - _calc_vm_trans(secinfo_flags, SGX_SECINFO_X, PROT_EXEC); - - /* - * TCS pages must always RW set for CPU access while the SECINFO - * permissions are *always* zero - the CPU ignores the user provided - * values and silently overwrites them with zero permissions. - */ - if ((secinfo_flags & SGX_SECINFO_PAGE_TYPE_MASK) == SGX_SECINFO_TCS) - prot |= PROT_READ | PROT_WRITE; - - /* Calculate maximum of the VM flags for the page. */ - encl_page->vm_max_prot_bits = calc_vm_prot_bits(prot, 0); - - /* - * At time of allocation, the runtime protection bits are the same - * as the maximum protection bits. - */ - encl_page->vm_run_prot_bits = encl_page->vm_max_prot_bits; - - return encl_page; -} - static int sgx_validate_secinfo(struct sgx_secinfo *secinfo) { u64 perm = secinfo->flags & SGX_SECINFO_PERMISSION_MASK;

-- 2.35.1

Jarkko Sakkinen

5:32 a.m.

New subject: [PATCH RFC 3/3] x86/sgx: Implement EAUG population with MAP_POPULATE

With SGX1 an enclave needs to be created with its maximum memory demands pre-allocated. Pages cannot be added to an enclave after it is initialized. SGX2 introduces a new function, ENCLS[EAUG] for adding pages to an initialized enclave.

Add support for dynamically adding pages to an initialized enclave with mmap() by populating pages with EAUG. Use f_ops->populate() callback to achieve this behaviour.

Signed-off-by: Jarkko Sakkinen jarkko@kernel.org --- arch/x86/kernel/cpu/sgx/driver.c | 129 +++++++++++++++++++++++++++++++ 1 file changed, 129 insertions(+)

diff --git a/arch/x86/kernel/cpu/sgx/driver.c b/arch/x86/kernel/cpu/sgx/driver.c index aa9b8b868867..0e97e7476076 100644 --- a/arch/x86/kernel/cpu/sgx/driver.c +++ b/arch/x86/kernel/cpu/sgx/driver.c @@ -9,6 +9,7 @@ #include <asm/traps.h> #include "driver.h" #include "encl.h" +#include "encls.h"

u64 sgx_attributes_reserved_mask; u64 sgx_xfrm_reserved_mask = ~0x3; @@ -101,6 +102,133 @@ static int sgx_mmap(struct file *file, struct vm_area_struct *vma) return 0; }

+static int sgx_encl_augment_page(struct sgx_encl *encl, unsigned long offset) +{ + struct sgx_pageinfo pginfo = {0}; + struct sgx_encl_page *encl_page; + struct sgx_epc_page *epc_page; + struct sgx_va_page *va_page; + u64 secinfo_flags; + int ret; + + /* + * Ignore internal permission checking for dynamically added pages. + * They matter only for data added during the pre-initialization phase. + * The enclave decides the permissions by the means of EACCEPT, + * EACCEPTCOPY and EMODPE. + */ + secinfo_flags = SGX_SECINFO_R | SGX_SECINFO_W | SGX_SECINFO_X; + encl_page = sgx_encl_page_alloc(encl, offset, secinfo_flags); + if (IS_ERR(encl_page)) + return PTR_ERR(encl_page); + + epc_page = sgx_alloc_epc_page(encl_page, true); + if (IS_ERR(epc_page)) { + ret = PTR_ERR(epc_page); + goto err_alloc_epc_page; + } + + va_page = sgx_encl_grow(encl); + if (IS_ERR(va_page)) { + ret = PTR_ERR(va_page); + goto err_grow; + } + + mutex_lock(&encl->lock); + + /* + * Adding to encl->va_pages must be done under encl->lock. Ditto for + * deleting (via sgx_encl_shrink()) in the error path. + */ + if (va_page) + list_add(&va_page->list, &encl->va_pages); + + /* + * Insert prior to EADD in case of OOM. EADD modifies MRENCLAVE, i.e. + * can't be gracefully unwound, while failure on EADD/EXTEND is limited + * to userspace errors (or kernel/hardware bugs). + */ + ret = xa_insert(&encl->page_array, PFN_DOWN(encl_page->desc), + encl_page, GFP_KERNEL); + + /* + * If ret == -EBUSY then page was created in another flow while + * running without encl->lock + */ + if (ret) + goto err_xa_insert; + + pginfo.secs = (unsigned long)sgx_get_epc_virt_addr(encl->secs.epc_page); + pginfo.addr = encl_page->desc & PAGE_MASK; + pginfo.metadata = 0; + + ret = __eaug(&pginfo, sgx_get_epc_virt_addr(epc_page)); + if (ret) + goto err_eaug; + + encl_page->encl = encl; + encl_page->epc_page = epc_page; + encl_page->type = SGX_PAGE_TYPE_REG; + encl->secs_child_cnt++; + + sgx_mark_page_reclaimable(encl_page->epc_page); + + mutex_unlock(&encl->lock); + + return 0; + +err_eaug: + xa_erase(&encl->page_array, PFN_DOWN(encl_page->desc)); + +err_xa_insert: + sgx_encl_shrink(encl, va_page); + mutex_unlock(&encl->lock); + +err_grow: + sgx_encl_free_epc_page(epc_page); + +err_alloc_epc_page: + kfree(encl_page); + + return VM_FAULT_SIGBUS; +} + +/* + * Add new pages to the enclave sequentially with ENCLS[EAUG]. Note that + * sgx_mmap() validates that the given VMA is within the enclave range. Calling + * here sgx_encl_may_map() second time would too time consuming. + */ +static int sgx_populate(struct file *file, struct vm_area_struct *vma) +{ + unsigned long length = vma->vm_end - vma->vm_start; + struct sgx_encl *encl = file->private_data; + unsigned long start = encl->base - vma->vm_start; + unsigned long pos; + int ret; + + /* EAUG works only for initialized enclaves. */ + if (!test_bit(SGX_ENCL_INITIALIZED, &encl->flags)) + return -EINVAL; + + for (pos = 0 ; pos < length; pos += PAGE_SIZE) { + if (signal_pending(current)) { + if (!pos) + ret = -ERESTARTSYS; + + break; + } + + if (need_resched()) + cond_resched(); + + ret = sgx_encl_augment_page(encl, start + pos); + if (ret) + break; + } + + return ret; +} + static unsigned long sgx_get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, @@ -133,6 +261,7 @@ static const struct file_operations sgx_encl_fops = { .compat_ioctl = sgx_compat_ioctl, #endif .mmap = sgx_mmap, + .populate = sgx_populate, .get_unmapped_area = sgx_get_unmapped_area, };

-- 2.35.1

David Laight

8:30 a.m.

From: Jarkko Sakkinen

...

Sent: 06 March 2022 05:32

For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow to use that for initializing the device memory by providing a new callback f_ops->populate() for the purpose.

SGX patches are provided to show the callback in context.

An obvious alternative is a ioctl but it is less elegant and requires two syscalls (mmap + ioctl) per memory range, instead of just one (mmap).

Is this all about trying to stop the vm_operations_struct.fault() function being called?

It is pretty easy to ensure the mappings are setup in the driver's mmap() function. Then the fault() function can just return -VM_FAULT_SIGBUS;

If it is actually device memory you just need to call vm_iomap_memory() That quite nicely mmap()s PCIe memory space into a user process.

Mapping driver memory is slightly more difficult. For buffers allocated with dma_alloc_coherent() you can probably use dma_mmap_coherent(). But I have a loop calling remap_pfn_range() because the buffer area is made of multiple 16kB kernel buffers that need to be mapped to contiguous user pages.

David

- Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)

'Jarkko Sakkinen'

4:52 p.m.

On Sun, Mar 06, 2022 at 08:30:14AM +0000, David Laight wrote:

...

From: Jarkko Sakkinen

...
Sent: 06 March 2022 05:32

For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow to use that for initializing the device memory by providing a new callback f_ops->populate() for the purpose.

SGX patches are provided to show the callback in context.

An obvious alternative is a ioctl but it is less elegant and requires two syscalls (mmap + ioctl) per memory range, instead of just one (mmap).

Is this all about trying to stop the vm_operations_struct.fault() function being called?

In SGX protected memory is actually encrypted normal memory and CPU access control semantics (marked as reserved, e.g. struct page's).

In SGX you need call ENCLS[EAUG] outside the protected memory to add new pages to the protected memory. Then when CPU is executing inside this protected memory, also known as enclaves, it commits the memory as part of the enclave either with ENCLU[EACCEPT] or ENCLU[EACCEPTCOPY].

So the point is not prevent page faults but to prepare the memory for pending state so that the enclave can then accept them without round-trips, and in some cases thus improve performance (in our case in enarx.dev platform that we are developing).

In fact, #PF handler in SGX driver in the current SGX2 patch set also does EAUG on-demand. Optimal is to have both routes available. And said, this can be of course also implemented as ioctl.

BR, Jarkko

Matthew Wilcox

11:33 a.m.

On Sun, Mar 06, 2022 at 07:32:04AM +0200, Jarkko Sakkinen wrote:

...

For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow to use that for initializing the device memory by providing a new callback f_ops->populate() for the purpose.

As I said, NAK.

David Hildenbrand

7 Mar 7 Mar

10:12 a.m.

On 06.03.22 06:32, Jarkko Sakkinen wrote:

...

For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow to use that for initializing the device memory by providing a new callback f_ops->populate() for the purpose.

SGX patches are provided to show the callback in context.

An obvious alternative is a ioctl but it is less elegant and requires two syscalls (mmap + ioctl) per memory range, instead of just one (mmap).

What about extending MADV_POPULATE_READ | MADV_POPULATE_WRITE to support VM_IO | VM_PFNMAP (as well?) ?

-- Thanks, David / dhildenb

Jarkko Sakkinen

2:22 p.m.

On Mon, Mar 07, 2022 at 11:12:44AM +0100, David Hildenbrand wrote:

...

On 06.03.22 06:32, Jarkko Sakkinen wrote:

...
For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow to use that for initializing the device memory by providing a new callback f_ops->populate() for the purpose.

SGX patches are provided to show the callback in context.

An obvious alternative is a ioctl but it is less elegant and requires two syscalls (mmap + ioctl) per memory range, instead of just one (mmap).

What about extending MADV_POPULATE_READ | MADV_POPULATE_WRITE to support VM_IO | VM_PFNMAP (as well?) ?

What would be a proper point to bind that behaviour? For mmap/mprotect it'd be probably populate_vma_page_range() because that would span both mmap() and mprotect() (Dave's suggestion in this thread).

For MAP_POPULATE I did not have hard proof to show that it would be used by other drivers but for madvice() you can find at least a few ioctl based implementations:

$ git grep -e madv --and ( -e ioc ) drivers/ drivers/gpu/drm/i915/gem/i915_gem_ioctls.h:int i915_gem_madvise_ioctl(struct drm_device *dev, void *data, drivers/gpu/drm/i915/i915_driver.c: DRM_IOCTL_DEF_DRV(I915_GEM_MADVISE, i915_gem_madvise_ioctl, DRM_RENDER_ALLOW), drivers/gpu/drm/i915/i915_gem.c:i915_gem_madvise_ioctl(struct drm_device *dev, void *data, drivers/gpu/drm/msm/msm_drv.c:static int msm_ioctl_gem_madvise(struct drm_device *dev, void *data, drivers/gpu/drm/msm/msm_drv.c: DRM_IOCTL_DEF_DRV(MSM_GEM_MADVISE, msm_ioctl_gem_madvise, DRM_RENDER_ALLOW), drivers/gpu/drm/panfrost/panfrost_drv.c:static int panfrost_ioctl_madvise(struct drm_device *dev, void *data, drivers/gpu/drm/vc4/vc4_drv.c: DRM_IOCTL_DEF_DRV(VC4_GEM_MADVISE, vc4_gem_madvise_ioctl, DRM_RENDER_ALLOW), drivers/gpu/drm/vc4/vc4_drv.h:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data, drivers/gpu/drm/vc4/vc4_gem.c:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data,

IMHO this also provides supportive claim for MAP_POPULATE, and yeah, I agree that to be consistent implementation, both madvice() and MAP_POPULATE should work.

...

-- Thanks,

David / dhildenb

BR, Jarkko

David Hildenbrand

2:33 p.m.

On 07.03.22 15:22, Jarkko Sakkinen wrote:

...

On Mon, Mar 07, 2022 at 11:12:44AM +0100, David Hildenbrand wrote:

...
On 06.03.22 06:32, Jarkko Sakkinen wrote:

...
For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow to use that for initializing the device memory by providing a new callback f_ops->populate() for the purpose.

SGX patches are provided to show the callback in context.

An obvious alternative is a ioctl but it is less elegant and requires two syscalls (mmap + ioctl) per memory range, instead of just one (mmap).

What about extending MADV_POPULATE_READ | MADV_POPULATE_WRITE to support VM_IO | VM_PFNMAP (as well?) ?

What would be a proper point to bind that behaviour? For mmap/mprotect it'd be probably populate_vma_page_range() because that would span both mmap() and mprotect() (Dave's suggestion in this thread).

MADV_POPULATE_* ends up in faultin_vma_page_range(), right next to populate_vma_page_range(). So it might require a similar way to hook into the driver I guess.

...

For MAP_POPULATE I did not have hard proof to show that it would be used by other drivers but for madvice() you can find at least a few ioctl based implementations:

$ git grep -e madv --and ( -e ioc ) drivers/ drivers/gpu/drm/i915/gem/i915_gem_ioctls.h:int i915_gem_madvise_ioctl(struct drm_device *dev, void *data, drivers/gpu/drm/i915/i915_driver.c: DRM_IOCTL_DEF_DRV(I915_GEM_MADVISE, i915_gem_madvise_ioctl, DRM_RENDER_ALLOW), drivers/gpu/drm/i915/i915_gem.c:i915_gem_madvise_ioctl(struct drm_device *dev, void *data, drivers/gpu/drm/msm/msm_drv.c:static int msm_ioctl_gem_madvise(struct drm_device *dev, void *data, drivers/gpu/drm/msm/msm_drv.c: DRM_IOCTL_DEF_DRV(MSM_GEM_MADVISE, msm_ioctl_gem_madvise, DRM_RENDER_ALLOW), drivers/gpu/drm/panfrost/panfrost_drv.c:static int panfrost_ioctl_madvise(struct drm_device *dev, void *data, drivers/gpu/drm/vc4/vc4_drv.c: DRM_IOCTL_DEF_DRV(VC4_GEM_MADVISE, vc4_gem_madvise_ioctl, DRM_RENDER_ALLOW), drivers/gpu/drm/vc4/vc4_drv.h:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data, drivers/gpu/drm/vc4/vc4_gem.c:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data,

IMHO this also provides supportive claim for MAP_POPULATE, and yeah, I agree that to be consistent implementation, both madvice() and MAP_POPULATE should work.

MADV_POPULATE_WRITE + MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE is one way to dynamically manage memory consumption inside a sparse memory mapping (preallocate/populate via MADV_POPULATE_WRITE, discard via MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE). Extending that whole mechanism to deal with VM_IO | VM_PFNMAP mappings as well could be interesting.

At least I herd about some ideas where we might want to dynamically expose memory to a VM (via virtio-mem) inside a sparse memory mapping, and the memory in that sparse memory mapping is provided from a dedicated memory pool managed by a device driver -- not just using ordinary anonymous/file/hugetlb memory as we do right now.

Now, this is certainly stuff for the future, I just wanted to mention it.

-- Thanks, David / dhildenb

Jarkko Sakkinen

3:49 p.m.

On Mon, Mar 07, 2022 at 03:33:52PM +0100, David Hildenbrand wrote:

...

On 07.03.22 15:22, Jarkko Sakkinen wrote:

...
On Mon, Mar 07, 2022 at 11:12:44AM +0100, David Hildenbrand wrote:

...
On 06.03.22 06:32, Jarkko Sakkinen wrote:

...
For device memory (aka VM_IO | VM_PFNMAP) MAP_POPULATE does nothing. Allow to use that for initializing the device memory by providing a new callback f_ops->populate() for the purpose.

SGX patches are provided to show the callback in context.

An obvious alternative is a ioctl but it is less elegant and requires two syscalls (mmap + ioctl) per memory range, instead of just one (mmap).

What about extending MADV_POPULATE_READ | MADV_POPULATE_WRITE to support VM_IO | VM_PFNMAP (as well?) ?

What would be a proper point to bind that behaviour? For mmap/mprotect it'd be probably populate_vma_page_range() because that would span both mmap() and mprotect() (Dave's suggestion in this thread).

MADV_POPULATE_* ends up in faultin_vma_page_range(), right next to populate_vma_page_range(). So it might require a similar way to hook into the driver I guess.

...
For MAP_POPULATE I did not have hard proof to show that it would be used by other drivers but for madvice() you can find at least a few ioctl based implementations:

$ git grep -e madv --and ( -e ioc ) drivers/ drivers/gpu/drm/i915/gem/i915_gem_ioctls.h:int i915_gem_madvise_ioctl(struct drm_device *dev, void *data, drivers/gpu/drm/i915/i915_driver.c: DRM_IOCTL_DEF_DRV(I915_GEM_MADVISE, i915_gem_madvise_ioctl, DRM_RENDER_ALLOW), drivers/gpu/drm/i915/i915_gem.c:i915_gem_madvise_ioctl(struct drm_device *dev, void *data, drivers/gpu/drm/msm/msm_drv.c:static int msm_ioctl_gem_madvise(struct drm_device *dev, void *data, drivers/gpu/drm/msm/msm_drv.c: DRM_IOCTL_DEF_DRV(MSM_GEM_MADVISE, msm_ioctl_gem_madvise, DRM_RENDER_ALLOW), drivers/gpu/drm/panfrost/panfrost_drv.c:static int panfrost_ioctl_madvise(struct drm_device *dev, void *data, drivers/gpu/drm/vc4/vc4_drv.c: DRM_IOCTL_DEF_DRV(VC4_GEM_MADVISE, vc4_gem_madvise_ioctl, DRM_RENDER_ALLOW), drivers/gpu/drm/vc4/vc4_drv.h:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data, drivers/gpu/drm/vc4/vc4_gem.c:int vc4_gem_madvise_ioctl(struct drm_device *dev, void *data,

IMHO this also provides supportive claim for MAP_POPULATE, and yeah, I agree that to be consistent implementation, both madvice() and MAP_POPULATE should work.

MADV_POPULATE_WRITE + MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE is one way to dynamically manage memory consumption inside a sparse memory mapping (preallocate/populate via MADV_POPULATE_WRITE, discard via MADV_DONTNEED/FALLOC_FL_PUNCH_HOLE). Extending that whole mechanism to deal with VM_IO | VM_PFNMAP mappings as well could be interesting.

At least I herd about some ideas where we might want to dynamically expose memory to a VM (via virtio-mem) inside a sparse memory mapping, and the memory in that sparse memory mapping is provided from a dedicated memory pool managed by a device driver -- not just using ordinary anonymous/file/hugetlb memory as we do right now.

Now, this is certainly stuff for the future, I just wanted to mention it.

For SGX purposes I'm now studying the possibly to use ra_state to get idea where do "prefetching" (EAUG's) in batches, as it is something that would not require any intrusive changes to mm but thank you for sharing this. Looking into implementing this properly is the 2nd option, if that does not work out.

...

-- Thanks,

David / dhildenb

BR, Jarkko

1150

Age (days ago)

1151

Last active (days ago)

dri-devel@lists.freedesktop.org

16 comments

6 participants

tags (0)

participants (6)

'Jarkko Sakkinen'
David Hildenbrand
David Laight
Greg Kroah-Hartman
Jarkko Sakkinen
Matthew Wilcox